bcachefs.git - Unnamed repository; edit this file 'description' to name the repository.

Age	Commit message (Collapse)	Author
2022-11-09	xfs: check btree keys reflect the child block	Darrick J. Wong
	When scrub is checking a non-root btree block, it should make sure that the keys in the parent btree block accurately capture the keyspace that the child block stores. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-11-09	xfs: detect unwritten bit set in rmapbt node block keys	Darrick J. Wong
	In the last patch, we changed the rmapbt code to remove the UNWRITTEN bit when creating an rmapbt key from an rmapbt record, and we changed the rmapbt key comparison code to start considering the ATTR and BMBT flags during lookup. This brought the behavior of the rmapbt implementation in line with its specification. However, there may exist filesystems that have the unwritten bit still set in the rmapbt keys. We should detect these situations and flag the rmapbt as one that would benefit from optimization. Eventually, online repair will be able to do something in response to this. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: fix rm_offset flag handling in rmap keys	Darrick J. Wong
	Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, offset) This provides users the ability to look up a reverse mapping from a file block mapping record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. Unfortunately, the code that creates rmap lookup keys from rmap records forgot to mask off the record attribute flags, leading to ondisk keys that look like this: (physical block, owner, fork, is_btree, unwritten state, offset) Fortunately, this has all worked ok for the past six years because the key comparison functions incorrectly ignore the fork/bmbt/unwritten information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) Queries can (theoretically) return incorrect results because of this omission. On consistent filesystems this isn't an issue because xattr and bmbt blocks cannot be shared and hence the comparisons succeed purely on the contents of the rm_startblock field. For the one case where we support sharing (written data fork blocks) all flag bits are zero, so the omission in the comparison has no ill effects. Unfortunately, this bug prevents scrub from detecting incorrect fork and bmbt flag bits in the rmap btree, so we really do need to fix the compare code. Old filesystems with the unwritten bit erroneously set in the rmap key struct will work fine on new kernels since we still ignore the unwritten bit. New filesystems on older kernels will work fine since the old kernels never paid attention to the unwritten bit. A previous version of this patch forgot to keep the (un)written state flag masked during the comparison and caused a major regression in 5.9.x since unwritten extent conversion can update an rmap record without requiring key updates. Note that blocks cannot go directly from data fork to attr fork without being deallocated and reallocated, nor can they be added to or removed from a bmbt without a free/alloc cycle, so this should not cause any regressions. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: hoist inode record alignment checks from scrubbtree-hoist-scrub-checks_2022-11-09	Darrick J. Wong
	Move the inobt record alignment checks from xchk_iallocbt_rec into xfs_inobt_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: hoist rmap record flag checks from scrub	Darrick J. Wong
	Move the rmap record flag checks from xchk_rmapbt_rec into xfs_rmap_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: hoist rmap record flag checks from scrub	Darrick J. Wong
	Move the rmap record flag checks from xchk_rmapbt_rec into xfs_rmap_check_irec so that they are applied everywhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: complain about bad file mapping records in the ondisk bmbtbtree-complain-bad-records_2022-11-09	Darrick J. Wong
	Similar to what we've just done for the other btrees, create a function to log corrupt bmbt records and call it whenever we encounter a bad record in the ondisk btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: complain about bad records in query_range helpers	Darrick J. Wong
	For every btree type except for the bmbt, refactor the code that complains about bad records into a helper and make the ->query_range helpers call it so that corruptions found via that avenue are logged. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: standardize ondisk to incore conversion for bmap btrees	Darrick J. Wong
	Fix all xfs_bmbt_disk_get_all callsites to call xfs_bmap_validate_extent and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: standardize ondisk to incore conversion for rmap btrees	Darrick J. Wong
	Create a xfs_rmap_check_irec function to detect corruption in btree records. Fix all xfs_rmap_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: return a failure address from xfs_rmap_irec_offset_unpack	Darrick J. Wong
	Currently, xfs_rmap_irec_offset_unpack returns only 0 or -EFSCORRUPTED. Change this function to return the code address of a failed conversion in preparation for the next patch, which standardizes localized record checking and reporting code. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: standardize ondisk to incore conversion for refcount btrees	Darrick J. Wong
	Create a xfs_refcount_check_irec function to detect corruption in btree records. Fix all xfs_refcount_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: standardize ondisk to incore conversion for inode btrees	Darrick J. Wong
	Create a xfs_inobt_check_irec function to detect corruption in btree records. Fix all xfs_inobt_btrec_to_irec callsites to call the new helper and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: standardize ondisk to incore conversion for free space btrees	Darrick J. Wong
	Create a xfs_alloc_btrec_to_irec function to convert an ondisk record to an incore record, and a xfs_alloc_check_irec function to detect corruption. Replace all the open-coded logic with calls to the new helpers and bubble up corruption reports. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: scrub should use ECHRNG to signal that the drain is neededscrub-drain-intents_2022-11-09	Darrick J. Wong
	In the previous patch, we added jump labels to the intent drain code so that regular filesystem operations need not pay the price of checking for someone (scrub) waiting on intents to drain from some part of the filesystem when that someone isn't running. However, I observed that xfs/285 now spends a lot more time pushing the AIL from the inode btree scrubber than it used to. This is because the inobt scrubber will try push the AIL to try to get logged inode cores written to the filesystem when it sees a weird discrepancy between the ondisk inode and the inobt records. This AIL push is triggered when the setup function sees TRY_HARDER is set; and the requisite EDEADLOCK return is initiated when the discrepancy is seen. The solution to this performance slow down is to use a different result code (ECHRNG) for scrub code to signal that it needs to wait for deferred intent work items to drain out of some part of the filesystem. When this happens, set a new scrub state flag (XCHK_NEED_DRAIN) so that setup functions will activate the jump label. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: minimize overhead of drain wakeups by using jump labels	Darrick J. Wong
	To reduce the runtime overhead even further when online fsck isn't running, use a static branch key to decide if we call wake_up on the drain. For compilers that support jump labels, the call to wake_up is replaced by a nop sled when nobody is waiting for intents to drain. From my initial microbenchmarking, every transition of the static key between the on and off states takes about 22000ns to complete; this is paid entirely by the xfs_scrub process. When the static key is off (which it should be when fsck isn't running), the nop sled adds an overhead of approximately 0.36ns to runtime code. For the few compilers that don't support jump labels, runtime code pays the cost of calling wake_up on an empty waitqueue, which was observed to be about 30ns. However, most architectures that have sufficient memory and CPU capacity to run XFS also support jump labels, so this is not much of a worry. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: clean up scrub context if scrub setup returns -EDEADLOCK	Darrick J. Wong
	It has been a longstanding convention that online scrub and repair functions can return -EDEADLOCK to signal that they weren't able to obtain some necessary resource. When this happens, the scrub framework is supposed to release all resources attached to the scrub context, set the TRY_HARDER flag in the scrub context flags, and try again. In this context, individual scrub functions are supposed to take all the resources they (incorrectly) speculated were not necessary. We're about to make it so that the functions that lock and wait for a filesystem AG can also return EDEADLOCK to signal that we need to try again with the drain waiters enabled. Therefore, refactor xfs_scrub_metadata to support this behavior for ->setup() functions. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: allow queued AG intents to drain before scrubbing	Darrick J. Wong
	When a writer thread executes a chain of log intent items, the AG header buffer locks will cycle during a transaction roll to get from one intent item to the next in a chain. Although scrub takes all AG header buffer locks, this isn't sufficient to guard against scrub checking an AG while that writer thread is in the middle of finishing a chain because there's no higher level locking primitive guarding allocation groups. When there's a collision, cross-referencing between data structures (e.g. rmapbt and refcountbt) yields false corruption events; if repair is running, this results in incorrect repairs, which is catastrophic. Fix this by adding to the perag structure the count of active intents and make scrub wait until it has both AG header buffer locks and the intent counter reaches zero. One quirk of the drain code is that deferred bmap updates also bump and drop the intent counter. A fundamental decision made during the design phase of the reverse mapping feature is that updates to the rmapbt records are always made by the same code that updates the primary metadata. In other words, callers of bmapi functions expect that the bmapi functions will queue deferred rmap updates. Some parts of the reflink code queue deferred refcount (CUI) and bmap (BUI) updates in the same head transaction, but the deferred work manager completely finishes the CUI before the BUI work is started. As a result, the CUI drops the intent count long before the deferred rmap (RUI) update even has a chance to bump the intent count. The only way to keep the intent count elevated between the CUI and RUI is for the BUI to bump the counter until the RUI has been created. A second quirk of the intent drain code is that deferred work items must increment the intent counter as soon as the work item is added to the transaction. When a BUI completes and queues an RUI, the RUI must increment the counter before the BUI decrements it. The only way to accomplish this is to require that the counter be bumped as soon as the deferred work item is created in memory. In the next patches we'll improve on this facility, but this patch provides the basic functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: add a tracepoint to report incorrect extent refcounts	Darrick J. Wong
	Add a new tracepoint so that I can see exactly what and where we failed the refcount check. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: create a function to duplicate an active perag referencepass-perag-refs_2022-11-09	Darrick J. Wong
	There a few object constructor functions throughout XFS where a caller provides an active perag reference and the constructor wants to give the new object its own active reference. Replace the open-coded logic with a common function to do this instead of open-coding atomic_inc logic. This new function adds a few safeguards -- it checks that there's at least one active reference to the perag structure passed in, and it records the refcount bump in the ftrace information. This makes it much easier to debug refcounting problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: give xfs_refcount_intent its own perag referenceintents-perag-refs_2022-11-09	Darrick J. Wong
	Give the xfs_refcount_intent an active reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Later, shrink will use these active references to know if an AG is quiesced or not. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: give xfs_rmap_intent its own perag reference	Darrick J. Wong
	Give the xfs_rmap_intent an active reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Later, shrink will use these active references to know if an AG is quiesced or not. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: give xfs_extfree_intent its own perag reference	Darrick J. Wong
	Give the xfs_extfree_intent an active reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Later, shrink will use these active references to know if an AG is quiesced or not. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: pass per-ag references to xfs_free_extent	Darrick J. Wong
	Pass a reference to the per-AG structure to xfs_free_extent. Most callers already have one, so we can eliminate unnecessary lookups. The one exception to this is the EFI code, which the next patch will fix. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: give xfs_bmap_intent its own perag reference	Darrick J. Wong
	Give the xfs_bmap_intent an active reference to the perag structure data. This reference will be used to enable scrub intent draining functionality in subsequent patches. Later, shrink will use these active references to know if an AG is quiesced or not. The reason why we take an active ref for a file mapping operation is simple: we're committing to some sort of action involving space in an AG, so we want to indicate our active interest in that AG. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: fix confusing variable names in xfs_refcount_item.cintents-naming-cleanups_2022-11-09	Darrick J. Wong
	Variable names in this code module are inconsistent and confusing. xfs_phys_extent describe physical mappings, so rename them "pmap". xfs_refcount_intents describe refcount intents, so rename them "ri". Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: pass refcount intent directly through the log intent code	Darrick J. Wong
	Pass the incore refcount intent through the CUI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: fix confusing variable names in xfs_rmap_item.c	Darrick J. Wong
	Variable names in this code module are inconsistent and confusing. xfs_map_extent describe file mappings, so rename them "map". xfs_rmap_intents describe block mapping intents, so rename them "ri". Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: pass rmap space mapping directly through the log intent code	Darrick J. Wong
	Pass the incore rmap space mapping through the RUI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: fix confusing xfs_extent_item variable names	Darrick J. Wong
	Change the name of all pointers to xfs_extent_item structures to "xefi" to make the name consistent and because the current selections ("new" and "free") mean other things in C. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: pass xfs_extent_free_item directly through the log intent code	Darrick J. Wong
	Pass the incore xfs_extent_free_item through the EFI logging code instead of repeatedly boxing and unboxing parameters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: fix confusing variable names in xfs_bmap_item.c	Darrick J. Wong
	Variable names in this code module are inconsistent and confusing. xfs_map_extent describe file mappings, so rename them "map". xfs_bmap_intents describe block mapping intents, so rename them "bi". Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: pass the xfs_bmbt_irec directly through the log intent code	Darrick J. Wong
	Instead of repeatedly boxing and unboxing the incore extent mapping structure as it passes through the BUI code, pass the pointer directly through. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: don't return -EFSCORRUPTED from repair when resources cannot be grabbedxfs-6.2-fixes_2022-11-09	Darrick J. Wong
	If we tried to repair something but the repair failed with -EDEADLOCK, that means that the repair function couldn't grab some resource it needed and wants us to try again. If we try again (with TRY_HARDER) but still can't get all the resources we need, the repair fails and errors remain on the filesystem. Right now, repair returns the -EDEADLOCK to the caller as -EFSCORRUPTED, which results in XFS_SCRUB_OFLAG_CORRUPT being passed out to userspace. This is not correct because repair has not determined that anything is corrupt. If the repair had been invoked on an object that could be optimized but wasn't corrupt (OFLAG_PREEN), the inability to grab resources will be reported to userspace as corrupt metadata, and users will be unnecessarily alarmed that their suboptimal metadata turned into a corruption. Fix this by returning zero so that the results of the actual scrub will be copied back out to userspace. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: shut up -Wuninitialized in xfsaild_push	Darrick J. Wong
	-Wuninitialized complains about @target in xfsaild_push being uninitialized in the case where the waitqueue is active but there is no last item in the AIL to wait for. I /think/ it should never be the case that the subsequent xfs_trans_ail_cursor_first returns a log item and hence we'll never end up at XFS_LSN_CMP, but let's make this explicit. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: use memcpy, not strncpy, to format the attr prefix during listxattr	Darrick J. Wong
	When -Wstringop-truncation is enabled, the compiler complains about truncation of the null byte at the end of the xattr name prefix. This is intentional, since we're concatenating the two strings together and do _not_ want a null byte in the middle of the name. We've already ensured that the name buffer is long enough to handle prefix and name, and the prefix_len is supposed to be the length of the prefix string without the null byte, so use memcpy here instead. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document future directions of online fsckonline-fsck-design_2022-11-09	Darrick J. Wong
	Add the seventh and final chapter of the online fsck documentation, where we talk about future functionality that can tie in with the functionality provided by the online fsck patchset. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document the userspace fsck driver program	Darrick J. Wong
	Add the sixth chapter of the online fsck design documentation, where we discuss the details of the data structures and algorithms used by the driver program xfs_scrub. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document directory tree repairs	Darrick J. Wong
	Directory tree repairs are the least complete part of online fsck, due to the lack of directory parent pointers. However, even without that feature, we can still make some corrections to the directory tree -- we can salvage as many directory entries as we can from a damaged directory, and we can reattach orphaned inodes to the lost+found, just as xfs_repair does now. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document metadata file repair	Darrick J. Wong
	File-based metadata (such as xattrs and directories) can be extremely large. To reduce the memory requirements and maximize code reuse, it is very convenient to create a temporary file, use the regular dir/attr code to store salvaged information, and then atomically swap the extents between the file being repaired and the temporary file. Record the high level concepts behind how temporary files and atomic content swapping should work, and then present some case studies of what the actual repair functions do. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document full filesystem scans for online fsck	Darrick J. Wong
	Certain parts of the online fsck code need to scan every file in the entire filesystem. It is not acceptable to block the entire filesystem while this happens, which means that we need to be clever in allowing scans to coordinate with ongoing filesystem updates. We also need to hook the filesystem so that regular updates propagate to the staging records. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document online file metadata repair code	Darrick J. Wong
	Add to the fifth chapter of the online fsck design documentation, where we discuss the details of the data structures and algorithms used by the kernel to repair file metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document btree bulk loading	Darrick J. Wong
	Add a discussion of the btree bulk loading code, which makes it easy to take an in-memory recordset and write it out to disk in an efficient manner. This also enables atomic switchover from the old to the new structure with minimal potential for leaking the old blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document pageable kernel memory	Darrick J. Wong
	Add a discussion of pageable kernel memory, since online fsck needs quite a bit more memory than most other parts of the filesystem to stage records and other information. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document how online fsck deals with eventual consistency	Darrick J. Wong
	Writes to an XFS filesystem employ an eventual consistency update model to break up complex multistep metadata updates into small chained transactions. This is generally good for performance and scalability because XFS doesn't need to prepare for enormous transactions, but it also means that online fsck must be careful not to attempt a fsck action unless it can be shown that there are no other threads processing a transaction chain. This part of the design documentation covers the thinking behind the consistency model and how scrub deals with it. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document the filesystem metadata checking strategy	Darrick J. Wong
	Begin the fifth chapter of the online fsck design documentation, where we discuss the details of the data structures and algorithms used by the kernel to examine filesystem metadata and cross-reference it around the filesystem. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document the user interface for online fsck	Darrick J. Wong
	Start the fourth chapter of the online fsck design documentation, which discusses the user interface and the background scrubbing service. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document the testing plan for online fsck	Darrick J. Wong
	Start the third chapter of the online fsck design documentation. This covers the testing plan to make sure that both online and offline fsck can detect arbitrary problems and correct them without making things worse. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document the general theory underlying online fsck design	Darrick J. Wong
	Start the second chapter of the online fsck design documentation. This covers the general theory underlying how online fsck works. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09	xfs: document the motivation for online fsck design	Darrick J. Wong
	Start the first chapter of the online fsck design documentation. This covers the motivations for creating this in the first place. Signed-off-by: Darrick J. Wong <djwong@kernel.org>