summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-10-14xfs: zap broken inode forksDarrick J. Wong
Determine if inode fork damage is responsible for the inode being unable to pass the ifork verifiers in xfs_iget and zap the fork contents if this is true. Once this is done the fork will be empty but we'll be able to construct an in-core inode, and a subsequent call to the inode fork repair ioctl will search the rmapbt to rebuild the records that were in the fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: repair inode recordsDarrick J. Wong
If an inode is so badly damaged that it cannot be loaded into the cache, fix the ondisk metadata and try again. If there /is/ a cached inode, fix any problems and apply any optimizations that can be solved incore. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: try to attach dquots to files before repairing themDarrick J. Wong
Soon, we will be adding the ability to repair inodes. Inode resource usage is tracked in quota, which means that if we think we might have to repair a file, we ought to attach dquots from the start. Do this before we take the file's ILOCK, though we don't require success here because quota itself could also be in need of repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: disable online repair quota helpers when quota not enabledDarrick J. Wong
Don't compile the quota helper functions if quota isn't being built into the XFS module. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: repair refcount btreesrepair-ag-btrees_2022-10-14Darrick J. Wong
Reconstruct the refcount data from the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: repair inode btreesDarrick J. Wong
Use the rmapbt to find inode chunks, query the chunks to compute hole and free masks, and with that information rebuild the inobt and finobt. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: rewrite xfs_icache_inode_is_allocatedDarrick J. Wong
Back in the mists of time[1], I proposed this function to assist the inode btree scrubbers in checking the inode btree contents against the allocation state of the inode records. The original version performed a direct lookup in the inode cache and returned the allocation status if the cached inode hadn't been reused and wasn't in an intermediate state. Brian thought it would be better to use the usual iget/irele mechanisms, so that was changed for the final version. Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has one user and clutters up the regular iget path, which makes it hard to reason about how it actually works. Worse yet, the inode inactivation series silently broke it because iget won't return inodes that are anywhere in the inactivation machinery, even though the caller is already required to prevent inode allocation and freeing. Inodes in the inactivation machinery are still allocated, but the current code's interactions with the iget code prevent us from being able to say that. Now that I understand the inode lifecycle better than I did in early 2017, I now realize that as long as the cached inode hasn't been reused and isn't actively being reclaimed, it's safe to access the i_mode field (with the AGI, rcu, and i_flags locks held), and we don't need to worry about the inode being freed out from under us. Therefore, port the original version to modern code structure, which fixes the brokennes w.r.t. inactivation. In the next patch we'll remove IGET_INCORE since it's no longer necessary. [1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/ Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: repair free space btreesDarrick J. Wong
Rebuild the free space btrees from the gaps in the rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: clear pagf_agflreset when repairing the AGFLDarrick J. Wong
Clear the pagf_agflreset flag when we're repairing the AGFL because we fix all the same padding problems that xfs_agfl_reset does. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: allow userspace to rebuild metadata structuresrepair-force-rebuild_2022-10-14Darrick J. Wong
Add a new (superuser-only) flag to the online metadata repair ioctl to force it to rebuild structures, even if they're not broken. We will use this to move metadata structures out of the way during a free space defragmentation operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: don't complain about unfixed metadata when repairs were injectedDarrick J. Wong
While debugging other parts of online repair, I noticed that if someone injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of metadata, and the metadata repair fails, we'll log a message about uncorrected errors in the filesystem. This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT and we're only doing the repair because the error injection knob is set. Repair functions are allowed to abort the entire operation at any point before committing new metadata, in which case the piece of metadata is in the same state as it was before. Therefore, the log message should be gated on the results of the scrub. Refactor the predicate and rearrange the code flow to make this happen. Note: If the repair function errors out after it commits the new metadata, the transaction cancellation will shut down the filesystem, which is an obvious sign of corrupt metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: allow the user to cancel repairs before we start writingrepair-tweaks_2022-10-14Darrick J. Wong
All online repair functions have the same structure: walk filesystem metadata structures gathering enough data to rebuild the structure, stage a new copy, and then commit the new copy. The gathering steps do not write anything to disk, so they are peppered with xchk_should_terminate calls to avoid softlockup warnings and to provide an opportunity to abort the repair (by killing xfs_scrub). However, it's not clear in the code base when is the last chance to abort cleanly without having to undo a bunch of structure. Therefore, add one more call to xchk_should_terminate (along with a comment) providing the sysadmin with the ability to abort before it's too late and to make it clear in the source code when it's no longer convenient or safe to abort a repair. As there are only four repair functions right now, this patch exists more to establish a precedent for subsequent additions than to deliver practical functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: always rescan allegedly healthy per-ag metadata after repairDarrick J. Wong
After an online repair function runs for a per-AG metadata structure, sc->sick_mask is supposed to reflect the per-AG metadata that the repair function fixed. Our next move is to re-check the metadata to assess the completeness of our repair, so we don't want the rebuilt structure to be excluded from the rescan just because the health system previously logged a problem with the data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: implement online scrubbing of rtsummary infoscrub-rtsummary_2022-10-14Darrick J. Wong
Finish the realtime summary scrubber by adding the functions we need to compute a fresh copy of the rtsummary info and comparing it to the copy on disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: move the realtime summary file scrubber to a separate source fileDarrick J. Wong
Move the realtime summary file checking code to a separate file in preparation to actually implement it. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: wrap ilock/iunlock operations on sc->ipDarrick J. Wong
Scrub tracks the resources that it's holding onto in the xfs_scrub structure. This includes the inode being checked (if applicable) and the inode lock state of that inode. Replace the open-coded structure manipulation with a trivial helper to eliminate sources of error. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: get our own reference to inodes that we want to scrubDarrick J. Wong
When we want to scrub a file, get our own reference to the inode unconditionally. This will make disposal rules simpler in the long run. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: improve xfarray quicksort pivotbig-array_2022-10-14Darrick J. Wong
Now that we have the means to do insertion sorts of small in-memory subsets of an xfarray, use it to improve the quicksort pivot algorithm by reading 7 records into memory and finding the median of that. This should prevent bad partitioning when a[lo] and a[hi] end up next to each other in the final sort, which can happen when sorting for cntbt repair when the free space is extremely fragmented (e.g. generic/176). This doesn't speed up the average quicksort run by much, but it will (hopefully) avoid the quadratic time collapse for which quicksort is famous. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cache pages used for xfarray quicksort convergenceDarrick J. Wong
After quicksort picks a pivot item for a particular subsort, it walks the records in that subset from the outside in, rearranging them so that every record less than the pivot comes before it, and every record greater than the pivot comes after it. This scan has a lot of locality, so we can speed it up quite a bit by grabbing the xfile backing page and holding onto it as long as we possibly can. Doing so reduces the runtime by another 5% on the author's computer. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: speed up xfarray sort by sorting xfile page contents directlyDarrick J. Wong
If all the records in an xfarray subset live within the same memory page, we can short-circuit even more quicksort recursion by mapping that page into the local CPU and using the kernel's heapsort function to sort the subset. On the author's computer, this reduces the runtime by another 15% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: teach xfile to pass back direct-map pages to callerDarrick J. Wong
Certain xfile array operations (such as sorting) can be sped up quite a bit by allowing xfile users to grab a page to bulk-read the records contained within it. Create helper methods to facilitate this. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: convert xfarray insertion sort to heapsort using scratchpad memoryDarrick J. Wong
In the previous patch, we created a very basic quicksort implementation for xfile arrays. While the use of an alternate sorting algorithm to avoid quicksort recursion on very small subsets reduces the runtime modestly, we could do better than a load and store-heavy insertion sort, particularly since each load and store requires a page mapping lookup in the xfile. For a small increase in kernel memory requirements, we could instead bulk load the xfarray records into memory, use the kernel's existing heapsort implementation to sort the records, and bulk store the memory buffer back into the xfile. On the author's computer, this reduces the runtime by about 5% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: enable sorting of xfile-backed arraysDarrick J. Wong
The btree bulk loading code requires that records be provided in the correct record sort order for the given btree type. In general, repair code cannot be required to collect records in order, and it is not feasible to insert new records in the middle of an array to maintain sort order. Implement a sorting algorithm so that we can sort the records just prior to bulk loading. In principle, an xfarray could consume many gigabytes of memory and its backing pages can be sent out to disk at any time. This means that we cannot map the entire array into memory at once, so we must find a way to divide the work into smaller portions (e.g. a page) that /can/ be mapped into memory. Quicksort seems like a reasonable fit for this purpose, since it uses a divide and conquer strategy to keep its average runtime logarithmic. The solution presented here is a port of the glibc implementation, which itself is derived from the median-of-three and tail call recursion strategies outlined by Sedgwick. Subsequent patches will optimize the implementation further by utilizing the kernel's heapsort on directly-mapped memory whenever possible, and improving the quicksort pivot selection algorithm to try to avoid O(n^2) collapses. Note: The sorting functionality gets its own patch because the basic big array mechanisms were plenty for a single code patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: create a big array data structureDarrick J. Wong
Create a simple 'big array' data structure for storage of fixed-size metadata records that will be used to reconstruct a btree index. For repair operations, the most important operations are append, iterate, and sort. Earlier implementations of the big array used linked lists and suffered from severe problems -- pinning all records in kernel memory was not a good idea and frequently lead to OOM situations; random access was very inefficient; and record overhead for the lists was unacceptably high at 40-60%. Therefore, the big memory array relies on the 'xfile' abstraction, which creates a memfd file and stores the records in page cache pages. Since the memfd is created in tmpfs, the memory pages can be pushed out to disk if necessary and we have a built-in usage limit of 50% of physical memory. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: move btree bulkload record initialization to ->get_record implementationsrepair-prep-for-bulk-loading_2022-10-14Darrick J. Wong
When we're performing a bulk load of a btree, move the code that actually stores the btree record in the new btree block out of the generic code and into the individual ->get_record implementations. This is preparation for being able to store multiple records with a single indirect call. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: log EFIs for all btree blocks being used to stage a btreeDarrick J. Wong
We need to log EFIs for every extent that we allocate for the purpose of staging a new btree so that if we fail then the blocks will be freed during log recovery. Add a function to relog the EFIs, so that repair can relog them all every time it creates a new btree block, which will help us to avoid pinning the log tail. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: add debug knobs to control btree bulk load slack factorsDarrick J. Wong
Add some debug knobs so that we can control the leaf and node block slack when rebuilding btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: implement block reservation accounting for btrees we're stagingDarrick J. Wong
Create a new xrep_newbt structure to encapsulate a fake root for creating a staged btree cursor as well as to track all the blocks that we need to reserve in order to build that btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: force all buffers to be written during btree bulk loadDarrick J. Wong
While stress-testing online repair of btrees, I noticed periodic assertion failures from the buffer cache about buffer readers encountering buffers with DELWRI_Q set, even though the btree bulk load had already committed and the buffer itself wasn't on any delwri list. I traced this to a misunderstanding of how the delwri lists work, particularly with regards to the AIL's buffer list. If a buffer is logged and committed, the buffer can end up on that AIL buffer list. If btree repairs are run twice in rapid succession, it's possible that the first repair will invalidate the buffer and free it before the next time the AIL wakes up. This clears DELWRI_Q from the buffer state. If the second repair allocates the same block, it will then recycle the buffer to start writing the new btree block. Meanwhile, if the AIL wakes up and walks the buffer list, it will ignore the buffer because it can't lock it, and go back to sleep. When the second repair calls delwri_queue to put the buffer on the list of buffers to write before committing the new btree, it will set DELWRI_Q again, but since the buffer hasn't been removed from the AIL's buffer list, it won't add it to the bulkload buffer's list. This is incorrect, because the bulkload caller relies on delwri_submit to ensure that all the buffers have been sent to disk /before/ committing the new btree root pointer. This ordering requirement is required for data consistency. Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally drop it, so the next thread to walk through the btree will trip over a debug assertion on that flag. To fix this, create a new function that waits for the buffer to be removed from any other delwri lists before adding the buffer to the caller's delwri list. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: use per-AG bitmaps to reap unused AG metadata blocks during repairrepair-reap-fixes_2022-10-14Darrick J. Wong
The AGFL repair code uses a series of bitmaps to figure out where there are OWN_AG blocks that are not claimed by the free space and rmap btrees. These blocks become the new AGFL, and any overflow is reaped. The bitmaps current track xfs_fsblock_t even though we already know the AG number. In the last patch, we introduced a new bitmap "type" for tracking xfs_agblock_t extents. Port the reaping code and the AGFL repair to use this new type, which makes it very obvious what we're tracking. This also eliminates a bunch of unnecessary agblock <-> fsblock conversions. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: reap large AG metadata extents when possibleDarrick J. Wong
When we're freeing extents that have been set in a bitmap, break the bitmap extent into multiple sub-extents organized by fate, and reap the extents. This enables us to dispose of old resources more efficiently than doing them block by block. While we're at it, rename the reaping functions to make it clear that they're reaping per-AG extents. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: ignore stale buffers when scanning the buffer cacheDarrick J. Wong
After an online repair, we need to invalidate buffers representing the blocks from the old metadata that we're replacing. It's possible that parts of a tree that were previously cached in memory are no longer accessible due to media failure or other corruption on interior nodes, so repair figures out the old blocks from the reverse mapping data and scans the buffer cache directly. Unfortunately, the current buffer cache code triggers asserts if the rhashtable lookup finds a non-stale buffer of a different length than the key we searched for. For regular operation this is desirable, but for this repair procedure, we don't care since we're going to forcibly stale the buffer anyway. Add an internal lookup flag to avoid the assert. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: rearrange xrep_reap_block to make future code flow easierDarrick J. Wong
Rearrange the logic inside xrep_reap_block to make it more obvious that crosslinked metadata blocks are handled differently. Add a couple of tracepoints so that we can tell what's going on at the end of a btree rebuild operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: use deferred frees to reap old btree blocksDarrick J. Wong
Use deferred frees (EFIs) to reap the blocks of a btree that we just replaced. This helps us to shrink the window in which those old blocks could be lost due to a system crash, though we try to flush the EFIs every few hundred blocks so that we don't also overflow the transaction reservations during and after we commit the new btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: only allow reaping of per-AG blocks in xrep_reap_extentsDarrick J. Wong
Now that we've refactored btree cursors to require the caller to pass in a perag structure, there are numerous problems in xrep_reap_extents if it's being called to reap extents for an inode metadata repair. We don't have any repair functions that can do that, so drop the support for now. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: only invalidate blocks if we're going to free themDarrick J. Wong
When we're discarding old btree blocks after a repair, only invalidate the buffers for the ones that we're freeing -- if the metadata was crosslinked with another data structure, we don't want to touch it. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: move the post-repair block reaping code to a separate fileDarrick J. Wong
Reaping blocks after a repair is a complicated affair involving a lot of rmap btree lookups and figuring out if we're going to unmap or free old metadata blocks that might be crosslinked. Eventually, we will need to be able to reap per-AG metadata blocks, bmbt blocks from inode forks, garbage CoW staging extents, and (even later) blocks from btrees rooted in inodes. This results in a lot of reaping code, so we might as well split that off while it's easy. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cull repair code that will never get usedDarrick J. Wong
These two functions date from the era when I thought that we could rebuild btrees by creating an alternate root and adding records one by one. In other words, they predate the btree bulk loader. They're not necessary now, so remove them. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with refcount btreesscrub-strengthen-rmap-checking_2022-10-14Darrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_REFCBT reverse mappings against the refcount btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with inode btreesDarrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_INOBT reverse mappings against the inode btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with free space btreesDarrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_AG reverse mappings against the free space btrees, the rmap btree, and the AGFL. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with ag btreesDarrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_FS and OWN_LOG reverse mappings against the AG headers and internal logs, respectively. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: introduce bitmap type for AG blocksDarrick J. Wong
Create a typechecked bitmap for extents within an AG. Online repair uses bitmaps to store various different types of numbers, so let's make it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t) versus anything else. In subsequent patches, we're going to use agblock bitmaps to enhance the rmapbt checker to look for discrepancies between the rmapbt records and AG metadata block usage. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: convert xbitmap to interval treerepair-bitmap-rework_2022-10-14Darrick J. Wong
Convert the xbitmap code to use interval trees instead of linked lists. This reduces the amount of coding required to handle the disunion operation and in the future will make it easier to set bits in arbitrary order yet later be able to extract maximally sized extents, which we'll need for rebuilding certain structures. We define our own interval tree type so that it can deal with 64-bit indices even on 32-bit machines. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: drop the _safe behavior from the xbitmap foreach macroDarrick J. Wong
It's not safe to edit bitmap intervals while we're iterating them with for_each_xbitmap_extent. None of the existing callers actually need that ability anyway, so drop the safe variable. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: remove the for_each_xbitmap_ helpersDarrick J. Wong
Remove the for_each_xbitmap_ macros in favor of proper iterator functions. We'll soon be switching this data structure over to an interval tree implementation, which means that we can't allow callers to modify the bitmap during iteration without telling us. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: only allocate free space bitmap for xattr scrub if neededscrub-fix-xattr-memory-mgmt_2022-10-14Darrick J. Wong
The free space bitmap is only required if we're going to check the bestfree space at the end of an xattr leaf block. Therefore, we can reduce the memory requirements of this scrubber if we can determine that the xattr is in short format. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: clean up xattr scrub initializationDarrick J. Wong
Clean up local variable initialization and error returns in xchk_xattr. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check used space of shortform xattr structuresDarrick J. Wong
Make sure that the records used inside a shortform xattr structure do not overlap. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: move xattr scrub buffer allocation to top level functionDarrick J. Wong
Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr, since we only need to set up the leaf block bitmaps once. Signed-off-by: Darrick J. Wong <djwong@kernel.org>