Age  Commit message  Author
2022-11-09  xfs: support in-memory btrees  Darrick J. Wong
Adapt the generic btree cursor code to be able to create a btree whose buffers come from a (presumably in-memory) buftarg with a header block that's specific to in-memory btrees. We'll connect this to other parts of online scrub in the next patches. Note that in-memory btrees always have a block size matching the system memory page size for efficiency reasons. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: consolidate btree block allocation tracepoints  Darrick J. Wong
Don't waste tracepoint segment memory on per-btree block allocation tracepoints when we can do it from the generic btree code. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: consolidate btree block freeing tracepoints  Darrick J. Wong
Don't waste tracepoint segment memory on per-btree block freeing tracepoints when we can do it from the generic btree code. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: support in-memory buffer cache targets  Darrick J. Wong
Allow the buffer cache to target in-memory files by connecting it to xfiles. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: teach buftargs to maintain their own buffer hashtable  Darrick J. Wong
Currently, cached buffers are indexed by per-AG hashtables. This works great for the data device, but won't work for in-memory btrees. Make it so that buftargs can index buffers too. Introduce XFS_BSTATE_CACHED as an explicit state flag for buffers that are cached in an rhashtable, since we can't rely on b_pag being set for buffers that are cached but not on behalf of an AG. We'll soon be using the buffer cache for xfiles. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: dump xfiles for debugging purposes  Darrick J. Wong
Add a debug function to dump an xfile's contents. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair summary counters  [repair-fscounters_2022-11-09]  Darrick J. Wong
Use the same summary counter calculation infrastructure to generate new values for the in-core summary counters. The difference between the scrubber and the repairer is that the repairer will freeze the fs during setup, which means that the values should match exactly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: remove XCHK_REAPING_DISABLED from scrub  Darrick J. Wong
Nobody uses this code anymore, so get rid of it. It was racy with regards to freezes and remounts anyway. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: stabilize fs summary counters for online fsck  Darrick J. Wong
If the fscounters scrubber notices incorrect summary counters, it's entirely possible that scrub is simply racing with other threads that are updating the incore counters. There isn't a good way to stabilize percpu counters or set ourselves up to observe live updates with hooks like we do for the quotacheck or nlinks scanners, so we instead choose to freeze the filesystem long enough to walk the incore per-AG structures. Past me thought that it was going to be commonplace to have to freeze the filesystem to perform some kind of repair and set up a whole separate infrastructure to freeze the filesystem in such a way that userspace could not unfreeze while we were running. This involved adding a mutex and freeze_super/thaw_super functions and dealing with the fact that the VFS freeze/thaw functions can free the VFS superblock references on return. This was all very overwrought, since fscounters turned out to be the only user of scrub freezes, and it doesn't require the log to quiesce, only the incore superblock counters. We prevent other threads from changing the freeze level by adding a new SB_FREEZE_EXCLUSIVE level. The end result is that fscounters should be much more efficient. When we're checking a busy system and we can't stabilize the counters, the custom freeze will do less work, which should result in less downtime. Repair should be similarly speedy, but that's in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: update health status if we get a clean bill of health  [indirect-health-reporting_2022-11-09]  Darrick J. Wong
If scrub finds that everything is ok with the filesystem, we need a way to tell the health tracking that it can let go of indirect health flags, since indirect flags only mean that at some point in the past we lost some context. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: remember sick inodes that get inactivated  Darrick J. Wong
If an unhealthy inode gets inactivated, remember this fact in the per-fs health summary. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: add secondary and indirect classes to the health tracking system  Darrick J. Wong
Establish two more classes of health tracking bits:
* Indirect problems, which suggest problems in other health domains that we weren't able to preserve; and
* Secondary problems, which track state that's related to primary evidence of health problems.
The first class we'll use in an upcoming patch to record in the AG health status the fact that we ran out of memory and had to inactivate an inode with defective metadata. The second class we use to indicate that repair knows that an inode is bad and we need to fix it later. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
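The three classes can be pictured as disjoint bit masks over a single health word. The following is only a userspace sketch: every MODEL_* name and bit value is hypothetical and is not taken from the patch. It also shows the idea from the "clean bill of health" entry in this log, where a clean scrub lets go of the indirect bits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit assignments, for illustration only. */
#define MODEL_SICK_CORE		(1u << 0)	/* primary: inode core is bad */
#define MODEL_SICK_BMAP		(1u << 1)	/* primary: block map is bad */
#define MODEL_SICK_FIX_LATER	(1u << 2)	/* secondary: repair knows, fix later */
#define MODEL_SICK_LOST_CONTEXT	(1u << 3)	/* indirect: sick inode was inactivated */

#define MODEL_PRIMARY	(MODEL_SICK_CORE | MODEL_SICK_BMAP)
#define MODEL_SECONDARY	(MODEL_SICK_FIX_LATER)
#define MODEL_INDIRECT	(MODEL_SICK_LOST_CONTEXT)

/* True if any primary evidence of a problem is recorded. */
static bool model_has_primary(unsigned int sick)
{
	return (sick & MODEL_PRIMARY) != 0;
}

/* True if we only remember that some context was lost. */
static bool model_has_indirect(unsigned int sick)
{
	return (sick & MODEL_INDIRECT) != 0;
}

/* A clean scrub may drop indirect bits; other classes are untouched. */
static unsigned int model_clear_indirect(unsigned int sick)
{
	return sick & ~MODEL_INDIRECT;
}
```

Keeping the classes in separate masks means a reporting path can choose which classes to expose without inspecting individual bits.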
2022-11-09  xfs: report XFS_IS_CORRUPT errors to the health system  [corruption-health-reports_2022-11-09]  Darrick J. Wong
Whenever we encounter XFS_IS_CORRUPT failures, we should report that to the health monitoring system for later reporting. I started with this semantic patch and massaged everything until it built:
    @@
    expression mp, test;
    @@
    - if (XFS_IS_CORRUPT(mp, test)) return -EFSCORRUPTED;
    + if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; }
    @@
    expression mp, test;
    identifier label, error;
    @@
    - if (XFS_IS_CORRUPT(mp, test)) { error = -EFSCORRUPTED; goto label; }
    + if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; }
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report realtime metadata corruption errors to the health system  Darrick J. Wong
Whenever we encounter corrupt realtime metadata blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report quota block corruption errors to the health system  Darrick J. Wong
Whenever we encounter corrupt quota blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report inode corruption errors to the health system  Darrick J. Wong
Whenever we encounter corrupt inode records, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report symlink block corruption errors to the health system  Darrick J. Wong
Whenever we encounter corrupt symbolic link blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report dir/attr block corruption errors to the health system  Darrick J. Wong
Whenever we encounter corrupt directory or extended attribute blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report btree block corruption errors to the health system  Darrick J. Wong
Whenever we encounter corrupt btree blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report block map corruption errors to the health tracking system  Darrick J. Wong
Whenever we encounter a corrupt block mapping, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report ag header corruption errors to the health tracking system  Darrick J. Wong
Whenever we encounter a corrupt AG header, we should report that to the health monitoring system for later reporting. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report fs corruption errors to the health tracking system  Darrick J. Wong
Whenever we encounter corrupt fs metadata, we should report that to the health monitoring system for later reporting. A convenient program for identifying places to insert xfs_*_mark_sick calls is as follows:
    #!/bin/bash
    # Detect missing calls to xfs_*_mark_sick
    filter=cat
    tty -s && filter=less
    git grep -B3 EFSCORRUPTED fs/xfs/*.[ch] fs/xfs/libxfs/*.[ch] fs/xfs/scrub/*.[ch] | awk '
    BEGIN {
        ignore = 0;
        lineno = 0;
        delete lines;
    }
    {
        if ($0 == "--") {
            if (!ignore) {
                for (i = 0; i < lineno; i++) {
                    print(lines[i]);
                }
                printf("--\n");
            }
            delete lines;
            lineno = 0;
            ignore = 0;
        } else if ($0 ~ /mark_sick/) {
            ignore = 1;
        } else if ($0 ~ /if .fa/) {
            ignore = 1;
        } else if ($0 ~ /failaddr/) {
            ignore = 1;
        } else if ($0 ~ /_verifier_error/) {
            ignore = 1;
        } else if ($0 ~ /^ \* .*EFSCORRUPTED/) {
            ignore = 1;
        } else if ($0 ~ /== -EFSCORRUPTED/) {
            ignore = 1;
        } else if ($0 ~ /!= -EFSCORRUPTED/) {
            ignore = 1;
        } else {
            lines[lineno++] = $0;
        }
    }
    ' | $filter
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: separate the marking of sick and checked metadata  Darrick J. Wong
Split the setting of the sick and checked masks into separate functions as part of preparing to add the ability for regular runtime fs code (i.e. not scrub) to mark metadata structures sick when corruptions are found. Improve the documentation of libxfs' requirements for helper behavior. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: teach repair to fix file nlinks  [scrub-nlinks_2022-11-09]  Darrick J. Wong
Fix the nlinks now too. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: track file link count updates during live nlinks fsck  Darrick J. Wong
Create the necessary hooks in the file create/unlink/rename code so that our live nlink scrub code can stay up to date with the rest of the filesystem. This will be the means to keep our shadow link count information up to date while the scan runs in real time. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: teach scrub to check file nlinks  Darrick J. Wong
Create the necessary scrub code to walk the filesystem's directory tree so that we can compute file link counts. Similar to quotacheck, we create an incore shadow array of link count information and then we walk the filesystem a second time to compare the link counts. We need live updates to keep the information up to date during the lengthy scan, so this scrubber remains disabled until the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
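The two-pass idea described above (collect shadow link counts from directory entries, then compare them with the recorded counts) can be sketched in userspace as follows. This is a toy model under loud assumptions: inode numbers are small array indexes, "." and ".." handling is ignored, and all names (model_dirent, collect_links, compare_links) are hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_INODES 8	/* toy filesystem size (assumption) */

/* One directory entry: parent directory inode -> child inode. */
struct model_dirent {
	uint32_t parent;
	uint32_t child;
};

/* Pass 1: walk every directory entry and bump the child's shadow count. */
static void collect_links(const struct model_dirent *ents, size_t nr,
			  uint32_t *shadow)
{
	for (size_t i = 0; i < nr; i++)
		if (ents[i].child < MODEL_NR_INODES)
			shadow[ents[i].child]++;
}

/* Pass 2: count inodes whose recorded nlink disagrees with the shadow. */
static size_t compare_links(const uint32_t *ondisk, const uint32_t *shadow)
{
	size_t bad = 0;

	for (size_t ino = 0; ino < MODEL_NR_INODES; ino++)
		if (ondisk[ino] != shadow[ino])
			bad++;
	return bad;
}
```

In the real scrubber the filesystem keeps changing between the two passes, which is why the commit message says live updates are needed before the check can be enabled.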
2022-11-09  xfs: streamline the directory iteration code for scrub  Darrick J. Wong
Currently, online scrub reuses the xfs_readdir code to walk every entry in a directory. This isn't awesome for performance, since we end up cycling the directory ILOCK needlessly and coding around the particular quirks of the VFS dir_context interface. Create a streamlined version of readdir that keeps the ILOCK (since the walk function isn't going to copy stuff to userspace), skips a whole lot of directory walk cursor checks (since we start at 0 and walk to the end) and has a sane way to return error codes. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report health of inode link counts  Darrick J. Wong
Report on the health of the inode link counts. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair dquots based on live quotacheck results  [repair-quotacheck_2022-11-09]  Darrick J. Wong
Use the shadow quota counters that live quotacheck creates to reset the incore dquot counters. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair cannot update the summary counters when logging quota flags  Darrick J. Wong
While running xfs/804 (quota repairs racing with fsstress), I observed a filesystem shutdown in the primary sb write verifier:
    run fstests xfs/804 at 2022-05-23 18:43:48
    XFS (sda4): Mounting V5 Filesystem
    XFS (sda4): Ending clean mount
    XFS (sda4): Quotacheck needed: Please wait.
    XFS (sda4): Quotacheck: Done.
    XFS (sda4): EXPERIMENTAL online scrub feature in use. Use at your own risk!
    XFS (sda4): SB ifree sanity check failed 0xb5 > 0x80
    XFS (sda4): Metadata corruption detected at xfs_sb_write_verify+0x5e/0x100 [xfs], xfs_sb block 0x0
    XFS (sda4): Unmount and run xfs_repair
The "SB ifree sanity check failed" message was a debugging printk that I added to the kernel; observe that 0xb5 - 0x80 = 53, which is less than one inode chunk.
I traced this to the xfs_log_sb calls from the online quota repair code, which tries to clear the CHKD flags from the superblock to force a mount-time quotacheck if the repair fails. On a V5 filesystem, xfs_log_sb updates the ondisk sb summary counters with the current contents of the percpu counters. This is done without quiescing other writer threads, which means it could be racing with a thread that has updated icount and is about to update ifree. If the other write thread had incremented ifree before updating icount, the repair thread will write ifree > icount into the logged update. If the AIL writes the logged superblock back to disk before anyone else fixes this situation, this will lead to a write verifier failure, which causes a filesystem shutdown.
Resolve this problem by updating the quota flags and calling xfs_sb_to_disk directly, which does not touch the percpu counters. While we're at it, we can elide the entire update if the selected qflags aren't set.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
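The arithmetic in the log can be checked with a small userspace sketch. The function below is a hypothetical stand-in for the sanity check that fired (free inodes must not exceed allocated inodes); it is not the kernel's verifier. The constant 64 is the number of inodes in an XFS inode chunk, and the MODEL_ name is made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_INODES_PER_CHUNK 64	/* inodes per XFS inode chunk */

/*
 * Hypothetical stand-in for the superblock sanity check: a snapshot of
 * the summary counters is plausible only if ifree <= icount.
 */
static bool sb_counters_sane(uint64_t icount, uint64_t ifree)
{
	return ifree <= icount;
}
```

With the values from the log, icount = 0x80 and ifree = 0xb5, the check fails by 53 inodes, which is less than one chunk and therefore consistent with a snapshot taken between two halves of a counter update rather than with long-standing corruption.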
2022-11-09  xfs: track quota updates during live quotacheck  Darrick J. Wong
Create a shadow dqtrx system in the quotacheck code that hooks the regular dquot counter update code. This will be the means to keep our copy of the dquot counters up to date while the scan runs in real time. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: implement live quotacheck inode scan  Darrick J. Wong
Create a new trio of scrub functions to check quota counters. While the dquots themselves are filesystem metadata and should be checked early, the dquot counter values are computed from other metadata and are therefore summary counters. We don't plug these into the scrub dispatch just yet, because we still need to be able to watch quota updates while doing our scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: report the health of quota counts  Darrick J. Wong
Report the health of quota counts. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: allow blocking notifier chains with filesystem hooks  [scrub-iscan_2022-11-09]  Darrick J. Wong
Make it so that we can switch between notifier chain implementations for testing purposes. On the author's test system, calling an empty srcu notifier chain cost about 19ns per call, vs. 4ns for a blocking notifier chain. Hm. Might we actually want regular blocking notifiers? Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: allow scrub to hook metadata updates in other writers  Darrick J. Wong
Certain types of filesystem metadata can only be checked by scanning every file in the entire filesystem. Specific examples of this include quota counts, file link counts, and reverse mappings of file extents. Directory and parent pointer reconstruction may also fall into this category. File scanning is much trickier than scanning AG metadata because we have to take inode locks in the same order as the rest of [VX]FS, we can't be holding buffer locks when we do that, and scanning the whole filesystem takes time. Earlier versions of the online repair patchset relied heavily on fsfreeze as a means to quiesce the filesystem so that we could take locks in the proper order without worrying about concurrent updates from other writers. Reviewers of those patches opined that freezing the entire fs to check and repair something was not sufficiently better than unmounting to run fsck offline. I don't agree with that 100%, but the message was clear: find a way to repair things that minimizes the quiet period where nobody can write to the filesystem. Generally, building btree indexes online can be split into two phases: a collection phase where we compute the records that will be put into the new btree; and a construction phase, where we construct the physical btree blocks and persist them. While it's simple to hold resource locks for the entirety of the two phases to ensure that the new index is consistent with the rest of the system, we don't need to hold resource locks during the collection phase if we have a means to receive live updates of other work going on elsewhere in the system. The goal of this patch, then, is to enable online fsck to learn about metadata updates going on in other threads while it constructs a shadow copy of the metadata records to verify or correct the real metadata. 
To minimize the overhead when online fsck isn't running, we use srcu notifiers because they prioritize fast access to the notifier call chain (particularly when the chain is empty) at a cost to configuring notifiers. Online fsck should be relatively infrequent, so this is acceptable.
The intended usage model is fairly simple. Code that modifies a metadata structure of interest should declare a xfs_hook_chain structure in some well defined place, and call xfs_hook_call whenever an update happens. Online fsck code should define a struct notifier_block and use xfs_hook_add to attach the block to the chain, along with a function to be called. This function should synchronize with the fsck scanner to update whatever in-memory data the scanner is collecting. When finished, xfs_hook_del removes the notifier from the list and waits for them all to complete.
On the author's computer, calling an empty srcu notifier chain was observed to have an overhead averaging ~40ns with a maximum of 60ns. Adding a no-op notifier function increased the average to ~58ns with a maximum of 66ns. When the quotacheck live update notifier is attached, the average increases to ~322ns with a max of 372ns to update scrub's in-memory observation data, assuming no lock contention.
With jump labels enabled, calls to empty srcu notifier chains are elided from the call sites when there are no hooks registered, which means that the overhead is 0.36ns when fsck is not running. For compilers that do not support jump labels (all major architectures do), the overhead of a no-op notifier call is less bad (on a many-cpu system) than the atomic counter ops, so we make the hook switch itself a nop.
Note: This new code is also split out as a separate patch from its initial user so that the author can move patches around his tree with ease.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
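The usage model above (declare a chain, call it on every update, let fsck add and later remove a callback) can be sketched as a single-threaded userspace model. This is not the kernel implementation: it has no SRCU, no jump labels, and no waiting for in-flight callbacks, and every model_* and shadow_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One registered callback; a stand-in for a notifier block. */
struct model_hook {
	int (*fn)(struct model_hook *hook, unsigned long action, void *data);
	struct model_hook *next;
};

/* The chain that the metadata-modifying code owns. */
struct model_hook_chain {
	struct model_hook *head;
};

static void model_hook_add(struct model_hook_chain *chain,
			   struct model_hook *hook)
{
	hook->next = chain->head;
	chain->head = hook;
}

static void model_hook_del(struct model_hook_chain *chain,
			   struct model_hook *hook)
{
	struct model_hook **pp = &chain->head;

	while (*pp && *pp != hook)
		pp = &(*pp)->next;
	if (*pp)
		*pp = hook->next;
}

/* Called by writers on every update; returns how many hooks ran. */
static int model_hook_call(struct model_hook_chain *chain,
			   unsigned long action, void *data)
{
	int calls = 0;

	for (struct model_hook *h = chain->head; h; h = h->next) {
		h->fn(h, action, data);
		calls++;
	}
	return calls;
}

/* Scanner-side state: the hook is the first member so we can cast back. */
struct shadow_counter {
	struct model_hook hook;
	long total;
};

/* Fold a writer's delta into the scanner's shadow copy. */
static int shadow_update(struct model_hook *hook, unsigned long action,
			 void *data)
{
	struct shadow_counter *sc = (struct shadow_counter *)hook;

	(void)action;
	sc->total += *(long *)data;
	return 0;
}
```

An empty chain costs only the loop test in this model; the commit message's measurements concern the much more careful kernel version of the same shape.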
2022-11-09  xfs: implement live inode scan for scrub  Darrick J. Wong
This patch implements a live file scanner for online fsck functions that require the ability to walk a filesystem to gather metadata records and stay informed about metadata changes to files that have already been visited.
The iscan structure consists of two inode number cursors: one to track which inode we want to visit next, and a second one to track which inodes have already been visited. This second cursor is key to capturing live updates to files previously scanned while the main thread continues scanning -- any inode greater than this value hasn't been scanned and can go on its way; any other update must be incorporated into the collected data. It is critical for the scanning thread to hold exclusive access on the inode until after marking the inode visited.
This new code is split out as a separate patch from its initial user for the sake of enabling the author to move patches around his tree with ease. The intended usage model for this code is roughly:
    xchk_iscan_start(iscan, 0, 0);
    while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
        xfs_ilock(ip, ...);
        /* capture inode metadata */
        xchk_iscan_mark_visited(iscan, ip);
        xfs_iunlock(ip, ...);
        xfs_irele(ip);
    }
    xchk_iscan_stop(iscan);
    if (error)
        return error;
Hook functions for live updates can then do:
    if (xchk_iscan_want_live_update(...))
        /* update the captured inode metadata */
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
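The role of the second cursor can be shown with a simplified userspace model. It assumes inodes are visited in strictly increasing order with no wraparound and no locking, which the real scanner does not assume; the struct and function names are hypothetical and only loosely mirror the xchk_iscan_* calls above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the two cursors described above. */
struct iscan_model {
	uint64_t next_ino;	/* next inode the scanner wants to visit */
	uint64_t visited_ino;	/* highest inode already marked visited */
	bool any_visited;	/* false until the first inode is marked */
};

/* Scanner side: record that @ino has been fully captured. */
static void iscan_model_mark_visited(struct iscan_model *s, uint64_t ino)
{
	s->visited_ino = ino;
	s->any_visited = true;
	s->next_ino = ino + 1;
}

/*
 * Hook side: an update to an inode beyond the visited cursor can be
 * ignored, because the scanner will see the new state when it gets
 * there. Anything at or below the cursor must be folded into the data
 * that was already collected.
 */
static bool iscan_model_want_live_update(const struct iscan_model *s,
					 uint64_t ino)
{
	if (!s->any_visited)
		return false;
	return ino <= s->visited_ino;
}
```

The model also hints at why the scanner must hold the inode exclusively until it is marked visited: otherwise an update could land after capture but before the cursor moves, and be missed by both paths.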
2022-11-09  xfs: speed up xfs_iwalk_adjust_start a little bit  Darrick J. Wong
Replace the open-coded loop that recomputes freecount with a single call to a bit weight function. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
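As an illustration of the general technique (not the patch itself), the following userspace sketch counts set bits in a 64-bit free mask two ways. The function names are hypothetical, and __builtin_popcountll is the GCC/Clang builtin used here as a stand-in for a kernel bit weight helper such as hweight64(); which helper the patch calls is not stated in this log.

```c
#include <assert.h>
#include <stdint.h>

/* Open-coded version: test each of the 64 bits in turn. */
static unsigned int freecount_loop(uint64_t free_mask)
{
	unsigned int count = 0;

	for (int i = 0; i < 64; i++)
		if (free_mask & (UINT64_C(1) << i))
			count++;
	return count;
}

/* Bit weight version: a single population count over the same mask. */
static unsigned int freecount_weight(uint64_t free_mask)
{
	return (unsigned int)__builtin_popcountll(free_mask);
}
```

Both functions return the same value for every mask; the second typically compiles to a single instruction on hardware that has one.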
2022-11-09  xfs: repair quotas  [repair-quota_2022-11-09]  Darrick J. Wong
Fix anything that causes the quota verifiers to fail. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: online repair of realtime bitmaps  Darrick J. Wong
Rebuild the realtime bitmap from the realtime rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: create a new inode fork block unmap helper  Darrick J. Wong
Create a new helper to unmap blocks from an inode's fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair the inode core and forks of a metadata inode  Darrick J. Wong
Add a helper function to repair the core and forks of a metadata inode, so that we can move on to the task of repairing higher level metadata that lives in an inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair problems in CoW forks  [repair-file-mappings_2022-11-09]  Darrick J. Wong
Try to repair errors that we see in file CoW forks so that we don't do stupid things like remap garbage into a file. There's not a lot we can do with the COW fork -- the ondisk metadata records only that the COW staging extents are owned by the refcount btree, which effectively means that we can't reconstruct this incore structure from scratch. Actually, this is even worse -- we can't touch written extents, because those map space that is actively under writeback, and there's not much to do with delalloc reservations. Hence we can only detect crosslinked unwritten extents and fix them by punching out the problematic parts and replacing them with delalloc extents. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: create a ranged query function for refcount btrees  Darrick J. Wong
Implement ranged queries for refcount records. The next patch will use this to scan refcount data. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: refactor repair forcing tests into a repair.c helper  Darrick J. Wong
There are a couple of conditions that userspace can set to force repairs of metadata. These really belong in the repair code and not open-coded into the check code, so refactor them into a helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair inode fork block mapping data structures  Darrick J. Wong
Use the reverse-mapping btree information to rebuild an inode block map. Update the btree bulk loading code as necessary to support inode rooted btrees and fix some bitrot problems. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: reintroduce reaping of file metadata blocks to xrep_reap_extents  Darrick J. Wong
Reintroduce to xrep_reap_extents the ability to reap extents from any AG. We dropped this before because it was buggy, but in the next patch we will gain the ability to reap old bmap btrees, which can have blocks in any AG. To do this, we require that sc->sa is uninitialized, so that we can use it to hold all the per-AG context for a given extent. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair obviously broken inode modes  [repair-inodes_2022-11-09]  Darrick J. Wong
Building off the rmap scanner that we added in the previous patch, we can now find block 0 and try to use the information contained inside of it to guess the mode of an inode if it's totally improper. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: zap broken inode forks  Darrick J. Wong
Determine if inode fork damage is responsible for the inode being unable to pass the ifork verifiers in xfs_iget and zap the fork contents if this is true. Once this is done the fork will be empty but we'll be able to construct an in-core inode, and a subsequent call to the inode fork repair ioctl will search the rmapbt to rebuild the records that were in the fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: repair inode records  Darrick J. Wong
If an inode is so badly damaged that it cannot be loaded into the cache, fix the ondisk metadata and try again. If there /is/ a cached inode, fix any problems and apply any optimizations that can be solved incore. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-11-09  xfs: try to attach dquots to files before repairing them  Darrick J. Wong
Soon, we will be adding the ability to repair inodes. Inode resource usage is tracked in quota, which means that if we think we might have to repair a file, we ought to attach dquots from the start. Do this before we take the file's ILOCK, though we don't require success here because quota itself could also be in need of repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org>