path: root/fs/xfs/scrub
Age  Commit message  Author
2023-10-19  xfs: simplify rt bitmap/summary block accessor functions  (Darrick J. Wong)
Simplify the calling convention of these functions since the xfs_rtalloc_args structure contains the parameters we need. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-19  xfs: simplify xfs_rtbuf_get calling conventions  (Darrick J. Wong)
Now that xfs_rtalloc_args holds references to the last-read bitmap and summary blocks, we don't need to pass the buffer pointer out of xfs_rtbuf_get. Callers no longer have to xfs_trans_brelse on their own, though they are required to call xfs_rtbuf_cache_relse before the xfs_rtalloc_args goes out of scope. While we're at it, create some trivial helpers so that we don't have to remember if "0" means "bitmap" and "1" means "summary". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
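A sketch of the trivial helpers mentioned above might look like this (the names and the exact xfs_rtbuf_get() signature are assumptions for illustration, not the literal kernel code):

    /* Illustrative wrappers so callers never pass a bare 0/1 for bitmap vs summary. */
    static inline int
    xfs_rtbitmap_read_buf(struct xfs_rtalloc_args *args, xfs_fileoff_t block)
    {
            return xfs_rtbuf_get(args, block, 0);   /* 0 == realtime bitmap file */
    }

    static inline int
    xfs_rtsummary_read_buf(struct xfs_rtalloc_args *args, xfs_fileoff_t block)
    {
            return xfs_rtbuf_get(args, block, 1);   /* 1 == realtime summary file */
    }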
2023-10-19  xfs: cache last bitmap block in realtime allocator  (Omar Sandoval)
Profiling a workload on a highly fragmented realtime device showed a ton of CPU cycles being spent in xfs_trans_read_buf() called by xfs_rtbuf_get(). Further tracing showed that much of that was repeated calls to xfs_rtbuf_get() for the same block of the realtime bitmap. These come from xfs_rtallocate_extent_block(): as it walks through ranges of free bits in the bitmap, each call to xfs_rtcheck_range() and xfs_rtfind_{forw,back}() gets the same bitmap block. If the bitmap block is very fragmented, then this is _a lot_ of buffer lookups. The realtime allocator already passes around a cache of the last used realtime summary block to avoid repeated reads (the parameters rbpp and rsb). We can do the same for the realtime bitmap. This replaces rbpp and rsb with a struct xfs_rtbuf_cache, which caches the most recently used block for both the realtime bitmap and summary. xfs_rtbuf_get() now handles the caching instead of the callers, which requires plumbing xfs_rtbuf_cache to more functions but also makes sure we don't miss anything. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
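As a rough sketch of the idea (field and helper names here are illustrative, not the exact kernel code), the cache lets xfs_rtbuf_get() skip the buffer lookup whenever the caller asks for the same block it just read:

    /* Illustrative sketch only; the real fields and helpers in fs/xfs differ. */
    struct xfs_rtbuf_cache {
            xfs_fileoff_t   block;  /* last rtbitmap/rtsummary block read */
            struct xfs_buf  *bp;    /* buffer for that block, or NULL */
    };

    static int
    xfs_rtbuf_get(struct xfs_rtalloc_args *args, struct xfs_rtbuf_cache *cache,
                    xfs_fileoff_t block, int issum, struct xfs_buf **bpp)
    {
            int     error;

            if (cache->bp && cache->block == block) {
                    *bpp = cache->bp;       /* cache hit: no repeated buffer lookup */
                    return 0;
            }
            if (cache->bp)
                    xfs_trans_brelse(args->tp, cache->bp);  /* drop the old block */

            error = xfs_rtbuf_read(args, block, issum, &cache->bp); /* hypothetical raw read */
            if (error) {
                    cache->bp = NULL;
                    return error;
            }
            cache->block = block;
            *bpp = cache->bp;
            return 0;
    }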
2023-10-18  xfs: consolidate realtime allocation arguments  (Dave Chinner)
Consolidate the arguments passed around the rt allocator into a struct xfs_rtalloc_args similar to how the btree allocator arguments are consolidated in a struct xfs_alloc_arg.... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-18  xfs: use accessor functions for summary info words  [refactor-rtbitmap-accessors-6.7_2023-10-19]  (Darrick J. Wong)
Create get and set functions for rtsummary words so that we can redefine the ondisk format with a specific endianness. Note that this requires the definition of a distinct type for ondisk summary info words so that the compiler can perform proper typechecking. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
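A sketch of the accessor pattern being described (the ondisk union and helper names are illustrative; the point is that a distinct ondisk type forces callers through the accessors so the compiler catches raw dereferences):

    /* Illustrative: a distinct ondisk type for rtsummary info words. */
    union xfs_suminfo_ondisk {
            __be32          raw;    /* e.g. if the ondisk format were defined big-endian */
    };

    static inline xfs_suminfo_t
    xfs_suminfo_get(const union xfs_suminfo_ondisk *infop)
    {
            return be32_to_cpu(infop->raw);
    }

    static inline void
    xfs_suminfo_set(union xfs_suminfo_ondisk *infop, xfs_suminfo_t val)
    {
            infop->raw = cpu_to_be32(val);
    }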
2023-10-18  xfs: create helpers for rtbitmap block/wordcount computations  [refactor-rtbitmap-macros-6.7_2023-10-19]  (Darrick J. Wong)
Create helper functions that compute the number of blocks or words necessary to store the rt bitmap. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: convert rt summary macros to helpers  (Darrick J. Wong)
Convert the realtime summary file macros to helper functions so that we can improve type checking. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: convert the rtbitmap block and bit macros to static inline functions  (Darrick J. Wong)
Replace these macros with typechecked helper functions. Eventually we're going to add more logic to the helpers and it'll be easier if we don't have to macro it up. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: convert do_div calls to xfs_rtb_to_rtx helper calls  (Darrick J. Wong)
Convert these calls to use the helpers, and clean up all these places where the same variable can have different units depending on where it is in the function. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
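The helper named in the subject presumably boils down to something like the following sketch (sb_rextsize is the rt extent size in fs blocks; do_div() divides its first argument in place and returns the remainder):

    /* Illustrative sketch: which rt extent does this rt block number fall in? */
    static inline xfs_rtxnum_t
    xfs_rtb_to_rtx(struct xfs_mount *mp, xfs_rtblock_t rtbno)
    {
            uint64_t        rtx = rtbno;

            do_div(rtx, mp->m_sb.sb_rextsize);      /* quotient = rt extent number */
            return rtx;
    }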
2023-10-17  xfs: create a helper to compute leftovers of realtime extents  (Darrick J. Wong)
Create a helper to compute the misalignment between a file extent (xfs_extlen_t) and a realtime extent. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: create a helper to convert rtextents to rtblocks  (Darrick J. Wong)
Create a helper to convert a realtime extent to a realtime block. Later on we'll change the helper to use bit shifts when possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
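A sketch of the conversion in the other direction (illustrative only; the later bit-shift optimization mentioned above would replace the multiply when the rt extent size is a power of two):

    /* Illustrative sketch: convert an rt extent number to its first rt block. */
    static inline xfs_rtblock_t
    xfs_rtx_to_rtb(struct xfs_mount *mp, xfs_rtxnum_t rtx)
    {
            /*
             * Plain multiply for now; a later change could shift by a cached
             * log2(sb_rextsize) when the rt extent size is a power of two.
             */
            return rtx * mp->m_sb.sb_rextsize;
    }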
2023-10-17  xfs: convert rt extent numbers to xfs_rtxnum_t  [clean-up-realtime-units-6.7_2023-10-19]  (Darrick J. Wong)
Further disambiguate the xfs_rtblock_t uses by creating a new type, xfs_rtxnum_t, to store the position of an extent within the realtime section, in units of rtextents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: rename xfs_verify_rtext to xfs_verify_rtbext  (Darrick J. Wong)
This helper function validates that a range of *blocks* in the realtime section is completely contained within the realtime section. It does /not/ validate ranges of *rtextents*. Rename the function to avoid suggesting that it does, and change the type of the @len parameter since xfs_rtblock_t is a position unit, not a length unit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: convert rt bitmap extent lengths to xfs_rtbxlen_t  (Darrick J. Wong)
XFS uses xfs_rtblock_t for many different uses, which makes it much more difficult to perform a unit analysis on the codebase. One of these (ab)uses is when we need to store the length of a free space extent as stored in the realtime bitmap. Because there can be up to 2^64 realtime extents in a filesystem, we need a new type that is larger than xfs_rtxlen_t for callers that are querying the bitmap directly. This means scrub and growfs. Create this type as "xfs_rtbxlen_t" and use it to store 64-bit rtx lengths. 'b' stands for 'bitmap' or 'big'; reader's choice. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-17  xfs: convert xfs_extlen_t to xfs_rtxlen_t in the rt allocator  (Darrick J. Wong)
In most of the filesystem, we use xfs_extlen_t to store the length of a file (or AG) space mapping in units of fs blocks. Unfortunately, the realtime allocator also uses it to store the length of a rt space mapping in units of rt extents. This is confusing, since one rt extent can consist of many fs blocks. Separate the two by introducing a new type (xfs_rtxlen_t) to store the length of a space mapping (in units of realtime extents) that would be found in a file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
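Taken together, the three entries above amount to a small set of unit types; sketched roughly as below (the underlying base types are an assumption, chosen to match the stated ranges, not a quote of the header):

    /* Illustrative summary of the rt unit types discussed above. */
    typedef uint64_t        xfs_rtxnum_t;   /* rt extent number: position within the rt section */
    typedef uint64_t        xfs_rtbxlen_t;  /* rt extent count as seen by the rt bitmap (up to 2^64) */
    typedef uint32_t        xfs_rtxlen_t;   /* length of a file space mapping, in rt extents */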
2023-10-17  xfs: move the xfs_rtbitmap.c declarations to xfs_rtbitmap.h  (Darrick J. Wong)
Move all the declarations for functionality in xfs_rtbitmap.c into a separate xfs_rtbitmap.h header file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2023-10-12  xfs: Remove duplicate include  (Jiapeng Chong)
./fs/xfs/scrub/xfile.c: xfs_format.h is included more than once. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6209 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
2023-09-12  xfs: only call xchk_stats_merge after validating scrub inputs  [fix-scrub-6.6_2023-09-12]  (Darrick J. Wong)
Harshit Mogalapalli slogged through several reports from our internal syzbot instance and observed that they all had a common stack trace:

 BUG: KASAN: user-memory-access in instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
 BUG: KASAN: user-memory-access in atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline]
 BUG: KASAN: user-memory-access in queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
 BUG: KASAN: user-memory-access in do_raw_spin_lock include/linux/spinlock.h:187 [inline]
 BUG: KASAN: user-memory-access in __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline]
 BUG: KASAN: user-memory-access in _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154
 Write of size 4 at addr 0000001dd87ee280 by task syz-executor365/1543
 CPU: 2 PID: 1543 Comm: syz-executor365 Not tainted 6.5.0-syzk #1
 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014
 Call Trace:
  <TASK>
  __dump_stack lib/dump_stack.c:88 [inline]
  dump_stack_lvl+0x83/0xb0 lib/dump_stack.c:106
  print_report+0x3f8/0x620 mm/kasan/report.c:478
  kasan_report+0xb0/0xe0 mm/kasan/report.c:588
  check_region_inline mm/kasan/generic.c:181 [inline]
  kasan_check_range+0x139/0x1e0 mm/kasan/generic.c:187
  instrument_atomic_read_write include/linux/instrumented.h:96 [inline]
  atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline]
  queued_spin_lock include/asm-generic/qspinlock.h:111 [inline]
  do_raw_spin_lock include/linux/spinlock.h:187 [inline]
  __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline]
  _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154
  spin_lock include/linux/spinlock.h:351 [inline]
  xchk_stats_merge_one.isra.1+0x39/0x650 fs/xfs/scrub/stats.c:191
  xchk_stats_merge+0x5f/0xe0 fs/xfs/scrub/stats.c:225
  xfs_scrub_metadata+0x252/0x14e0 fs/xfs/scrub/scrub.c:599
  xfs_ioc_scrub_metadata+0xc8/0x160 fs/xfs/xfs_ioctl.c:1646
  xfs_file_ioctl+0x3fd/0x1870 fs/xfs/xfs_ioctl.c:1955
  vfs_ioctl fs/ioctl.c:51 [inline]
  __do_sys_ioctl fs/ioctl.c:871 [inline]
  __se_sys_ioctl fs/ioctl.c:857 [inline]
  __x64_sys_ioctl+0x199/0x220 fs/ioctl.c:857
  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
  do_syscall_64+0x3e/0x90 arch/x86/entry/common.c:80
  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
 RIP: 0033:0x7ff155af753d
 Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48
 RSP: 002b:00007ffc006e2568 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff155af753d
 RDX: 00000000200000c0 RSI: 00000000c040583c RDI: 0000000000000003
 RBP: 00000000ffffffff R08: 00000000004010c0 R09: 00000000004010c0
 R10: 00000000004010c0 R11: 0000000000000246 R12: 0000000000400cb0
 R13: 00007ffc006e2670 R14: 0000000000000000 R15: 0000000000000000
  </TASK>

The root cause here is that xchk_stats_merge_one walks off the end of the xchk_scrub_stats.cs_stats array because it has been fed a garbage value in sm->sm_type. That occurs because I put the xchk_stats_merge in the wrong place -- it should have been after the last xchk_teardown call on our way out of xfs_scrub_metadata because we only call the teardown function if we called the setup function, and we don't call the setup functions if the inputs are obviously garbage. Thanks to Harshit for triaging the bug reports and bringing this to my attention.
Fixes: d7a74cad8f45 ("xfs: track usage statistics of online fsck") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-30  Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux  (Linus Torvalds)
Pull xfs updates from Chandan Babu:

 - Chandan Babu will be taking over as the XFS release manager. He has reviewed all the patches that are in this branch, though I'm signing the branch one last time since I'm still technically maintainer. :P

 - Create a maintainer entry profile for XFS in which we lay out the various roles that I have played for many years. Aside from release manager, the remaining roles are as yet unfilled.

 - Start merging online repair -- we now have in-memory pageable memory for staging btrees, a bunch of pending fixes, and we've started the process of refactoring the scrub support code to support more of repair. In particular, reaping of old blocks from damaged structures.

 - Scrub the realtime summary file.

 - Fix a bug where scrub's quota iteration only ever returned the root dquot. Oooops.

 - Fix some typos.

[ Pull request from Chandan Babu, but signed tag and description from Darrick Wong, thus the first person singular above is Darrick, not Chandan ]

* tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits)
  fs/xfs: Fix typos in comments
  xfs: fix dqiterate thinko
  xfs: don't check reflink iflag state when checking cow fork
  xfs: simplify returns in xchk_bmap
  xfs: rewrite xchk_inode_is_allocated to work properly
  xfs: hide xfs_inode_is_allocated in scrub common code
  xfs: fix agf_fllast when repairing an empty AGFL
  xfs: allow userspace to rebuild metadata structures
  xfs: clear pagf_agflreset when repairing the AGFL
  xfs: allow the user to cancel repairs before we start writing
  xfs: don't complain about unfixed metadata when repairs were injected
  xfs: implement online scrubbing of rtsummary info
  xfs: always rescan allegedly healthy per-ag metadata after repair
  xfs: move the realtime summary file scrubber to a separate source file
  xfs: wrap ilock/iunlock operations on sc->ip
  xfs: get our own reference to inodes that we want to scrub
  xfs: track usage statistics of online fsck
  xfs: improve xfarray quicksort pivot
  xfs: create scaffolding for creating debugfs entries
  xfs: cache pages used for xfarray quicksort convergence
  ...
2023-08-10  xfs: don't check reflink iflag state when checking cow fork  [scrub-bmap-fixes-6.6_2023-08-10]  (Darrick J. Wong)
Any inode on a reflink filesystem can have a cow fork, even if the inode does not have the reflink iflag set. This happens either because the inode once had the iflag set but does not now, because we don't free the incore cow fork until the icache deletes the inode; or because we're running in alwayscow mode. Either way, we can collapse both of the xfs_is_reflink_inode calls into one, and change it to xfs_has_reflink, now that the bmap checker will return ENOENT if there is no pointer to the incore fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: simplify returns in xchk_bmap  (Darrick J. Wong)
Remove the pointless goto and return code in xchk_bmap, since it only serves to obscure what's going on in the function. Instead, return whichever error code is appropriate there. For nonexistent forks, this should have been ENOENT. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: rewrite xchk_inode_is_allocated to work properly  (Darrick J. Wong)
Back in the mists of time[1], I proposed this function to assist the inode btree scrubbers in checking the inode btree contents against the allocation state of the inode records. The original version performed a direct lookup in the inode cache and returned the allocation status if the cached inode hadn't been reused and wasn't in an intermediate state. Brian thought it would be better to use the usual iget/irele mechanisms, so that was changed for the final version. Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has one user and clutters up the regular iget path, which makes it hard to reason about how it actually works. Worse yet, the inode inactivation series silently broke it because iget won't return inodes that are anywhere in the inactivation machinery, even though the caller is already required to prevent inode allocation and freeing. Inodes in the inactivation machinery are still allocated, but the current code's interactions with the iget code prevent us from being able to say that. Now that I understand the inode lifecycle better than I did in early 2017, I realize that as long as the cached inode hasn't been reused and isn't actively being reclaimed, it's safe to access the i_mode field (with the AGI, rcu, and i_flags locks held), and we don't need to worry about the inode being freed out from under us. Therefore, port the original version to modern code structure, which fixes the brokenness w.r.t. inactivation. In the next patch we'll remove IGET_INCORE since it's no longer necessary. [1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: hide xfs_inode_is_allocated in scrub common code  (Darrick J. Wong)
This function is only used by online fsck, so let's move it there. In the next patch, we'll fix it to work properly and to require that the caller hold the AGI buffer locked. No major changes aside from adjusting the signature a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: fix agf_fllast when repairing an empty AGFL  [repair-agfl-fixes-6.6_2023-08-10]  (Darrick J. Wong)
xfs/139 with parent pointers enabled occasionally pops up a corruption message when online fsck force-rebuild repairs an AGFL:

 XFS (sde): Metadata corruption detected at xfs_agf_verify+0x11e/0x220 [xfs], xfs_agf block 0x9e0001
 XFS (sde): Unmount and run xfs_repair
 XFS (sde): First 128 bytes of corrupted metadata buffer:
 00000000: 58 41 47 46 00 00 00 01 00 00 00 4f 00 00 40 00 XAGF.......O..@.
 00000010: 00 00 00 01 00 00 00 02 00 00 00 05 00 00 00 01 ................
 00000020: 00 00 00 01 00 00 00 01 00 00 00 00 ff ff ff ff ................
 00000030: 00 00 00 00 00 00 00 05 00 00 00 05 00 00 00 00 ................
 00000040: 91 2e 6f b1 ed 61 4b 4d 8c 9b 6e 87 08 bb f6 36 ..o..aKM..n....6
 00000050: 00 00 00 01 00 00 00 01 00 00 00 06 00 00 00 01 ................
 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................

The root cause of this failure is that prior to the repair, there were zero blocks in the AGFL. This scenario is set up by the test case, since it formats with 64MB AGs and tries to ENOSPC the whole filesystem. In this case of flcount==0, we reset fllast to -1U, which then trips the write verifier's check that fllast is less than xfs_agfl_size(). Correct this code to set fllast to the last possible slot in the AGFL when flcount is zero, which mirrors the behavior of xfs_repair phase5 when it has to create a totally empty AGFL. Fixes: 0e93d3f43ec7 ("xfs: repair the AGFL") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
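In code terms, the fix described above presumably reduces to something like the sketch below (variable names are illustrative; xfs_agfl_size() returns the number of slots in the AGFL):

    /* Illustrative sketch of the fllast fix described above. */
    if (flcount == 0)
            /* Empty AGFL: fllast points at the last possible slot... */
            agf->agf_fllast = cpu_to_be32(xfs_agfl_size(mp) - 1);
    else
            /* ...otherwise at the last slot in use (repair starts the list at slot 0). */
            agf->agf_fllast = cpu_to_be32(flcount - 1);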
2023-08-10  xfs: clear pagf_agflreset when repairing the AGFL  (Darrick J. Wong)
Clear the pagf_agflreset flag when we're repairing the AGFL because we fix all the same padding problems that xfs_agfl_reset does. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: allow userspace to rebuild metadata structures  [repair-force-rebuild-6.6_2023-08-10]  (Darrick J. Wong)
Add a new (superuser-only) flag to the online metadata repair ioctl to force it to rebuild structures, even if they're not broken. We will use this to move metadata structures out of the way during a free space defragmentation operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: don't complain about unfixed metadata when repairs were injected  (Darrick J. Wong)
While debugging other parts of online repair, I noticed that if someone injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of metadata, and the metadata repair fails, we'll log a message about uncorrected errors in the filesystem. This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT and we're only doing the repair because the error injection knob is set. Repair functions are allowed to abort the entire operation at any point before committing new metadata, in which case the piece of metadata is in the same state as it was before. Therefore, the log message should be gated on the results of the scrub. Refactor the predicate and rearrange the code flow to make this happen. Note: If the repair function errors out after it commits the new metadata, the transaction cancellation will shut down the filesystem, which is an obvious sign of corrupt metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: allow the user to cancel repairs before we start writing  [repair-tweaks-6.6_2023-08-10]  (Darrick J. Wong)
All online repair functions have the same structure: walk filesystem metadata structures gathering enough data to rebuild the structure, stage a new copy, and then commit the new copy. The gathering steps do not write anything to disk, so they are peppered with xchk_should_terminate calls to avoid softlockup warnings and to provide an opportunity to abort the repair (by killing xfs_scrub). However, it's not clear in the code base when is the last chance to abort cleanly without having to undo a bunch of structure. Therefore, add one more call to xchk_should_terminate (along with a comment) providing the sysadmin with the ability to abort before it's too late and to make it clear in the source code when it's no longer convenient or safe to abort a repair. As there are only four repair functions right now, this patch exists more to establish a precedent for subsequent additions than to deliver practical functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: always rescan allegedly healthy per-ag metadata after repair  (Darrick J. Wong)
After an online repair function runs for a per-AG metadata structure, sc->sick_mask is supposed to reflect the per-AG metadata that the repair function fixed. Our next move is to re-check the metadata to assess the completeness of our repair, so we don't want the rebuilt structure to be excluded from the rescan just because the health system previously logged a problem with the data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: implement online scrubbing of rtsummary info  [scrub-rtsummary-6.6_2023-08-10]  (Darrick J. Wong)
Finish the realtime summary scrubber by adding the functions we need to compute a fresh copy of the rtsummary info and comparing it to the copy on disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: move the realtime summary file scrubber to a separate source file  (Darrick J. Wong)
Move the realtime summary file checking code to a separate file in preparation to actually implement it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: wrap ilock/iunlock operations on sc->ip  (Darrick J. Wong)
Scrub tracks the resources that it's holding onto in the xfs_scrub structure. This includes the inode being checked (if applicable) and the inode lock state of that inode. Replace the open-coded structure manipulation with a trivial helper to eliminate sources of error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: get our own reference to inodes that we want to scrub  (Darrick J. Wong)
When we want to scrub a file, get our own reference to the inode unconditionally. This will make disposal rules simpler in the long run. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: track usage statistics of online fsck  [scrub-usage-stats-6.6_2023-08-10]  (Darrick J. Wong)
Track the usage, outcomes, and run times of the online fsck code, and report these values via debugfs. The columns in the file are:

 * scrubber name
 * number of scrub invocations
 * clean objects found
 * corruptions found
 * optimizations found
 * cross referencing failures
 * inconsistencies found during cross referencing
 * incomplete scrubs
 * warnings
 * number of times scrub had to retry
 * cumulative amount of time spent scrubbing (microseconds)
 * number of repair invocations
 * successfully repaired objects
 * cumulative amount of time spent repairing (microseconds)

Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: improve xfarray quicksort pivot  [big-array-6.6_2023-08-10]  (Darrick J. Wong)
Now that we have the means to do insertion sorts of small in-memory subsets of an xfarray, use it to improve the quicksort pivot algorithm by reading 7 records into memory and finding the median of that. This should prevent bad partitioning when a[lo] and a[hi] end up next to each other in the final sort, which can happen when sorting for cntbt repair when the free space is extremely fragmented (e.g. generic/176). This doesn't speed up the average quicksort run by much, but it will (hopefully) avoid the quadratic time collapse for which quicksort is famous. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
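As a rough illustration of the strategy (generic C, not the xfarray implementation, which reads records through the xfile): sample a handful of records, insertion-sort the sample in memory, and partition around the middle element.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: choose a quicksort pivot as the median of 7 samples. */
    #define PIVOT_SAMPLES   7

    static uint64_t
    pick_pivot(const uint64_t *records, size_t lo, size_t hi)
    {
            /* Assumes hi - lo + 1 >= PIVOT_SAMPLES; smaller runs use insertion sort anyway. */
            uint64_t        samples[PIVOT_SAMPLES];
            size_t          step = (hi - lo) / (PIVOT_SAMPLES - 1);
            int             i, j;

            /* Sample evenly across [lo, hi] so presorted input doesn't yield a bad pivot. */
            for (i = 0; i < PIVOT_SAMPLES; i++)
                    samples[i] = records[lo + i * step];

            /* Insertion sort of 7 items is cheap; the median ends up in the middle slot. */
            for (i = 1; i < PIVOT_SAMPLES; i++) {
                    uint64_t        v = samples[i];

                    for (j = i - 1; j >= 0 && samples[j] > v; j--)
                            samples[j + 1] = samples[j];
                    samples[j + 1] = v;
            }
            return samples[PIVOT_SAMPLES / 2];
    }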
2023-08-10  xfs: cache pages used for xfarray quicksort convergence  (Darrick J. Wong)
After quicksort picks a pivot item for a particular subsort, it walks the records in that subset from the outside in, rearranging them so that every record less than the pivot comes before it, and every record greater than the pivot comes after it. This scan has a lot of locality, so we can speed it up quite a bit by grabbing the xfile backing page and holding onto it as long as we possibly can. Doing so reduces the runtime by another 5% on the author's computer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: speed up xfarray sort by sorting xfile page contents directly  (Darrick J. Wong)
If all the records in an xfarray subset live within the same memory page, we can short-circuit even more quicksort recursion by mapping that page into the local CPU and using the kernel's heapsort function to sort the subset. On the author's computer, this reduces the runtime by another 15% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: teach xfile to pass back direct-map pages to caller  (Darrick J. Wong)
Certain xfile array operations (such as sorting) can be sped up quite a bit by allowing xfile users to grab a page to bulk-read the records contained within it. Create helper methods to facilitate this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: convert xfarray insertion sort to heapsort using scratchpad memory  (Darrick J. Wong)
In the previous patch, we created a very basic quicksort implementation for xfile arrays. While the use of an alternate sorting algorithm to avoid quicksort recursion on very small subsets reduces the runtime modestly, we could do better than a load and store-heavy insertion sort, particularly since each load and store requires a page mapping lookup in the xfile. For a small increase in kernel memory requirements, we could instead bulk load the xfarray records into memory, use the kernel's existing heapsort implementation to sort the records, and bulk store the memory buffer back into the xfile. On the author's computer, this reduces the runtime by about 5% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
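Conceptually the bulk load / heapsort / bulk store cycle looks like the sketch below; xfarray_load_batch() and xfarray_store_batch() are hypothetical stand-ins for whatever bulk accessors the real code uses, while sort() is the kernel's lib/sort.c heapsort:

    /*
     * Illustrative only: sort one stretch of an xfarray by loading it into a
     * preallocated scratchpad, heapsorting in memory, and writing it back out.
     */
    static int
    xfarray_sort_scratch(struct xfarray *array, void *scratch, uint64_t lo,
                    uint64_t nr, size_t recsize,
                    int (*cmp)(const void *a, const void *b))
    {
            int     error;

            error = xfarray_load_batch(array, lo, nr, scratch);     /* bulk read */
            if (error)
                    return error;

            sort(scratch, nr, recsize, cmp, NULL);                  /* lib/sort heapsort */

            return xfarray_store_batch(array, lo, nr, scratch);     /* bulk write */
    }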
2023-08-10  xfs: enable sorting of xfile-backed arrays  (Darrick J. Wong)
The btree bulk loading code requires that records be provided in the correct record sort order for the given btree type. In general, repair code cannot be required to collect records in order, and it is not feasible to insert new records in the middle of an array to maintain sort order. Implement a sorting algorithm so that we can sort the records just prior to bulk loading. In principle, an xfarray could consume many gigabytes of memory and its backing pages can be sent out to disk at any time. This means that we cannot map the entire array into memory at once, so we must find a way to divide the work into smaller portions (e.g. a page) that /can/ be mapped into memory. Quicksort seems like a reasonable fit for this purpose, since it uses a divide and conquer strategy to keep its average runtime at O(n log n). The solution presented here is a port of the glibc implementation, which itself is derived from the median-of-three and tail call recursion strategies outlined by Sedgewick. Subsequent patches will optimize the implementation further by utilizing the kernel's heapsort on directly-mapped memory whenever possible, and improving the quicksort pivot selection algorithm to try to avoid O(n^2) collapses. Note: The sorting functionality gets its own patch because the basic big array mechanisms were plenty for a single code patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: create a big array data structure  (Darrick J. Wong)
Create a simple 'big array' data structure for storage of fixed-size metadata records that will be used to reconstruct a btree index. For repair operations, the most important operations are append, iterate, and sort. Earlier implementations of the big array used linked lists and suffered from severe problems -- pinning all records in kernel memory was not a good idea and frequently led to OOM situations; random access was very inefficient; and record overhead for the lists was unacceptably high at 40-60%. Therefore, the big memory array relies on the 'xfile' abstraction, which creates a memfd file and stores the records in page cache pages. Since the memfd is created in tmpfs, the memory pages can be pushed out to disk if necessary and we have a built-in usage limit of 50% of physical memory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
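The memfd-backed storage idea can be demonstrated outside the kernel with a short userspace analogy (only an analogy; the in-kernel xfile code is structured quite differently):

    /* Userspace analogy: fixed-size records backed by an anonymous tmpfs file. */
    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int             fd = memfd_create("xfarray-demo", 0);
            uint64_t        rec;

            if (fd < 0)
                    return 1;

            /* Append records by index; the kernel may page them out under memory pressure. */
            for (rec = 0; rec < 1000; rec++)
                    pwrite(fd, &rec, sizeof(rec), rec * sizeof(rec));

            /* Random access by index, without pinning the whole array in RAM. */
            pread(fd, &rec, sizeof(rec), 500 * sizeof(rec));
            printf("record 500 = %llu\n", (unsigned long long)rec);

            close(fd);
            return 0;
    }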
2023-08-10  xfs: use per-AG bitmaps to reap unused AG metadata blocks during repair  [repair-reap-fixes-6.6_2023-08-10]  (Darrick J. Wong)
The AGFL repair code uses a series of bitmaps to figure out where there are OWN_AG blocks that are not claimed by the free space and rmap btrees. These blocks become the new AGFL, and any overflow is reaped. The bitmaps currently track xfs_fsblock_t even though we already know the AG number. In the last patch, we introduced a new bitmap "type" for tracking xfs_agblock_t extents. Port the reaping code and the AGFL repair to use this new type, which makes it very obvious what we're tracking. This also eliminates a bunch of unnecessary agblock <-> fsblock conversions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: reap large AG metadata extents when possible  (Darrick J. Wong)
When we're freeing extents that have been set in a bitmap, break the bitmap extent into multiple sub-extents organized by fate, and reap the extents. This enables us to dispose of old resources more efficiently than doing them block by block. While we're at it, rename the reaping functions to make it clear that they're reaping per-AG extents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: allow scanning ranges of the buffer cache for live buffers  (Darrick J. Wong)
After an online repair, we need to invalidate buffers representing the blocks from the old metadata that we're replacing. It's possible that parts of a tree that were previously cached in memory are no longer accessible due to media failure or other corruption on interior nodes, so repair figures out the old blocks from the reverse mapping data and scans the buffer cache directly. In other words, online fsck needs to find all the live (i.e. non-stale) buffers for a range of fsblocks so that it can invalidate them. Unfortunately, the current buffer cache code triggers asserts if the rhashtable lookup finds a non-stale buffer of a different length than the key we searched for. For regular operation this is desirable, but for this repair procedure, we don't care since we're going to forcibly stale the buffer anyway. Add an internal lookup flag to avoid the assert. Skip buffers that are already XBF_STALE. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: rearrange xrep_reap_block to make future code flow easier  (Darrick J. Wong)
Rearrange the logic inside xrep_reap_block to make it more obvious that crosslinked metadata blocks are handled differently. Add a couple of tracepoints so that we can tell what's going on at the end of a btree rebuild operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: use deferred frees to reap old btree blocks  (Darrick J. Wong)
Use deferred frees (EFIs) to reap the blocks of a btree that we just replaced. This helps us to shrink the window in which those old blocks could be lost due to a system crash, though we try to flush the EFIs every few hundred blocks so that we don't also overflow the transaction reservations during and after we commit the new btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: only allow reaping of per-AG blocks in xrep_reap_extents  (Darrick J. Wong)
Now that we've refactored btree cursors to require the caller to pass in a perag structure, there are numerous problems in xrep_reap_extents if it's being called to reap extents for an inode metadata repair. We don't have any repair functions that can do that, so drop the support for now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: only invalidate blocks if we're going to free them  (Darrick J. Wong)
When we're discarding old btree blocks after a repair, only invalidate the buffers for the ones that we're freeing -- if the metadata was crosslinked with another data structure, we don't want to touch it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: move the post-repair block reaping code to a separate file  (Darrick J. Wong)
Reaping blocks after a repair is a complicated affair involving a lot of rmap btree lookups and figuring out if we're going to unmap or free old metadata blocks that might be crosslinked. Eventually, we will need to be able to reap per-AG metadata blocks, bmbt blocks from inode forks, garbage CoW staging extents, and (even later) blocks from btrees rooted in inodes. This results in a lot of reaping code, so we might as well split that off while it's easy. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2023-08-10  xfs: cull repair code that will never get used  (Darrick J. Wong)
These two functions date from the era when I thought that we could rebuild btrees by creating an alternate root and adding records one by one. In other words, they predate the btree bulk loader. They're not necessary now, so remove them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>