Age | Commit message (Collapse) | Author |
|
If scrub finds that everything is ok with the filesystem, we need a way
to tell the health tracking that it can let go of indirect health flags,
since indirect flags only mean that at some point in the past we lost
some context.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
If an unhealthy inode gets inactivated, remember this fact in the
per-fs health summary.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter XFS_IS_CORRUPT failures, we should report that to
the health monitoring system for later reporting.
I started with this semantic patch and massaged everything until it
built:
@@
expression mp, test;
@@
- if (XFS_IS_CORRUPT(mp, test)) return -EFSCORRUPTED;
+ if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; }
@@
expression mp, test;
identifier label, error;
@@
- if (XFS_IS_CORRUPT(mp, test)) { error = -EFSCORRUPTED; goto label; }
+ if (XFS_IS_CORRUPT(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; }
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt btree blocks, we should report that to the
health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Split the setting of the sick and checked masks into separate functions
as part of preparing to add the ability for regular runtime fs code
(i.e. not scrub) to mark metadata structures sick when corruptions are
found. Improve the documentation of libxfs' requirements for helper
behavior.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Fix the nlinks now too.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create the necessary hooks in the file create/unlink/rename code so that
our live nlink scrub code can stay up to date with the rest of the
filesystem. This will be the means to keep our shadow link count
information up to date while the scan runs in real time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create the necessary scrub code to walk the filesystem's directory tree
so that we can compute file link counts. Similar to quotacheck, we
create an incore shadow array of link count information and then we walk
the filesystem a second time to compare the link counts. We need live
updates to keep the information up to date during the lengthy scan, so
this scrubber remains disabled until the next patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Currently, online scrub reuses the xfs_readdir code to walk every entry
in a directory. This isn't awesome for performance, since we end up
cycling the directory ILOCK needlessly and coding around the particular
quirks of the VFS dir_context interface.
Create a streamlined version of readdir that keeps the ILOCK (since the
walk function isn't going to copy stuff to userspace), skips a whole lot
of directory walk cursor checks (since we start at 0 and walk to the
end) and has a sane way to return error codes.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the shadow quota counters that live quotacheck creates to reset the
incore dquot counters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
While running xfs/804 (quota repairs racing with fsstress), I observed a
filesystem shutdown in the primary sb write verifier:
run fstests xfs/804 at 2022-05-23 18:43:48
XFS (sda4): Mounting V5 Filesystem
XFS (sda4): Ending clean mount
XFS (sda4): Quotacheck needed: Please wait.
XFS (sda4): Quotacheck: Done.
XFS (sda4): EXPERIMENTAL online scrub feature in use. Use at your own risk!
XFS (sda4): SB ifree sanity check failed 0xb5 > 0x80
XFS (sda4): Metadata corruption detected at xfs_sb_write_verify+0x5e/0x100 [xfs], xfs_sb block 0x0
XFS (sda4): Unmount and run xfs_repair
The "SB ifree sanity check failed" message was a debugging printk that I
added to the kernel; observe that 0xb5 - 0x80 = 53, which is less than
one inode chunk.
I traced this to the xfs_log_sb calls from the online quota repair code,
which tries to clear the CHKD flags from the superblock to force a
mount-time quotacheck if the repair fails. On a V5 filesystem,
xfs_log_sb updates the ondisk sb summary counters with the current
contents of the percpu counters. This is done without quiescing other
writer threads, which means it could be racing with a thread that has
updated icount and is about to update ifree.
If the other write thread had incremented ifree before updating icount,
the repair thread will write icount > ifree into the logged update. If
the AIL writes the logged superblock back to disk before anyone else
fixes this siutation, this will lead to a write verifier failure, which
causes a filesystem shutdown.
Resolve this problem by updating the quota flags and calling
xfs_sb_to_disk directly, which does not touch the percpu counters.
While we're at it, we can elide the entire update if the selected qflags
aren't set.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a shadow dqtrx system in the quotacheck code that hooks the
regular dquot counter update code. This will be the means to keep our
copy of the dquot counters up to date while the scan runs in real time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a new trio of scrub functions to check quota counters. While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters. We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
This patch implements a live file scanner for online fsck functions that
require the ability to walk a filesystem to gather metadata records and
stay informed about metadata changes to files that have already been
visited.
The iscan structure consists of two inode number cursors: one to track
which inode we want to visit next, and a second one to track which
inodes have already been visited. This second cursor is key to
capturing live updates to files previously scanned while the main thread
continues scanning -- any inode greater than this value hasn't been
scanned and can go on its way; any other update must be incorporated
into the collected data. It is critical for the scanning thraad to hold
exclusive access on the inode until after marking the inode visited.
This new code is split out as a separate patch from its initial user for
the sake of enabling the author to move patches around his tree with
ease. The intended usage model for this code is roughly:
xchk_iscan_start(iscan, 0, 0);
while ((error = xchk_iscan_iter(sc, iscan, &ip)) == 1) {
xfs_ilock(ip, ...);
/* capture inode metadata */
xchk_iscan_mark_visited(iscan, ip);
xfs_iunlock(ip, ...);
xfs_irele(ip);
}
xchk_iscan_stop(iscan);
if (error)
return error;
Hook functions for live updates can then do:
if (xchk_iscan_want_live_update(...))
/* update the captured inode metadata */
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Fix anything that causes the quota verifiers to fail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rebuild the realtime bitmap from the realtime rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add a helper function to repair the core and forks of a metadata inode,
so that we can get move onto the task of repairing higher level metadata
that lives in an inode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file. There's not a lot we can
do with the COW fork -- the ondisk metadata record only that the COW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.
Actually, this is even worse -- we can't touch written extents, because
those map space that are actively under writeback, and there's not much
to do with delalloc reservations. Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
There are a couple of conditions that userspace can set to force repairs
of metadata. These really belong in the repair code and not open-coded
into the check code, so refactor them into a helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Reintroduce to xrep_reap_extents the ability to reap extents from any
AG. We dropped this before because it was buggy, but in the next patch
we will gain the ability to reap old bmap btrees, which can have blocks
in any AG. To do this, we require that sc->sa is uninitialized, so that
we can use it to hold all the per-AG context for a given extent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Building off the rmap scanner that we added in the previous patch, we
can now find block 0 and try to use the information contained inside of
it to guess the mode of an inode if it's totally improper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
If an inode is so badly damaged that it cannot be loaded into the cache,
fix the ondisk metadata and try again. If there /is/ a cached inode,
fix any problems and apply any optimizations that can be solved incore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Soon, we will be adding the ability to repair inodes. Inode resource
usage is tracked in quota, which means that if we think we might have to
repair a file, we ought to attach dquots from the start. Do this before
we take the file's ILOCK, though we don't require success here because
quota itself could also be in need of repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Don't compile the quota helper functions if quota isn't being built into
the XFS module.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Reconstruct the refcount data from the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rebuild the free space btrees from the gaps in the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Clear the pagf_agflreset flag when we're repairing the AGFL because we
fix all the same padding problems that xfs_agfl_reset does.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken. We will use
this to move metadata structures out of the way during a free space
defragmentation operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
While debugging other parts of online repair, I noticed that if someone
injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of
metadata, and the metadata repair fails, we'll log a message about
uncorrected errors in the filesystem.
This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT
and we're only doing the repair because the error injection knob is set.
Repair functions are allowed to abort the entire operation at any point
before committing new metadata, in which case the piece of metadata is
in the same state as it was before. Therefore, the log message should
be gated on the results of the scrub. Refactor the predicate and
rearrange the code flow to make this happen.
Note: If the repair function errors out after it commits the new
metadata, the transaction cancellation will shut down the filesystem,
which is an obvious sign of corrupt metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
All online repair functions have the same structure: walk filesystem
metadata structures gathering enough data to rebuild the structure,
stage a new copy, and then commit the new copy.
The gathering steps do not write anything to disk, so they are peppered
with xchk_should_terminate calls to avoid softlockup warnings and to
provide an opportunity to abort the repair (by killing xfs_scrub).
However, it's not clear in the code base when is the last chance to
abort cleanly without having to undo a bunch of structure.
Therefore, add one more call to xchk_should_terminate (along with a
comment) providing the sysadmin with the ability to abort before it's
too late and to make it clear in the source code when it's no longer
convenient or safe to abort a repair. As there are only four repair
functions right now, this patch exists more to establish a precedent for
subsequent additions than to deliver practical functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed. Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Move the realtime summary file checking code to a separate file in
preparation to actually implement it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Scrub tracks the resources that it's holding onto in the xfs_scrub
structure. This includes the inode being checked (if applicable) and
the inode lock state of that inode. Replace the open-coded structure
manipulation with a trivial helper to eliminate sources of error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we want to scrub a file, get our own reference to the inode
unconditionally. This will make disposal rules simpler in the long run.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Now that we have the means to do insertion sorts of small in-memory
subsets of an xfarray, use it to improve the quicksort pivot algorithm
by reading 7 records into memory and finding the median of that. This
should prevent bad partitioning when a[lo] and a[hi] end up next to each
other in the final sort, which can happen when sorting for cntbt repair
when the free space is extremely fragmented (e.g. generic/176).
This doesn't speed up the average quicksort run by much, but it will
(hopefully) avoid the quadratic time collapse for which quicksort is
famous.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After quicksort picks a pivot item for a particular subsort, it walks
the records in that subset from the outside in, rearranging them so that
every record less than the pivot comes before it, and every record
greater than the pivot comes after it. This scan has a lot of locality,
so we can speed it up quite a bit by grabbing the xfile backing page and
holding onto it as long as we possibly can. Doing so reduces the
runtime by another 5% on the author's computer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
If all the records in an xfarray subset live within the same memory
page, we can short-circuit even more quicksort recursion by mapping that
page into the local CPU and using the kernel's heapsort function to sort
the subset. On the author's computer, this reduces the runtime by
another 15% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Certain xfile array operations (such as sorting) can be sped up quite a
bit by allowing xfile users to grab a page to bulk-read the records
contained within it. Create helper methods to facilitate this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
In the previous patch, we created a very basic quicksort implementation
for xfile arrays. While the use of an alternate sorting algorithm to
avoid quicksort recursion on very small subsets reduces the runtime
modestly, we could do better than a load and store-heavy insertion sort,
particularly since each load and store requires a page mapping lookup in
the xfile.
For a small increase in kernel memory requirements, we could instead
bulk load the xfarray records into memory, use the kernel's existing
heapsort implementation to sort the records, and bulk store the memory
buffer back into the xfile. On the author's computer, this reduces the
runtime by about 5% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The btree bulk loading code requires that records be provided in the
correct record sort order for the given btree type. In general, repair
code cannot be required to collect records in order, and it is not
feasible to insert new records in the middle of an array to maintain
sort order.
Implement a sorting algorithm so that we can sort the records just prior
to bulk loading. In principle, an xfarray could consume many gigabytes
of memory and its backing pages can be sent out to disk at any time.
This means that we cannot map the entire array into memory at once, so
we must find a way to divide the work into smaller portions (e.g. a
page) that /can/ be mapped into memory.
Quicksort seems like a reasonable fit for this purpose, since it uses a
divide and conquer strategy to keep its average runtime logarithmic.
The solution presented here is a port of the glibc implementation, which
itself is derived from the median-of-three and tail call recursion
strategies outlined by Sedgwick.
Subsequent patches will optimize the implementation further by utilizing
the kernel's heapsort on directly-mapped memory whenever possible, and
improving the quicksort pivot selection algorithm to try to avoid O(n^2)
collapses.
Note: The sorting functionality gets its own patch because the basic big
array mechanisms were plenty for a single code patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We need to log EFIs for every extent that we allocate for the purpose of
staging a new btree so that if we fail then the blocks will be freed
during log recovery. Add a function to relog the EFIs, so that repair
can relog them all every time it creates a new btree block, which will
help us to avoid pinning the log tail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add some debug knobs so that we can control the leaf and node block
slack when rebuilding btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The AGFL repair code uses a series of bitmaps to figure out where there
are OWN_AG blocks that are not claimed by the free space and rmap
btrees. These blocks become the new AGFL, and any overflow is reaped.
The bitmaps current track xfs_fsblock_t even though we already know the
AG number.
In the last patch, we introduced a new bitmap "type" for tracking
xfs_agblock_t extents. Port the reaping code and the AGFL repair to use
this new type, which makes it very obvious what we're tracking. This
also eliminates a bunch of unnecessary agblock <-> fsblock conversions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents. This enables us to dispose of old resources more efficiently
than doing them block by block.
While we're at it, rename the reaping functions to make it clear that
they're reaping per-AG extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|