Age | Commit message (Collapse) | Author |
|
Rebuild the realtime bitmap from the realtime rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add a helper function to repair the core and forks of a metadata inode,
so that we can get move onto the task of repairing higher level metadata
that lives in an inode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Try to repair errors that we see in file CoW forks so that we don't do
stupid things like remap garbage into a file. There's not a lot we can
do with the COW fork -- the ondisk metadata record only that the COW
staging extents are owned by the refcount btree, which effectively means
that we can't reconstruct this incore structure from scratch.
Actually, this is even worse -- we can't touch written extents, because
those map space that are actively under writeback, and there's not much
to do with delalloc reservations. Hence we can only detect crosslinked
unwritten extents and fix them by punching out the problematic parts and
replacing them with delalloc extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
There are a couple of conditions that userspace can set to force repairs
of metadata. These really belong in the repair code and not open-coded
into the check code, so refactor them into a helper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Reintroduce to xrep_reap_extents the ability to reap extents from any
AG. We dropped this before because it was buggy, but in the next patch
we will gain the ability to reap old bmap btrees, which can have blocks
in any AG. To do this, we require that sc->sa is uninitialized, so that
we can use it to hold all the per-AG context for a given extent.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Building off the rmap scanner that we added in the previous patch, we
can now find block 0 and try to use the information contained inside of
it to guess the mode of an inode if it's totally improper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
If an inode is so badly damaged that it cannot be loaded into the cache,
fix the ondisk metadata and try again. If there /is/ a cached inode,
fix any problems and apply any optimizations that can be solved incore.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Soon, we will be adding the ability to repair inodes. Inode resource
usage is tracked in quota, which means that if we think we might have to
repair a file, we ought to attach dquots from the start. Do this before
we take the file's ILOCK, though we don't require success here because
quota itself could also be in need of repair.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Don't compile the quota helper functions if quota isn't being built into
the XFS module.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Reconstruct the refcount data from the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rebuild the free space btrees from the gaps in the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Clear the pagf_agflreset flag when we're repairing the AGFL because we
fix all the same padding problems that xfs_agfl_reset does.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add a new (superuser-only) flag to the online metadata repair ioctl to
force it to rebuild structures, even if they're not broken. We will use
this to move metadata structures out of the way during a free space
defragmentation operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
While debugging other parts of online repair, I noticed that if someone
injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of
metadata, and the metadata repair fails, we'll log a message about
uncorrected errors in the filesystem.
This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT
and we're only doing the repair because the error injection knob is set.
Repair functions are allowed to abort the entire operation at any point
before committing new metadata, in which case the piece of metadata is
in the same state as it was before. Therefore, the log message should
be gated on the results of the scrub. Refactor the predicate and
rearrange the code flow to make this happen.
Note: If the repair function errors out after it commits the new
metadata, the transaction cancellation will shut down the filesystem,
which is an obvious sign of corrupt metadata.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
All online repair functions have the same structure: walk filesystem
metadata structures gathering enough data to rebuild the structure,
stage a new copy, and then commit the new copy.
The gathering steps do not write anything to disk, so they are peppered
with xchk_should_terminate calls to avoid softlockup warnings and to
provide an opportunity to abort the repair (by killing xfs_scrub).
However, it's not clear in the code base when is the last chance to
abort cleanly without having to undo a bunch of structure.
Therefore, add one more call to xchk_should_terminate (along with a
comment) providing the sysadmin with the ability to abort before it's
too late and to make it clear in the source code when it's no longer
convenient or safe to abort a repair. As there are only four repair
functions right now, this patch exists more to establish a precedent for
subsequent additions than to deliver practical functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed. Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Move the realtime summary file checking code to a separate file in
preparation to actually implement it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Scrub tracks the resources that it's holding onto in the xfs_scrub
structure. This includes the inode being checked (if applicable) and
the inode lock state of that inode. Replace the open-coded structure
manipulation with a trivial helper to eliminate sources of error.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we want to scrub a file, get our own reference to the inode
unconditionally. This will make disposal rules simpler in the long run.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Now that we have the means to do insertion sorts of small in-memory
subsets of an xfarray, use it to improve the quicksort pivot algorithm
by reading 7 records into memory and finding the median of that. This
should prevent bad partitioning when a[lo] and a[hi] end up next to each
other in the final sort, which can happen when sorting for cntbt repair
when the free space is extremely fragmented (e.g. generic/176).
This doesn't speed up the average quicksort run by much, but it will
(hopefully) avoid the quadratic time collapse for which quicksort is
famous.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After quicksort picks a pivot item for a particular subsort, it walks
the records in that subset from the outside in, rearranging them so that
every record less than the pivot comes before it, and every record
greater than the pivot comes after it. This scan has a lot of locality,
so we can speed it up quite a bit by grabbing the xfile backing page and
holding onto it as long as we possibly can. Doing so reduces the
runtime by another 5% on the author's computer.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
If all the records in an xfarray subset live within the same memory
page, we can short-circuit even more quicksort recursion by mapping that
page into the local CPU and using the kernel's heapsort function to sort
the subset. On the author's computer, this reduces the runtime by
another 15% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Certain xfile array operations (such as sorting) can be sped up quite a
bit by allowing xfile users to grab a page to bulk-read the records
contained within it. Create helper methods to facilitate this.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
In the previous patch, we created a very basic quicksort implementation
for xfile arrays. While the use of an alternate sorting algorithm to
avoid quicksort recursion on very small subsets reduces the runtime
modestly, we could do better than a load and store-heavy insertion sort,
particularly since each load and store requires a page mapping lookup in
the xfile.
For a small increase in kernel memory requirements, we could instead
bulk load the xfarray records into memory, use the kernel's existing
heapsort implementation to sort the records, and bulk store the memory
buffer back into the xfile. On the author's computer, this reduces the
runtime by about 5% on a 500,000 element array.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The btree bulk loading code requires that records be provided in the
correct record sort order for the given btree type. In general, repair
code cannot be required to collect records in order, and it is not
feasible to insert new records in the middle of an array to maintain
sort order.
Implement a sorting algorithm so that we can sort the records just prior
to bulk loading. In principle, an xfarray could consume many gigabytes
of memory and its backing pages can be sent out to disk at any time.
This means that we cannot map the entire array into memory at once, so
we must find a way to divide the work into smaller portions (e.g. a
page) that /can/ be mapped into memory.
Quicksort seems like a reasonable fit for this purpose, since it uses a
divide and conquer strategy to keep its average runtime logarithmic.
The solution presented here is a port of the glibc implementation, which
itself is derived from the median-of-three and tail call recursion
strategies outlined by Sedgwick.
Subsequent patches will optimize the implementation further by utilizing
the kernel's heapsort on directly-mapped memory whenever possible, and
improving the quicksort pivot selection algorithm to try to avoid O(n^2)
collapses.
Note: The sorting functionality gets its own patch because the basic big
array mechanisms were plenty for a single code patch.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add some debug knobs so that we can control the leaf and node block
slack when rebuilding btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We need to log EFIs for every extent that we allocate for the purpose of
staging a new btree so that if we fail then the blocks will be freed
during log recovery. Add a function to relog the EFIs, so that repair
can relog them all every time it creates a new btree block, which will
help us to avoid pinning the log tail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The AGFL repair code uses a series of bitmaps to figure out where there
are OWN_AG blocks that are not claimed by the free space and rmap
btrees. These blocks become the new AGFL, and any overflow is reaped.
The bitmaps current track xfs_fsblock_t even though we already know the
AG number.
In the last patch, we introduced a new bitmap "type" for tracking
xfs_agblock_t extents. Port the reaping code and the AGFL repair to use
this new type, which makes it very obvious what we're tracking. This
also eliminates a bunch of unnecessary agblock <-> fsblock conversions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents. This enables us to dispose of old resources more efficiently
than doing them block by block.
While we're at it, rename the reaping functions to make it clear that
they're reaping per-AG extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After an online repair, we need to invalidate buffers representing the
blocks from the old metadata that we're replacing. It's possible that
parts of a tree that were previously cached in memory are no longer
accessible due to media failure or other corruption on interior nodes,
so repair figures out the old blocks from the reverse mapping data and
scans the buffer cache directly.
Unfortunately, the current buffer cache code triggers asserts if the
rhashtable lookup finds a non-stale buffer of a different length than
the key we searched for. For regular operation this is desirable, but
for this repair procedure, we don't care since we're going to forcibly
stale the buffer anyway. Add an internal lookup flag to avoid the
assert.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rearrange the logic inside xrep_reap_block to make it more obvious that
crosslinked metadata blocks are handled differently. Add a couple of
tracepoints so that we can tell what's going on at the end of a btree
rebuild operation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use deferred frees (EFIs) to reap the blocks of a btree that we just
replaced. This helps us to shrink the window in which those old blocks
could be lost due to a system crash, though we try to flush the EFIs
every few hundred blocks so that we don't also overflow the transaction
reservations during and after we commit the new btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Now that we've refactored btree cursors to require the caller to pass in
a perag structure, there are numerous problems in xrep_reap_extents if
it's being called to reap extents for an inode metadata repair. We
don't have any repair functions that can do that, so drop the support
for now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we're discarding old btree blocks after a repair, only invalidate
the buffers for the ones that we're freeing -- if the metadata was
crosslinked with another data structure, we don't want to touch it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Reaping blocks after a repair is a complicated affair involving a lot of
rmap btree lookups and figuring out if we're going to unmap or free old
metadata blocks that might be crosslinked. Eventually, we will need to
be able to reap per-AG metadata blocks, bmbt blocks from inode forks,
garbage CoW staging extents, and (even later) blocks from btrees rooted
in inodes. This results in a lot of reaping code, so we might as well
split that off while it's easy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
These two functions date from the era when I thought that we could
rebuild btrees by creating an alternate root and adding records one by
one. In other words, they predate the btree bulk loader. They're not
necessary now, so remove them.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Strengthen the rmap btree record checker a little more by comparing
OWN_REFCBT reverse mappings against the refcount btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Strengthen the rmap btree record checker a little more by comparing
OWN_INOBT reverse mappings against the inode btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Strengthen the rmap btree record checker a little more by comparing
OWN_AG reverse mappings against the free space btrees, the rmap btree,
and the AGFL.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Strengthen the rmap btree record checker a little more by comparing
OWN_FS and OWN_LOG reverse mappings against the AG headers and internal
logs, respectively.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a typechecked bitmap for extents within an AG. Online repair
uses bitmaps to store various different types of numbers, so let's make
it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t)
versus anything else.
In subsequent patches, we're going to use agblock bitmaps to enhance the
rmapbt checker to look for discrepancies between the rmapbt records and
AG metadata block usage.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Convert the xbitmap code to use interval trees instead of linked lists.
This reduces the amount of coding required to handle the disunion
operation and in the future will make it easier to set bits in arbitrary
order yet later be able to extract maximally sized extents, which we'll
need for rebuilding certain structures. We define our own interval tree
type so that it can deal with 64-bit indices even on 32-bit machines.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
It's not safe to edit bitmap intervals while we're iterating them with
for_each_xbitmap_extent. None of the existing callers actually need
that ability anyway, so drop the safe variable.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Remove the for_each_xbitmap_ macros in favor of proper iterator
functions. We'll soon be switching this data structure over to an
interval tree implementation, which means that we can't allow callers to
modify the bitmap during iteration without telling us.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|