Age | Commit message (Collapse) | Author |
|
Reconstruct the refcount data from the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rebuild the free space btrees from the gaps in the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed. Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Move the realtime summary file checking code to a separate file in
preparation to actually implement it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we want to scrub a file, get our own reference to the inode
unconditionally. This will make disposal rules simpler in the long run.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We need to log EFIs for every extent that we allocate for the purpose of
staging a new btree so that if we fail then the blocks will be freed
during log recovery. Add a function to relog the EFIs, so that repair
can relog them all every time it creates a new btree block, which will
help us to avoid pinning the log tail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add some debug knobs so that we can control the leaf and node block
slack when rebuilding btrees.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a new xrep_newbt structure to encapsulate a fake root for
creating a staged btree cursor as well as to track all the blocks that
we need to reserve in order to build that btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Convert the xbitmap code to use interval trees instead of linked lists.
This reduces the amount of coding required to handle the disunion
operation and in the future will make it easier to set bits in arbitrary
order yet later be able to extract maximally sized extents, which we'll
need for rebuilding certain structures. We define our own interval tree
type so that it can deal with 64-bit indices even on 32-bit machines.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we're freeing extents that have been set in a bitmap, break the
bitmap extent into multiple sub-extents organized by fate, and reap the
extents. This enables us to dispose of old resources more efficiently
than doing them block by block.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
It's not safe to edit bitmap intervals while we're iterating them with
for_each_xbitmap_extent. None of the existing callers actually need
that ability anyway, so drop the safe variable.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Remove the for_each_xbitmap_ macros in favor of proper iterator
functions. We'll soon be switching this data structure over to an
interval tree implementation, which means that we can't allow callers to
modify the bitmap during iteration without telling us.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use deferred frees (EFIs) to reap the blocks of a btree that we just
replaced. This helps us to shrink the window in which those old blocks
could be lost due to a system crash, though we try to flush the EFIs
every few hundred blocks so that we don't also overflow the transaction
reservations during and after we commit the new btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Now that we've refactored btree cursors to require the caller to pass in
a perag structure, there are numerous problems in xrep_reap_extents if
it's being called to reap extents for an inode metadata repair. We
don't have any repair functions that can do that, so drop the support
for now.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we're discarding old btree blocks after a repair, only invalidate
the buffers for the ones that we're freeing -- if the metadata was
crosslinked with another data structure, we don't want to touch it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Most callers of xfs_rmap_lookup_le will retrieve the btree record
immediately if the lookup succeeds. The overlapped version of this
function (xfs_rmap_lookup_le_range) will return the record if the lookup
succeeds, so make the regular version do it too. Get rid of the useless
len argument, since it's not part of the lookup key.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Gaps in the reference count btree are also significant -- for these
regions, there must not be any overlapping reverse mappings. We don't
currently check this, so make the refcount scrubber more complete.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Teach scrub to flag quota files containing unwritten extents.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When scrub is checking file fork mappings against rmap records and
the rmap record starts before or ends after the bmap record, check the
adjacent bmap records to make sure that they're adjacent to the one
we're checking. This helps us to detect cases where the rmaps cover
territory that the bmaps do not.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Teach the summary count checker to count the number of free realtime
extents and compare that to the superblock copy.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When scrub is checking a non-root btree block, it should make sure that
the keys in the parent btree block accurately capture the keyspace that
the child block stores.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The kernel test robot found the following bug when running xfs/355 to
scrub a bmap btree:
XFS: Assertion failed: !sa->pag, file: fs/xfs/scrub/common.c, line: 412
------------[ cut here ]------------
kernel BUG at fs/xfs/xfs_message.c:110!
invalid opcode: 0000 [#1] SMP PTI
CPU: 2 PID: 1415 Comm: xfs_scrub Not tainted 5.14.0-rc4-00021-g48c6615cc557 #1
Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
RIP: 0010:assfail+0x23/0x28 [xfs]
RSP: 0018:ffffc9000aacb890 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffffc9000aacbcc8 RCX: 0000000000000000
RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffc09e7dcd
RBP: ffffc9000aacbc80 R08: ffff8881fdf17d50 R09: 0000000000000000
R10: 000000000000000a R11: f000000000000000 R12: 0000000000000000
R13: ffff88820c7ed000 R14: 0000000000000001 R15: ffffc9000aacb980
FS: 00007f185b955700(0000) GS:ffff8881fdf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7f6ef43000 CR3: 000000020de38002 CR4: 00000000001706e0
Call Trace:
xchk_ag_read_headers+0xda/0x100 [xfs]
xchk_ag_init+0x15/0x40 [xfs]
xchk_btree_check_block_owner+0x76/0x180 [xfs]
xchk_btree_get_block+0xd0/0x140 [xfs]
xchk_btree+0x32e/0x440 [xfs]
xchk_bmap_btree+0xd4/0x140 [xfs]
xchk_bmap+0x1eb/0x3c0 [xfs]
xfs_scrub_metadata+0x227/0x4c0 [xfs]
xfs_ioc_scrub_metadata+0x50/0xc0 [xfs]
xfs_file_ioctl+0x90c/0xc40 [xfs]
__x64_sys_ioctl+0x83/0xc0
do_syscall_64+0x3b/0xc0
The unusual handling of errors while initializing struct xchk_ag is the
root cause here. Since the beginning of xfs_scrub, the goal of
xchk_ag_read_headers has been to read all three AG header buffers and
attach them both to the xchk_ag structure and the scrub transaction.
Corruption errors on any of the three headers doesn't necessarily
trigger an immediate return to userspace, because xfs_scrub can also
tell us to /fix/ the problem.
In other words, it's possible for the xchk_ag init functions to return
an error code and a partially filled out structure so that scrub can use
however much information it managed to pull. Before 5.15, it was
sufficient to cancel (or commit) the scrub transaction on the way out of
the scrub code to release the buffers.
Ccommit 48c6615cc557 added a reference to the perag structure to struct
xchk_ag. Since perag structures are not attached to transactions like
buffers are, this adds the requirement that the perag ref be released
explicitly. The scrub teardown function xchk_teardown was amended to do
this for the xchk_ag embedded in struct xfs_scrub.
Unfortunately, I forgot that certain parts of the scrub code probe
multiple AGs and therefore handle the initialization and cleanup on
their own. Specifically, the bmbt scrubber will initialize it long
enough to cross-reference AG metadata for btree blocks and for the
extent mappings in the bmbt.
If one of the AG headers is corrupt, the init function returns with a
live perag structure reference and some of the AG header buffers. If an
error occurs, the cross referencing will be noted as XCORRUPTion and
skipped, but the main scrub process will move on to the next record.
It is now necessary to release the perag reference before we try to
analyze something from a different AG, or else we'll trip over the
assertion noted above.
Fixes: 48c6615cc557 ("xfs: grab active perag ref when reading AG headers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
|
|
Stop directly referencing b_bn in code outside the buffer cache, as
b_bn is supposed to be used only as an internal cache index.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Introduce a helper function xfs_buf_daddr() to extract the disk
address of the buffer from the struct xfs_buf. This will replace
direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as
the XFS_BUF_ADDR() macro.
This patch introduces the helper function and replaces all uses of
XFS_BUF_ADDR() as this is just a simple sed replacement.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rather than open coding XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5
checks everywhere, add a simple wrapper to encapsulate this and make
the code easier to read.
This allows us to remove the xfs_sb_version_has_v3inode() wrapper
which is only used in xfs_format.h now and is just a version number
check.
There are a couple of places where we should be checking the mount
feature bits rather than the superblock version (e.g. remount), so
those are converted to use xfs_has_crc(mp) instead.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
This is a conversion of the remaining xfs_sb_version_has..(sbp)
checks to use xfs_has_..(mp) feature checks.
This was largely done with a vim replacement macro that did:
:0,$s/xfs_sb_version_has\(.*\)&\(.*\)->m_sb/xfs_has_\1\2/g<CR>
A couple of other variants were also used, and the rest touched up
by hand.
$ size -t fs/xfs/built-in.a
text data bss dec hex filename
before 1127533 311352 484 1439369 15f689 (TOTALS)
after 1125360 311352 484 1437196 15ee0c (TOTALS)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The scrub feature checks are the last place that the superblock
feature checks are used. Convert them to mount based feature checks.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Remove the shouty macro and instead use the inline function that
matches other state/feature check wrapper naming. This conversion
was done with sed.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The remaining mount flags kept in m_flags are actually runtime state
flags. These change dynamically, so they really should be updated
atomically so we don't potentially lose an update due to racing
modifications.
Convert these remaining flags to be stored in m_opstate and use
atomic bitops to set and clear the flags. This also adds a couple of
simple wrappers for common state checks - read only and shutdown.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Replace m_flags feature checks with xfs_has_<feature>() calls and
rework the setup code to set flags in m_features.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Convert the xfs_sb_version_hasfoo() to checks against
mp->m_features. Checks of the superblock itself during disk
operations (e.g. in the read/write verifiers and the to/from disk
formatters) are not converted - they operate purely on the
superblock state. Everything else should use the mount features.
Large parts of this conversion were done with sed with commands like
this:
for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do
sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f
done
With manual cleanups for things like "xfs_has_extflgbit" and other
little inconsistencies in naming.
The result is ia lot less typing to check features and an XFS binary
size reduced by a bit over 3kB:
$ size -t fs/xfs/built-in.a
text data bss dec hex filenam
before 1130866 311352 484 1442702 16038e (TOTALS)
after 1127727 311352 484 1439563 15f74b (TOTALS)
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Because there are a lot of tracepoints that express numeric data with
an associated unit and tag, document what they are to help everyone else
keep these thigns straight.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
When using pretty-printed scrub tracepoints, decode the meaning of the
scrub flags as strings for easier reading.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Always print inode generation in hexadecimal and preceded with the unit
"gen".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Emit whichfork values as text strings in the ftrace output.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Some of our tracepoints have a field known as "len". That name doesn't
describe any units, which makes the fields not very useful. Rename the
fields to capture units and ensure the format is hexadecimal.
"fsbcount" are in units of fs blocks
"bbcount" are in units of 512b blocks
"ireccount" are in units of inodes
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Some of our tracepoints describe fields as "offset". That name doesn't
describe any units, which makes the fields not very useful. Rename the
fields to capture units and ensure the format is hexadecimal.
"fileoff" means file offset, in units of fs blocks
"pos" means file offset, in bytes
"forkoff" means inode fork offset, in bytes
The one remaining "offset" value is for iclogs, since that's the byte
offset of the end of where we've written into the current iclog.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Always print rmap owner number in hexadecimal and preceded with the unit
"owner".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Always print allocation group block numbers in hexadecimal and preceded
with the unit "agbno".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Always print allocation group numbers in hexadecimal and preceded with
the unit "agno".
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
Always print inode numbers in hexadecimal and preceded with the unit
"ino" or "agino", as apropriate. Fix one tracepoint that used "ino %u"
for an inode btree block count to reduce confusion.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
XFS_DADDR_TO_FSB converts a raw disk address (in units of 512b blocks)
to a raw disk address (in units of fs blocks). Unfortunately, the
xchk_block_error_class tracepoints incorrectly uses this to decode
xfs_daddr_t into segmented AG number and AG block addresses. Use the
correct translation code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
|
|
xchk_btree calls a user-supplied function to validate each btree record
that it finds. Those functions are not supposed to change the record
data, so mark the parameter const.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
The query_range functions are supposed to call a caller-supplied
function on each record found in the dataset. These functions don't
own the memory storing the record, so don't let them change the record.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Now that we always grab an active reference to a perag structure when
dealing with perag metadata, we can remove this unnecessary variable.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
There is no reason for this wrapper existing anymore. All the places
that use KM_NOFS allocation are within transaction contexts and
hence covered by memalloc_nofs_save/restore contexts. Hence we don't
need any special handling of vmalloc for large IOs anymore and
so special casing this code isn't necessary.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
This patch prepares scrub to deal with the possibility of tearing down
entire AGs by changing the order of resource acquisition to match the
rest of the XFS codebase. In other words, scrub now grabs AG resources
in order of: perag structure, then AGI/AGF/AGFL buffers, then btree
cursors; and releases them in reverse order.
This requires us to distinguish xchk_ag_init callers -- some are
responding to a user request to check AG metadata, in which case we can
return ENOENT to userspace; but other callers have an ondisk reference
to an AG that they're trying to cross-reference. In this second case,
the lack of an AG means there's ondisk corruption, since ondisk metadata
cannot point into nonexistent space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
|