Age | Commit message (Collapse) | Author |
|
Now that we have the infrastructure to switch background workers on and
off at will, fix the block gc worker code so that we don't actually run
the worker when the filesystem is frozen, same as we do for deferred
inactivation.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Allow callers of the userspace eofblocks ioctl to set a limit on the
number of inodes to scan, and then plumb that through the interface.
This removes a minor wart from the internal inode walk interface.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a polled version of xfs_inactive_force so that we can force
inactivation while holding a lock (usually the umount lock) without
tripping over the softlockup timer. This is for callers that hold vfs
locks while calling inactivation, which is currently unmount, iunlink
processing during mount, and rw->ro remount.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Generally speaking, when a user calls fallocate, they're looking to
preallocate space in a file in the largest contiguous chunks possible.
If free space is low, it's possible that the free space will look
unnecessarily fragmented because there are unlinked inodes that are
holding on to space that we could allocate. When this happens,
fallocate makes suboptimal allocation decisions for the sake of deleted
files, which doesn't make much sense, so scan the filesystem for dead
items to delete to try to avoid this.
Note that there are a handful of fstests that fill a filesystem, delete
just enough files to allow a single large allocation, and check that
fallocate actually gets the allocation. These tests regress because the
test runs fallocate before the inode gc has a chance to run, so add this
behavior to maintain as much of the old behavior as possible.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Any time we try to modify a file's contents and it fails due to ENOSPC
or EDQUOT, force inode inactivation work to try to free space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Allow administrators to control the length that we defer inode
inactivation. By default we'll set the delay to 2 seconds, as an
arbitrary choice between allowing for some batching of a deltree
operation, and not letting too many inodes pile up in memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue. With this we
avoid blocking memory reclaim on filesystem metadata updates that are
necessary to free an in-core inode, such as post-eof block freeing, COW
staging extent freeing, and truncating and freeing unlinked inodes. Now
that work is deferred to a workqueue where we can do the freeing in
batches.
We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated. A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
Unfortunately, deferring the inactivation has one huge downside --
eventual consistency. Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we also force inactivation scans in order to maintain the existing
behaviors, at least outwardly.
For this patch we'll set the delay to zero to mimic the old timing as
much as possible; in the next patch we'll play with different delay
settings.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Hoist the code in xfs_iget_cache_hit that restores the VFS inode state
to an xfs_inode that was previously vfs-destroyed. The next patch will
add a new set of state flags, so we need the helper to avoid
duplication.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
In preparation for adding another incore inode tree tag, refactor the
code that sets and clears tags from the per-AG inode tree and the tree
of per-AG structures, and remove the open-coded versions used by the
blockgc code.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Merge these two inode walk loops together, since they're pretty similar
now. Get rid of XFS_ICI_NO_TAG since nobody uses it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Pass a pointer to the actual eofb structure around the inode scanner
functions instead of a void pointer, now that none of the functions is
used as a callback.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
It turns out that there is a 1:1 mapping between the execute and tag
parameters that are passed to xfs_inode_walk_ag:
xfs_blockgc_scan_inode <=> XFS_ICI_BLOCKGC_TAG
Since the only user of the inode walk function is the blockgc code, we
don't need the tag parameter or the execute function pointer. The inode
deferred inactivation changes in the next series will add a second
tag:function pair, so we'll leave the tag parameter for now.
For the price of a forward static declaration, we can eliminate the
indirect function call. This likely has a negligible impact on
performance (since the execute function runs transactions), but it also
simplifies the function signature.
Radix tree tags are unsigned ints, so fix the type usage for all those
tags.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
The sole iter_flags is XFS_INODE_WALK_INEW_WAIT, and there are no users.
Remove the flag, and the parameter, and all the code that used it.
Since there are no longer any external callers of xfs_inode_walk, make
it static.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Using xfs_inode_walk in xfs_qm_dqrele_all_inodes is complete overkill,
given that function simplify wants to iterate all live inodes known
to the VFS. Just iterate over the s_inodes list.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Fix the weird split of responsibilities between xfs_can_free_eofblocks
and xfs_free_eofblocks by moving the chunk of code that looks for any
actual post-EOF space mappings from the second function into the first.
This clears the way for deferred inode inactivation to be able to decide
if an inode needs inactivation work before committing the released inode
to the inactivation code paths (vs. marking it for reclaim).
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
In xfs_inode_free_eofblocks, move the xfs_can_free_eofblocks call
further down in the function to the point where we have taken the
IOLOCK. This is preparation for the next patch, where we will need that
lock (or equivalent) so that we can check if there are any post-eof
blocks to clean out.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Since we're about to start using the blockgc workqueue to dispose of
inactivated inodes, strip the "block" prefix from the name; now it's
merely the general garbage collection (gc) workqueue.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Files containing metadata (quota records, rt bitmap and summary info)
are fully managed by the filesystem, which means that all resource
cleanup must be explicit, not automatic. This means that they should
never be subjected automatic to post-eof truncation, nor should they be
freed automatically even if the link count drops to zero.
In other words, xfs_inactive() should leave these files alone. Add the
necessary predicate functions to make this happen. This adds a second
layer of prevention for the kinds of fs corruption that was fixed by
commit f4c32e87de7d. If we ever decide to support removing metadata
files, we should make all those metadata updates explicit.
Rearrange the order of #includes to fix compiler errors, since
xfs_mount.h is supposed to be included before xfs_inode.h
Followup-to: f4c32e87de7d ("xfs: fix realtime bitmap/summary file truncation when growing rt volume")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Whenever we encounter XFS_CORRUPT_ON failures, we should report that to
the health monitoring system for later reporting.
I started with this and massaged everything until it built:
@@
expression mp, test;
@@
- if (XFS_CORRUPT_ON(mp, test)) return -EFSCORRUPTED;
+ if (XFS_CORRUPT_ON(mp, test)) { xfs_btree_mark_sick(cur); return -EFSCORRUPTED; }
@@
expression mp, test;
identifier label, error;
@@
- if (XFS_CORRUPT_ON(mp, test)) { error = -EFSCORRUPTED; goto label; }
+ if (XFS_CORRUPT_ON(mp, test)) { xfs_btree_mark_sick(cur); error = -EFSCORRUPTED; goto label; }
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt realtime metadat blocks, we should report
that to the health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt quota blocks, we should report that to the
health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt inode records, we should report that to
the health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt symbolic link blocks, we should report
that to the health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt directory or extended attribute blocks, we
should report that to the health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter corrupt btree blocks, we should report that to the
health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter a corrupt block mapping, we should report that to
the health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Whenever we encounter a corrupt AG header, we should report that to the
health monitoring system for later reporting.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Split the setting of the sick and checked masks into separate functions
as part of preparing to add the ability for regular runtime fs code
(i.e. not scrub) to mark metadata structures sick when corruptions are
found. Improve the documentation of libxfs' requirements for helper
behavior.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the shadow quota counters that live quotacheck creates to reset the
incore dquot counters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a shadow dqtrx system in the quotacheck code that hooks the
regular dquot counter update code. This will be the means to keep our
copy of the dquot counters up to date while the scan runs in real time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a new trio of scrub functions to check quota counters. While the
dquots themselves are filesystem metadata and should be checked early,
the dquot counter values are computed from other metadata and are
therefore summary counters. We don't plug these into the scrub dispatch
just yet, because we still need to be able to watch quota updates while
doing our scan.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Report the health of quota counts.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Fix anything that causes the quota verifiers to fail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a new helper to unmap blocks from an inode's fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Add a helper function to repair the core and forks of a metadata inode,
so that we can get move onto the task of repairing higher level metadata
that lives in an inode.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Repair inconsistent symbolic link data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the reverse-mapping btree information to rebuild an inode block map.
Update the btree bulk loading code as necessary to support inode rooted
btrees and fix some bitrot problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Building off the rmap scanner that we added in the previous patch, we
can now find block 0 and try to use the information contained inside of
it to guess the mode of an inode if it's totally improper.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Try to reinitialize corrupt inodes, or clear the reflink flag
if it's not needed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Reconstruct the refcount data from the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Rebuild the free space btrees from the gaps in the rmap btree.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
After an online repair function runs for a per-AG metadata structure,
sc->sick_mask is supposed to reflect the per-AG metadata that the repair
function fixed. Our next move is to re-check the metadata to assess
the completeness of our repair, so we don't want the rebuilt structure
to be excluded from the rescan just because the health system previously
logged a problem with the data structure.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Finish the realtime summary scrubber by adding the functions we need to
compute a fresh copy of the rtsummary info and comparing it to the copy
on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Move the realtime summary file checking code to a separate file in
preparation to actually implement it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
When we want to scrub a file, get our own reference to the inode
unconditionally. This will make disposal rules simpler in the long run.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort.
Earlier implementations of the big array used linked lists and suffered
from severe problems -- pinning all records in kernel memory was not a
good idea and frequently lead to OOM situations; random access was very
inefficient; and record overhead for the lists was unacceptably high at
40-60%.
Therefore, the big memory array relies on the 'xfile' abstraction, which
creates a memfd file and stores the records in page cache pages. Since
the memfd is created in tmpfs, the memory pages can be pushed out to
disk if necessary and we have a built-in usage limit of 50% of physical
memory.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
We need to log EFIs for every extent that we allocate for the purpose of
staging a new btree so that if we fail then the blocks will be freed
during log recovery. Add a function to relog the EFIs, so that repair
can relog them all every time it creates a new btree block, which will
help us to avoid pinning the log tail.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|