Age | Commit message | Author |
|
Use the fs freezing mechanism we developed for the rmapbt repair to
freeze the fs, this time to scan the fs for a live quotacheck. We add a
new dqget variant to use the existing scrub transaction to allocate an
on-disk dquot block if it is missing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Rebuild the reverse mapping btree from all primary metadata.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Introduce a new 'online scrub freeze' that we can use to lock out all
filesystem modifications and background activity so that we can perform
global scans in order to rebuild metadata. This introduces a new IFLAG
to the scrub ioctl to indicate that userspace is willing to allow a
freeze.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
If scrub finds that everything is ok with the filesystem, we need a way
to tell the health tracking that it can let go of indirect health flags,
since indirect flags only mean that at some point in the past we lost
some context.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
If an unhealthy inode gets inactivated, remember this fact in the
per-fs health summary.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Establish two more classes of health tracking bits:
* Indirect problems, which suggest problems in other health domains
that we weren't able to preserve; and
* Secondary problems, which track state that's related to primary
evidence of health problems.
The first class we'll use in an upcoming patch to record in the AG
health status the fact that we ran out of memory and had to inactivate
an inode with defective metadata. The second class we use to indicate
that repair knows that an inode is bad and we need to fix it later.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Split the inode inactivation work into per-AG work items so that we can
take advantage of parallelization.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
If we think that inactivation will free enough blocks to make it easier
to satisfy an fallocate request, force inactivation.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Any time we try a file write that fails due to ENOSPC or EDQUOT, force
inactivation work to free up some resources and try one more time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
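The retry-once behavior described above can be sketched in userspace C. This is only a toy model under stated assumptions: `struct toy_fs`, `toy_write`, `toy_force_inactivation`, and `toy_write_retry` are hypothetical names invented for illustration, not XFS functions, and the "deferred blocks" counter stands in for whatever space inactivation would give back.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model (hypothetical, not an XFS structure): blocks that are free
 * right now, and blocks that deferred inactivation would give back. */
struct toy_fs {
    long free_blocks;
    long deferred_blocks;
};

/* Reserve blocks for a write, or fail with -ENOSPC. */
int toy_write(struct toy_fs *fs, long blocks)
{
    assert(blocks >= 0);
    if (blocks > fs->free_blocks)
        return -ENOSPC;
    fs->free_blocks -= blocks;
    return 0;
}

/* Stand-in for forcing the inactivation work to run to completion. */
void toy_force_inactivation(struct toy_fs *fs)
{
    fs->free_blocks += fs->deferred_blocks;
    fs->deferred_blocks = 0;
}

/* Try the write; on ENOSPC or EDQUOT, force inactivation and retry once. */
int toy_write_retry(struct toy_fs *fs, long blocks)
{
    bool retried = false;
    int error;

    for (;;) {
        error = toy_write(fs, blocks);
        if (error != -ENOSPC && error != -EDQUOT)
            return error;       /* success, or an unrelated error */
        if (retried)
            return error;       /* second failure: give up */
        toy_force_inactivation(fs);
        retried = true;
    }
}
```

With 2 free blocks and 8 deferred blocks, a 5-block write fails on the first attempt and succeeds on the retry, leaving 5 blocks free. A request that cannot be satisfied even after the forced scan still returns `-ENOSPC`.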
|
|
Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue. With this we
avoid blocking memory reclaim on filesystem metadata updates that are
necessary to free an in-core inode, such as post-eof block freeing, COW
staging extent freeing, and truncating and freeing unlinked inodes. Now
that work is deferred to a workqueue where we can do the freeing in
batches.
We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated. A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
Unfortunately, deferring the inactivation has one huge downside --
eventual consistency. Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we also force inactivation scans in order to maintain the existing
behaviors, at least outwardly.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
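The interaction between the two flags and a concurrent lookup can be sketched as a small state machine. This is a minimal userspace sketch, not the kernel code: the `TOY_` flag values, the `toy_` function names, and the choice of `-EAGAIN` for the "bail" case are all assumptions made for illustration, and locking is omitted entirely.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag bits; the real bit values are not given in the text. */
#define TOY_NEEDS_INACTIVE  (1u << 0)
#define TOY_INACTIVATING    (1u << 1)

struct toy_inode {
    unsigned int flags;
};

/* Last reference dropped: mark the inode so the worker can find it. */
void toy_queue_inactivation(struct toy_inode *ip)
{
    ip->flags |= TOY_NEEDS_INACTIVE;
}

/* Worker picks up a queued inode; returns 1 if it claimed the inode. */
int toy_worker_claim(struct toy_inode *ip)
{
    if (!(ip->flags & TOY_NEEDS_INACTIVE))
        return 0;
    ip->flags |= TOY_INACTIVATING;
    return 1;
}

/* Worker finished freeing resources: drop both flags. */
void toy_worker_finish(struct toy_inode *ip)
{
    assert(ip->flags & TOY_INACTIVATING);
    ip->flags &= ~(TOY_NEEDS_INACTIVE | TOY_INACTIVATING);
}

/* Lookup path: resurrect a queued inode, or bail if the worker owns it.
 * Returning -EAGAIN here is an assumption of this sketch. */
int toy_iget(struct toy_inode *ip)
{
    if (ip->flags & TOY_INACTIVATING)
        return -EAGAIN;
    ip->flags &= ~TOY_NEEDS_INACTIVE;
    return 0;
}
```

A lookup that arrives before the worker claims the inode simply clears the queued state. One that arrives after the claim has to back off until the worker is done.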
|
|
Refactor the code that determines if an inode matches an eofblocks
structure into a helper, since we already use it twice and we're about
to use it a third time.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Refactor the code that walks reclaim-tagged inodes so that we can reuse
the same loop in a subsequent patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Set up quota counters to track the number of inodes and blocks that will
be freed from inactivating unlinked inodes. We'll use this in the
deferred inactivation patch to hide the effects of deferred processing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Set up counters to track the number of inodes and blocks that will be
freed from inactivating unlinked inodes. We'll use this in the deferred
inactivation patch to hide the effects of deferred processing.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Add a predicate function to decide if an inode needs (deferred)
inactivation. Any file that has been unlinked or has speculative
preallocations either for post-EOF writes or for CoW qualifies.
This function will also be used by the upcoming deferred inactivation
patch.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
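The predicate described above reduces to three conditions. The following is a sketch only: `struct toy_file` and its field names are hypothetical simplifications, and the real function would inspect in-core inode state rather than precomputed booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of the state the predicate looks at. */
struct toy_file {
    unsigned int nlink;         /* 0 means the file has been unlinked */
    bool has_posteof_prealloc;  /* speculative blocks past EOF */
    bool has_cow_prealloc;      /* speculative CoW staging extents */
};

/* Does this file have work that (deferred) inactivation must do? */
bool toy_needs_inactivation(const struct toy_file *f)
{
    assert(f != NULL);
    if (f->nlink == 0)
        return true;
    return f->has_posteof_prealloc || f->has_cow_prealloc;
}
```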
|
|
Refactor the part of _free_eofblocks that decides if it's really going
to truncate post-EOF blocks into a separate helper function. The
upcoming deferred inode inactivation patch requires us to be able to
decide this prior to actual inactivation. No functionality changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
There are several problems with the initial implementations of the big
array and the blob array data structures. First, using linked lists
imposes a two-pointer overhead on every record stored. For blobs this
isn't serious, but for fixed-size records this increases memory
requirements by 40-60%. Second, we're using kernel memory to store the
intermediate records. Kernel memory cannot be paged out, which means we
run the risk of OOMing the machine when we run out of physical memory.
Therefore, replace the linked lists in both structures with memfd files.
Random access becomes much easier, memory overhead drops to a negligible
amount, and because memfd pages can be swapped, we have considerably
more flexibility for memory use.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
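To illustrate why a file-backed store removes the per-record pointer overhead and makes random access a matter of computing an offset, here is a userspace analogue. It is a sketch under assumptions: it uses an anonymous `tmpfile()` in place of a memfd, plain stdio instead of kernel I/O, and every `toy_farray` name is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Fixed-size record array backed by an anonymous temporary file.
 * Record i lives at byte offset i * recsize; no per-record pointers. */
struct toy_farray {
    FILE   *fp;
    size_t  recsize;
    size_t  nr;
};

int toy_farray_init(struct toy_farray *a, size_t recsize)
{
    assert(recsize > 0);
    a->fp = tmpfile();
    a->recsize = recsize;
    a->nr = 0;
    return a->fp ? 0 : -1;
}

/* Append one record at the end of the array. */
int toy_farray_append(struct toy_farray *a, const void *rec)
{
    if (fseek(a->fp, (long)(a->nr * a->recsize), SEEK_SET) != 0)
        return -1;
    if (fwrite(rec, a->recsize, 1, a->fp) != 1)
        return -1;
    a->nr++;
    return 0;
}

/* Random access: copy record idx into rec. */
int toy_farray_get(struct toy_farray *a, size_t idx, void *rec)
{
    if (idx >= a->nr)
        return -1;
    if (fseek(a->fp, (long)(idx * a->recsize), SEEK_SET) != 0)
        return -1;
    return fread(rec, a->recsize, 1, a->fp) == 1 ? 0 : -1;
}

void toy_farray_destroy(struct toy_farray *a)
{
    fclose(a->fp);
    a->fp = NULL;
}
```

The only bookkeeping in memory is the record size and count. The records themselves sit in pages that the system is free to write out, which is the property the message above is after.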
|
|
Fix anything that causes the quota verifiers to fail.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|
If the extended attributes look bad, try to sift through the rubble to
find whatever keys/values we can, zap the attr tree, and re-add the
values.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Create a new helper to unmap blocks from an inode's fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Remove the transaction roll at the end of the loop in
xfs_itruncate_extents_flags. xfs_defer_finish takes care of rolling the
transaction as needed and reattaching the inode, which means we already
start each loop with a clean transaction.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
There's no reason why we can't consume unmap_len, so just use the raw
version.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Create a simple 'blob array' data structure for storage of arbitrarily
sized metadata objects that will be used to reconstruct metadata. For
the intended usage (temporarily storing extended attribute names and
values) we only have to support storing objects and retrieving them.
This initial implementation uses linked lists to store the blobs, but a
subsequent patch will restructure the backend to avoid using high order
pinned kernel memory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Repair inconsistent symbolic link data.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Use the reverse-mapping btree information to rebuild an inode fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Determine if inode fork damage is responsible for the inode being unable
to pass the ifork verifiers in xfs_iget and zap the fork contents if
this is true. Once this is done the fork will be empty but we'll be
able to construct an in-core inode, and a subsequent call to the inode
fork repair ioctl will search the rmapbt to rebuild the records that
were in the fork.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Try to reinitialize corrupt inodes, or clear the reflink flag
if it's not needed.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Reconstruct the refcount data from the rmap btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Use the rmapbt to find inode chunks, query the chunks to compute
hole and free masks, and with that information rebuild the inobt
and finobt.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Separate the inode geometry information into a distinct structure.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Rebuild the free space btrees from the gaps in the rmap btree.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
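The idea of deriving free space from the gaps between reverse mappings can be sketched as a single pass over sorted records. This is an illustration, not the repair code: `struct toy_extent` and `toy_find_gaps` are hypothetical, the records are assumed to be sorted by start block, and overlapping records (as with shared blocks) are handled by tracking the highest end seen so far.

```c
#include <assert.h>
#include <stddef.h>

struct toy_extent {
    unsigned int start;
    unsigned int len;
};

/*
 * Walk "used" records sorted by start block and emit every gap between
 * them, plus the tail up to ag_blocks, as a free extent. Returns the
 * number of free extents written; output stops once max_free is reached.
 */
size_t toy_find_gaps(const struct toy_extent *used, size_t nr_used,
                     unsigned int ag_blocks,
                     struct toy_extent *free_out, size_t max_free)
{
    unsigned int next = 0;      /* first block not yet known to be used */
    size_t nr_free = 0;
    size_t i;

    for (i = 0; i < nr_used; i++) {
        unsigned int end = used[i].start + used[i].len;

        assert(i == 0 || used[i - 1].start <= used[i].start);
        if (used[i].start > next && nr_free < max_free) {
            free_out[nr_free].start = next;
            free_out[nr_free].len = used[i].start - next;
            nr_free++;
        }
        if (end > next)
            next = end;
    }
    if (ag_blocks > next && nr_free < max_free) {
        free_out[nr_free].start = next;
        free_out[nr_free].len = ag_blocks - next;
        nr_free++;
    }
    return nr_free;
}
```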
|
|
Create a simple 'big array' data structure for storage of fixed-size
metadata records that will be used to reconstruct a btree index. For
repair operations, the most important operations are append, iterate,
and sort; while supported, get and put are not for frequent use.
For the initial implementation we will use linked-list containers,
though a subsequent patch will restructure the backend to avoid using
pinned kernel memory.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Allow repair functions to set a separate function pointer to validate
the metadata that they've rebuilt. This prevents us from exiting from a
repair function that rebuilds both A and B without checking that both A
and B can pass a scrub test. We'll need this for the free space and
inode btree repair strategies.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
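The role of the separate validation pointer can be sketched as follows. All names here (`toy_repair_ops`, `toy_run_repair`, and the sample callbacks) are hypothetical, and the sketch assumes the default behavior is to recheck with the same scrub function that triggered the repair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_state {
    bool a_ok;
    bool b_ok;
};

struct toy_repair_ops {
    void (*repair)(struct toy_state *st);
    bool (*scrub)(const struct toy_state *st);      /* default recheck */
    bool (*revalidate)(const struct toy_state *st); /* optional override */
};

/* Run the repair, then recheck with revalidate if set, else with scrub. */
bool toy_run_repair(const struct toy_repair_ops *ops, struct toy_state *st)
{
    assert(ops->repair != NULL && ops->scrub != NULL);
    ops->repair(st);
    if (ops->revalidate != NULL)
        return ops->revalidate(st);
    return ops->scrub(st);
}

/* Sample callbacks: a repair that rebuilds A and B but botches B. */
void toy_repair_ab(struct toy_state *st)
{
    st->a_ok = true;
    st->b_ok = false;
}

bool toy_scrub_a(const struct toy_state *st)
{
    return st->a_ok;
}

bool toy_scrub_ab(const struct toy_state *st)
{
    return st->a_ok && st->b_ok;
}
```

Without the override, the recheck only looks at A and reports success even though B is still broken. With `toy_scrub_ab` as the validator, the same repair is correctly reported as a failure.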
|
|
If we have to initiate writeback of a range that starts beyond the
on-disk EOF, extend the flushed range to start at the on-disk EOF so
that there's no chance that we put real extents in the data fork having
not actually flushed the data.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
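The range adjustment amounts to clamping the start of the flush down to the on-disk EOF. A minimal sketch, with the hypothetical names `toy_range` and `toy_flush_range` and byte offsets passed as plain integers:

```c
#include <assert.h>

struct toy_range {
    long long start;
    long long end;
};

/* If the requested range begins past the on-disk EOF, pull the start
 * back to the on-disk EOF; otherwise leave the range unchanged. */
struct toy_range toy_flush_range(long long start, long long end,
                                 long long ondisk_eof)
{
    struct toy_range r = { start, end };

    assert(start <= end);
    if (r.start > ondisk_eof)
        r.start = ondisk_eof;
    return r;
}
```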
|
|
When writing to a delalloc region in the data fork, commit the new
allocations (of the da reservation) as unwritten so that the mappings
are only marked written once writeback completes successfully. This
fixes the problem of stale data exposure if the system goes down during
targeted writeback of a specific region of a file, as tested by
generic/042.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Skip cross-referencing with a btree if the health report tells us that
it's known to be bad. This should reduce the dmesg spew considerably.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Now that we have the ability to track sick metadata in-core, make scrub
and repair update those health assessments after doing work.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Now that we no longer memset the scrub context, we can move the
already_fixed variable into the scrub context's state flags instead of
passing around pointers to separate stack variables.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Combine all the boolean state flags in struct xfs_scrub into a single
unsigned int, because we're going to be adding more state flags soon.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
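Folding several booleans into one flags word looks roughly like this. The sketch is illustrative only: `struct toy_scrub` is not the real `struct xfs_scrub`, the bit assignments are made up, and `TOY_SCRUB_TRY_HARDER` is a hypothetical second flag added so that there is more than one bit to show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state bits packed into a single unsigned int. */
#define TOY_SCRUB_TRY_HARDER     (1u << 0)
#define TOY_SCRUB_ALREADY_FIXED  (1u << 1)

struct toy_scrub {
    unsigned int flags;
};

void toy_scrub_set(struct toy_scrub *sc, unsigned int f)
{
    assert(f != 0);
    sc->flags |= f;
}

void toy_scrub_clear(struct toy_scrub *sc, unsigned int f)
{
    sc->flags &= ~f;
}

bool toy_scrub_test(const struct toy_scrub *sc, unsigned int f)
{
    return (sc->flags & f) != 0;
}
```

Adding a new piece of state then costs one more `#define` rather than another structure member, and the whole state can be preserved or reset across a retry in a single assignment.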
|
|
It's a little silly how the memset in scrub context initialization
forces us to declare stack variables to preserve context variables
across a retry. Since the teardown functions already null out most of
the ephemeral state (buffer pointers, btree cursors, etc.), just skip
the memset and move the initialization as needed.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Don't allow any modifications to a file that's marked immutable, which
means that we have to flush all the writable pages to make them readonly
and we have to check the setattr/setflags parameters more closely.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The chattr manpage has this to say about immutable files:
"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."
However, we don't actually check the immutable flag in the setattr code,
which means that we can update project ids and extent size hints on
supposedly immutable files. Therefore, reject a setattr call on an
immutable file except for the case where we're trying to unset
IMMUTABLE.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
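The check described above can be sketched as a pure function over the current and requested attributes. This is not the kernel implementation: `struct toy_attrs`, the flag bit, and the use of `-EPERM` are assumptions of this example. It also lets a request that changes nothing pass on an immutable file, matching the behavior the neighboring SETFLAGS refactoring message says should be preserved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_FLAG_IMMUTABLE  (1u << 0)   /* illustrative bit value */

/* Hypothetical subset of the attributes a setattr call can change. */
struct toy_attrs {
    unsigned int flags;
    unsigned int projid;
    unsigned int extsize;
};

/* Return 0 if the change is allowed, -EPERM (assumed errno) if not. */
int toy_setattr_check(const struct toy_attrs *cur,
                      const struct toy_attrs *want)
{
    assert(cur != NULL && want != NULL);

    /* Files that are not immutable can be changed freely. */
    if (!(cur->flags & TOY_FLAG_IMMUTABLE))
        return 0;
    /* The caller is trying to unset IMMUTABLE: allowed. */
    if (!(want->flags & TOY_FLAG_IMMUTABLE))
        return 0;
    /* Still immutable: only a request that changes nothing may pass. */
    if (want->flags == cur->flags && want->projid == cur->projid &&
        want->extsize == cur->extsize)
        return 0;
    return -EPERM;
}
```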
|
|
Clean up the calling convention since we're editing the fsxattr struct
anyway.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
Refactor the SETFLAGS implementation to use the SETXATTR code directly
instead of partially constructing a struct fsxattr and calling bits and
pieces of the setxattr code. This reduces code size and becomes
necessary in the next patch to maintain the behavior of allowing
userspace to set immutable on an immutable file so long as nothing
/else/ about the attributes changes.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The chattr manpage has this to say about immutable files:
"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."
This means that we need to flush the page cache when setting the
immutable flag so that all mappings will become read-only again and
therefore programs cannot continue to write to writable mappings.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
We passed an inode into xfs_ioctl_setattr_get_trans with join_flags
indicating which locks are held on that inode. If we can't allocate a
transaction then we need to unlock the inode before we bail out.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
The chattr manpage has this to say about immutable files:
"A file with the 'i' attribute cannot be modified: it cannot be deleted
or renamed, no link can be created to this file, most of the file's
metadata can not be modified, and the file can not be opened in write
mode."
Once the flag is set, it is enforced for quite a few file operations,
such as fallocate, fpunch, fzero, rm, touch, open, etc. However, we
don't check for immutability when doing a write(), a PROT_WRITE mmap(),
a truncate(), or a write to a previously established mmap.
If a program has an open write fd to a file that the administrator
subsequently marks immutable, the program still can change the file
contents. Weird!
The ability to write to an immutable file does not follow the manpage
promise that immutable files cannot be modified. Worse yet, it's
inconsistent with the behavior of other syscalls which don't allow
modifications of immutable files.
Therefore, add the necessary checks to make the write, mmap, and
truncate behavior consistent with what the manpage says and consistent
with other syscalls on filesystems which support IMMUTABLE.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
|
|
It's possible for pagecache writeback to split up a large amount of work
into smaller pieces for throttling purposes or to reduce the amount of
time a writeback operation is pending. Whatever the reason, XFS can end
up with a bunch of IO completions that call for the same operation to be
performed on a contiguous extent mapping. Since mappings are extent
based in XFS, we'd prefer to run fewer transactions when we can.
When we're processing an ioend on the list of io completions, check to
see if the next items on the list are both adjacent and of the same
type. If so, we can merge the completions to reduce transaction
overhead.
On fast storage this doesn't seem to make much of a difference in
performance, though the number of transactions for an overnight xfstests
run seems to drop by ~5%.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
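The merge step can be sketched on a singly linked list of completions. This is a simplified model rather than the XFS code: `struct toy_ioend` and `toy_ioend_merge` are hypothetical, the list is assumed to be sorted by file offset already, and "same type" is reduced to comparing an integer tag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one I/O completion. */
struct toy_ioend {
    int               type;     /* kind of completion work required */
    long long         offset;   /* starting file offset in bytes */
    long long         size;     /* length in bytes */
    struct toy_ioend *next;
};

/*
 * Absorb the following list entries into head for as long as they are
 * of the same type and begin exactly where head currently ends.
 * Returns the number of entries merged. Merged entries are unlinked
 * from the list but not freed; the caller still owns them.
 */
int toy_ioend_merge(struct toy_ioend *head)
{
    int merged = 0;

    assert(head != NULL);
    while (head->next != NULL &&
           head->next->type == head->type &&
           head->next->offset == head->offset + head->size) {
        struct toy_ioend *n = head->next;

        head->size += n->size;
        head->next = n->next;
        n->next = NULL;
        merged++;
    }
    return merged;
}
```

After the merge, one completion covers the whole contiguous range, so the per-extent work for that range only has to be done once.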
|
|
Now that we're no longer using m_data_workqueue, remove it.
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
|