summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-10-14xfs: cross-reference rmap records with refcount btreesscrub-strengthen-rmap-checking_2022-10-14Darrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_REFCBT reverse mappings against the refcount btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with inode btreesDarrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_INOBT reverse mappings against the inode btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with free space btreesDarrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_AG reverse mappings against the free space btrees, the rmap btree, and the AGFL. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: cross-reference rmap records with ag btreesDarrick J. Wong
Strengthen the rmap btree record checker a little more by comparing OWN_FS and OWN_LOG reverse mappings against the AG headers and internal logs, respectively. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: introduce bitmap type for AG blocksDarrick J. Wong
Create a typechecked bitmap for extents within an AG. Online repair uses bitmaps to store various different types of numbers, so let's make it obvious when we're storing xfs_agblock_t (and later xfs_fsblock_t) versus anything else. In subsequent patches, we're going to use agblock bitmaps to enhance the rmapbt checker to look for discrepancies between the rmapbt records and AG metadata block usage. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: convert xbitmap to interval treerepair-bitmap-rework_2022-10-14Darrick J. Wong
Convert the xbitmap code to use interval trees instead of linked lists. This reduces the amount of coding required to handle the disunion operation and in the future will make it easier to set bits in arbitrary order yet later be able to extract maximally sized extents, which we'll need for rebuilding certain structures. We define our own interval tree type so that it can deal with 64-bit indices even on 32-bit machines. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: drop the _safe behavior from the xbitmap foreach macroDarrick J. Wong
It's not safe to edit bitmap intervals while we're iterating them with for_each_xbitmap_extent. None of the existing callers actually need that ability anyway, so drop the safe variable. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: remove the for_each_xbitmap_ helpersDarrick J. Wong
Remove the for_each_xbitmap_ macros in favor of proper iterator functions. We'll soon be switching this data structure over to an interval tree implementation, which means that we can't allow callers to modify the bitmap during iteration without telling us. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: only allocate free space bitmap for xattr scrub if neededscrub-fix-xattr-memory-mgmt_2022-10-14Darrick J. Wong
The free space bitmap is only required if we're going to check the bestfree space at the end of an xattr leaf block. Therefore, we can reduce the memory requirements of this scrubber if we can determine that the xattr is in short format. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: clean up xattr scrub initializationDarrick J. Wong
Clean up local variable initialization and error returns in xchk_xattr. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check used space of shortform xattr structuresDarrick J. Wong
Make sure that the records used inside a shortform xattr structure do not overlap. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: move xattr scrub buffer allocation to top level functionDarrick J. Wong
Move the xchk_setup_xattr_buf call from xchk_xattr_block to xchk_xattr, since we only need to set up the leaf block bitmaps once. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: remove flags argument from xchk_setup_xattr_bufDarrick J. Wong
All callers pass XCHK_GFP_FLAGS as the flags argument to xchk_setup_xattr_buf, so get rid of the argument. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: split valuebuf from xchk_xattr_buf.bufDarrick J. Wong
Move the xattr value buffer from somewhere in xchk_xattr_buf.buf[] to an explicit pointer. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: split usedmap from xchk_xattr_buf.bufDarrick J. Wong
Move the used space bitmap from somewhere in xchk_xattr_buf.buf[] to an explicit pointer. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: split freemap from xchk_xattr_buf.bufDarrick J. Wong
Move the free space bitmap from somewhere in xchk_xattr_buf.buf[] to an explicit pointer. This is the start of removing the complex overloaded memory buffer that is the source of weird memory misuse bugs. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: remove unnecessary dstmap in xattr scrubberDarrick J. Wong
Replace bitmap_and with bitmap_intersects in the xattr leaf block scrubber, since we only care if there's overlap between the used space bitmap and the free space bitmap. This means we don't need dstmap any more, and can thus reduce the memory requirements. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check for reverse mapping records that could be mergedscrub-detect-mergeable-records_2022-10-14Darrick J. Wong
Enhance the rmap scrubber to flag adjacent records that could be merged. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check overlapping rmap btree recordsDarrick J. Wong
The rmap btree scrubber doesn't contain sufficient checking for records that cannot overlap but do anyway. For the other btrees, this is enforced by the inorder checks in xchk_btree_rec, but the rmap btree is special because it allows overlapping records to handle shared data extents. Therefore, enhance the rmap btree record check function to compare each record against the previous one so that we can detect overlapping rmap records for space allocations that do not allow sharing. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: flag refcount btree records that could be mergedDarrick J. Wong
Complain if we encounter refcount btree records that could be merged. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: flag free space btree records that could be mergedDarrick J. Wong
Complain if we encounter free space btree records that could be merged. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: alert the user about data/attr fork mappings that could be mergedDarrick J. Wong
If the data or attr forks have mappings that could be merged, let the user know that the structure could be optimized. This isn't a filesystem corruption since the regular filesystem does not try to be smart about merging bmbt records. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: change bmap scrubber to store the previous mappingDarrick J. Wong
Convert the inode data/attr/cow fork scrubber to remember the entire previous mapping, not just the next expected offset. No behavior changes here, but this will enable some better checking in subsequent patches. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: always check the existence of a dirent's child inodescrub-dir-iget-fixes_2022-10-14Darrick J. Wong
When we're scrubbing directory entries, we always need to iget the child inode to make sure that the inode pointer points to a valid inode. The original directory scrub code (commit a5c4) only set us up to do this for ftype=1 filesystems, which is not sufficient; and then commit 4b80 made it worse by exempting the dot and dotdot entries. Sorta-fixes: a5c46e5e8912 ("xfs: scrub directory metadata") Sorta-fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: xfs_iget in the directory scrubber needs to use UNTRUSTEDDarrick J. Wong
In commit 4b80ac64450f, we tried to strengthen the directory scrubber by using the iget call to detect directory entries that point to unallocated inodes. Unfortunately, that commit neglected to pass XFS_IGET_UNTRUSTED to xfs_iget, so we don't check the inode btree first. If the inode number points to something that isn't even an inode cluster, iget will throw corruption errors and return -EFSCORRUPTED, which means that we fail to mark the directory corrupt. Fixes: 4b80ac64450f ("xfs: scrub should mark a directory corrupt if any entries cannot be iget'd") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: make checking directory dotdot entries more reliableDarrick J. Wong
The current directory parent scrubbing code could be tighter in its execution -- instead of bailing out to userspace after a couple of seconds of waiting for the (alleged) parent directory's IOLOCK while refusing to release the child directory's IOLOCK, we could just cycle both locks until we get both or the child process absorbs a fatal signal. Note that because the usual sequence is to take IOLOCKs before grabbing a transaction, we have to use the _nowait variants on both inodes to avoid an ABBA deadlock. Since parent pointer checking is the only place in scrub that needs this kind of functionality, move it to parent.c as a private function. Furthermore, if the child directory's parent changes during the lock cycling, we know that the new parent has stamped the correct parent into the dotdot entry, so we can conclude that the parent entry is correct. This eliminates an entire source of -EDEADLOCK-based "retry harder" scrub executions. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: retain the AGI when we can't iget an inode to scrub the corescrub-iget-fixes_2022-10-14Darrick J. Wong
xchk_get_inode is not quite the right function to be calling from the inode scrubber setup function. The common get_inode function either gets an inode and installs it in the scrub context, or it returns an error code explaining what happened. This is acceptable for most file scrubbers because it is not in their scope to fix corruptions in the inode core and fork areas that cause iget to fail. Dealing with these problems is within the scope of the inode scrubber, however. If iget fails with EFSCORRUPTED, we need to xchk_inode to flag that as corruption. Since we can't get our hands on an incore inode, we need to hold the AGI to prevent inode allocation activity so that nothing changes in the inode metadata. Looking ahead to the inode core repair patches, we will also need to hold the AGI buffer into xrep_inode so that we can make modifications to the xfs_dinode structure without any other thread swooping in to allocate or free the inode. Adapt the xchk_get_inode into xchk_setup_inode since this is a one-off use case where the error codes we check for are a little different, and the return state is much different from the common function. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: fix an inode lookup race in xchk_get_inodeDarrick J. Wong
In commit d658e, we tried to improve the robustnes of xchk_get_inode in the face of EINVAL returns from iget by calling xfs_imap to see if the inobt itself thinks that the inode is allocated. Unfortunately, that commit didn't consider the possibility that the inode gets allocated after iget but before imap. In this case, the imap call will succeed, but we turn that into a corruption error and tell userspace the inode is corrupt. Avoid this false corruption report by grabbing the AGI header and retrying the iget before calling imap. If the iget succeeds, we can proceed with the usual scrub-by-handle code. Fix all the incorrect comments too, since unreadable/corrupt inodes no longer result in EINVAL returns. Fixes: d658e72b4a09 ("xfs: distinguish between corrupt inode and invalid inum in xfs_scrub_get_inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: manage inode DONTCACHE status at irele timeDarrick J. Wong
Right now, there are statements scattered all over the online fsck codebase about how we can't use XFS_IGET_DONTCACHE because of concerns about scrub's unusual practice of releasing inodes with transactions held. However, iget is the wrong place to handle this -- the DONTCACHE state doesn't matter at all until we try to *release* the inode, and here we get things wrong in multiple ways: First, if we /do/ have a transaction, we must NOT drop the inode, because the inode could have dirty pages, dropping the inode will trigger writeback, and writeback can trigger a nested transaction. Second, if the inode already had an active reference and the DONTCACHE flag set, the icache hit when scrub grabs another ref will not clear DONTCACHE. This is sort of by design, since DONTCACHE is now used to initiate cache drops so that sysadmins can change a file's access mode between pagecache and DAX. Third, if we do actually have the last active reference to the inode, we can set DONTCACHE to avoid polluting the cache. This is the /one/ case where we actually want that flag. Create an xchk_irele helper to encode all that logic and switch the online fsck code to use it. Since this now means that nearly all scrubbers use the same xfs_iget flags, we can wrap them too. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check inode core when scrubbing metadata filesscrub-check-metadata-inode-records_2022-10-14Darrick J. Wong
Metadata files (e.g. realtime bitmaps and quota files) do not show up in the bulkstat output, which means that scrub-by-handle does not work; they can only be checked through a specific scrub type. Therefore, each scrub type calls xchk_metadata_inode_forks to check the metadata for whatever's in the file. Unfortunately, that function doesn't actually check the inode record itself. Refactor the function a bit to make that happen. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: don't warn about files that are exactly s_maxbytes longDarrick J. Wong
We can handle files that are exactly s_maxbytes bytes long; we just can't handle anything larger than that. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: ensure that single-owner file blocks are not owned by othersscrub-detect-rmapbt-gaps_2022-10-14Darrick J. Wong
For any file fork mapping that can only have a single owner, make sure that there are no other rmap owners for that mapping. This patch requires the more detailed checking provided by xfs_rmap_count_owners so that we can know how many rmap records for a given range of space had a matching owner, how many had a non-matching owner, and how many conflicted with the records that have a matching owner. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: teach scrub to check for sole ownership of metadata objectsDarrick J. Wong
Strengthen online scrub's checking even further by enabling us to check that a range of blocks are owned solely by a given owner. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: convert xfs_ialloc_has_inodes_at_extent to return keyfill scan resultsscrub-detect-inobt-gaps_2022-10-14Darrick J. Wong
Convert the xfs_ialloc_has_inodes_at_extent function to return keyfill scan results because for a given range of inode numbers, we might have no indexed inodes at all; the entire region might be allocated ondisk inodes; or there might be a mix of the two. Unfortunately, sparse inodes adds to the complexity, because each inode record can have holes, which means that we cannot use the generic btree _scan_keyfill function because we must look for holes in individual records to decide the result. On the plus side, online fsck can now detect sub-chunk discrepancies in the inobt. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: directly cross-reference the inode btrees with each otherDarrick J. Wong
Improve the cross-referencing of the two inode btrees by directly checking the free and hole state of each inode with the other btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: clean up broken eearly-exit code in the inode btree scrubberDarrick J. Wong
Corrupt inode chunks should cause us to exit early after setting the CORRUPT flag on the scrub state. While we're at it, collapse trivial helpers. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: ensure that all metadata and data blocks are not cow staging extentsscrub-detect-refcount-gaps_2022-10-14Darrick J. Wong
Make sure that all filesystem metadata blocks and file data blocks are not also marked as CoW staging extents. The extra checking added here was inspired by an actual VM host filesystem corruption incident due to bugs in the CoW handling of 4.x kernels. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check the reference counts of gaps in the refcount btreeDarrick J. Wong
Gaps in the reference count btree are also significant -- for these regions, there must not be any overlapping reverse mappings. We don't currently check this, so make the refcount scrubber more complete. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: mask key comparisons for keyspace fill scansDarrick J. Wong
For keyspace fullness scans, we want to be able to mask off the parts of the key that we don't care about. For most btree types we /do/ want the full keyspace, but for checking that a given space usage also has a full complement of rmapbt records (even if different/multiple owners) we need this masking so that we only track sparseness of rm_startblock, not the whole keyspace (which is extremely sparse). Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: refactor converting btree irec to btree keyDarrick J. Wong
We keep doing these conversions to support btree queries, so refactor this into a helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: replace xfs_btree_has_record with a general keyspace scannerDarrick J. Wong
The current implementation of xfs_btree_has_record returns true if it finds /any/ record within the given range. Unfortunately, that's not sufficient for scrub. We want to be able to tell if a range of keyspace for a btree is devoid of records, is totally mapped to records, or is somewhere in between. By forcing this to be a boolean, we were generally missing the "in between" case and returning incorrect results. Fix the API so that we can tell the caller which of those three is the current state. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check btree keys reflect the child blockbtree-key-enhancements_2022-10-14Darrick J. Wong
When scrub is checking a non-root btree block, it should make sure that the keys in the parent btree block accurately capture the keyspace that the child block stores. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: fix rmap key comparison functionsDarrick J. Wong
Keys for extent interval records in the reverse mapping btree are supposed to be computed as follows: (physical block, owner, fork, is_btree, offset) This provides users the ability to look up a reverse mapping from a file block mapping record -- start with the physical block; then if there are multiple records for the same block, move on to the owner; then the inode fork type; and so on to the file offset. However, the key comparison functions incorrectly remove the fork/bmbt information that's encoded in the on-disk offset. This means that lookup comparisons are only done with: (physical block, owner, offset) This means that queries can return incorrect results. On consistent filesystems this isn't an issue because bmbt blocks and blocks mapped to an attr fork cannot be shared, but this prevents us from detecting incorrect fork and bmbt flag bits in the rmap btree. A previous version of this patch forgot to keep the (un)written state flag masked during the comparison and caused a major regression in 5.9.x since unwritten extent conversion can update an rmap record without requiring key updates. Note that blocks cannot go directly from data fork to attr fork without being deallocated and reallocated, nor can they be added to or removed from a bmbt without a free/alloc cycle, so this should not cause any regressions. Found by fuzzing keys[1].attrfork = ones on xfs/371. Fixes: 4b8ed67794fe ("xfs: add rmap btree operations") Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: make sure aglen never goes negative in xfs_refcount_adjust_extentsrefcount-cow-domain_2022-10-14Darrick J. Wong
Prior to calling xfs_refcount_adjust_extents, we trimmed agbno/aglen such that the end of the range would not be in the middle of the record. If this is no longer the case, something is seriously wrong with the btree. Bail out with a corruption error. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: check deferred refcount op continuation parametersDarrick J. Wong
If we're in the middle of a deferred refcount operation and decide to roll the transaction to avoid overflowing the transaction space, we need to check the new agbno/aglen parameters that we're about to record in the new intent. Specifically, we need to check that the new extent is completely within the filesystem, and that continuation does not put us into a different AG. This should never happen, but if the keys of a node block are wrong, the refcount btree lookups performed during the adjust operation (and resumptions therein) can point us to the wrong record blocks. The refcount domain should prevent most of this, but this is a convenient place to double-check that everything is still ok. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: rename XFS_REFC_COW_START to _COWFLAGDarrick J. Wong
We've been (ab)using XFS_REFC_COW_START as both an integer quantity and a bit flag, even though it's *only* a bit flag. Rename the variable to reflect its nature and update the cast target since we're not supposed to be comparing it to xfs_agblock_t now. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: track cow/shared record domains explicitly in xfs_refcount_irecDarrick J. Wong
Just prior to committing the reflink code into upstream, the xfs maintainer at the time requested that I find a way to shard the refcount records into two domains -- one for records tracking shared extents, and a second for tracking CoW staging extents. The idea here was to minimize mount time CoW reclamation by pushing all the CoW records to the right edge of the keyspace, and it was accomplished by setting the upper bit in rc_startblock. We don't allow AGs to have more than 2^31 blocks, so the bit was free. Unfortunately, this was a very late addition to the codebase, so most of the refcount record processing code still treats rc_startblock as a u32 and pays no attention to whether or not the upper bit (the cow flag) is set. This is a weakness is theoretically exploitable, since we're not fully validating the incoming metadata records. Fuzzing demonstrates practical exploits of this weakness. If the cow flag of a node block key record is corrupted, a lookup operation can go to the wrong record block and start returning records from the wrong cow/shared domain. This causes the math to go all wrong (since cow domain is still implicit in the upper bit of rc_startblock) and we can crash the kernel by tricking xfs into jumping into a nonexistent AG and tripping over xfs_perag_get(mp, <nonexistent AG>) returning NULL. To fix this, start tracking the domain as an explicit part of struct xfs_refcount_irec, adjust all refcount functions to check the domain of a returned record, and alter the function definitions to accept them where necessary. Found by fuzzing keys[2].cowflag = add in xfs/464. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: refactor refcount record usage in xchk_refcountbt_recDarrick J. Wong
Consolidate the open-coded xfs_refcount_irec fields into an actual struct and use the existing _btrec_to_irec to decode the ondisk record. This will reduce code churn in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: move _irec structs to xfs_types.hDarrick J. Wong
Structure definitions for incore objects do not belong in the ondisk format header. Move them to the incore types header where they belong. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-10-14xfs: teach scrub to flag non-extents format cow forksscrub-bmap-enhancements_2022-10-14Darrick J. Wong
CoW forks only exist in memory, which means that they can only ever have an incore extent tree. Hence they must always be FMT_EXTENTS, so check this when we're scrubbing them. Signed-off-by: Darrick J. Wong <djwong@kernel.org>