Age | Commit message | Author
2021-12-15 | xfs: convert symlink repair to use swapext | [repair-symlink-swapext_2021-12-15] | Darrick J. Wong
Convert the symlink repair code to use extent swapping. This means we can eliminate the problem of symlinks with crosslinked blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: merge the two symlink inode_operations | Darrick J. Wong
It's really awkward that we maintain two separate VFS inode operations structures and functions for handling symlinks, because that means online repair potentially has to switch iops after fixing a symlink. Merge them into one to obviate the need for this. Furthermore, this fixes the issue of the VFS readlink functions retaining a pointer to if_data after the inode has been recycled; protects XFS against the VFS altering the contents of internal metadata buffers (if_data); and shuts down complaints about us accessing the inode data fork without taking any locks. In the past, we've skirted these issues because symlink targets are basically immutable, but this won't be the case with online repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: move symlink target write function to libxfs | Darrick J. Wong
Move xfs_symlink_write_target to xfs_symlink_remote.c so that kernel and mkfs can share the same function. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: move remote symlink target read function to libxfs | Darrick J. Wong
Move xfs_readlink_bmap_ilocked to xfs_symlink_remote.c so that the swapext code can use it to convert a remote format symlink back to shortform format after a metadata repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: move xfs_symlink_remote.c declarations to xfs_symlink_remote.h | Darrick J. Wong
Move declarations for libxfs symlink functions into a separate header file like we do for most everything else. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: remove XFS_IGET_INCORE | Darrick J. Wong
Nobody uses it, so get rid of it. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: don't freeze fs for rmap repairs | [repair-rmap-live_2021-12-15] | Darrick J. Wong
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: hook live realtime rmap operations during a repair operation | Darrick J. Wong
Hook the regular realtime rmap code when an rtrmapbt repair operation is running so that we can unlock the AGF buffer to scan the filesystem and keep the in-memory btree up to date during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: hook live rmap operations during a repair operation | Darrick J. Wong
Hook the regular rmap code when an rmapbt repair operation is running so that we can unlock the AGF buffer to scan the filesystem and keep the in-memory btree up to date during the scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: create a shadow rmap btree during realtime rmap repair | Darrick J. Wong
Create an in-memory btree of rmap records instead of an array. This enables us to do live record collection instead of freezing the fs. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: create a shadow rmap btree during rmap repair | Darrick J. Wong
Create an in-memory btree of rmap records instead of an array. This enables us to do live record collection instead of freezing the fs. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: connect in-memory btrees to xfiles | Darrick J. Wong
Add to our stubbed-out in-memory btrees the ability to connect them with an actual in-memory backing file (aka xfiles) and the necessary pieces to track free space in the xfile and flush dirty xfbtree buffers on demand, which we'll need for online repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: support in-memory btrees | Darrick J. Wong
Adapt the generic btree cursor code to be able to create a btree whose buffers come from a (presumably in-memory) buftarg with a header block that's specific to in-memory btrees. We'll connect this to other parts of online scrub in the next patches. Note that in-memory btrees always have a block size matching the system memory page size for efficiency reasons. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: support in-memory buffer cache targets | Darrick J. Wong
Allow the buffer cache to target in-memory files by connecting it to xfiles. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: teach buftargs to maintain their own buffer hashtable | Darrick J. Wong
Currently, cached buffers are indexed by per-AG hashtables. This works great for the data device, but won't work for in-memory btrees. Make it so that buftargs can index buffers too. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: dump xfiles for debugging purposes | Darrick J. Wong
Add a debug function to dump an xfile's contents for debug purposes. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: allow userspace to rebuild metadata structures | [defrag-freespace_2021-12-15] | Darrick J. Wong
Add a new (superuser-only) flag to the online metadata repair ioctl to force it to rebuild structures, even if they're not broken. We will use this to move metadata structures out of the way during a free space defragmentation operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: fallocate free space into a file | Darrick J. Wong
Add a new fallocate mode to map free physical space into a file, at the same file offset as if the file were a sparse image of the physical device backing the filesystem. The intent here is to use this to prototype a free space defragmentation tool. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: capture the offset and length in fallocate tracepoints | Darrick J. Wong
Change the class of the fallocate tracepoints to capture the offset and length of the requested operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: export reference count information to userspace | [report-refcounts_2021-12-15] | Darrick J. Wong
Export refcount info to userspace so we can prototype a sharing-aware defrag/fs rearranging tool. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: allow queued AG intents to drain before scrubbing | [scrub-drain-intents_2021-12-15] | Darrick J. Wong
Currently, online scrub isn't sufficiently careful about quiescing allocation groups before checking them. While scrub does take the AG header locks, it doesn't serialize against chains of AG update intents that are being processed concurrently. If there's a collision, cross-referencing between data structures (e.g. rmapbt and refcountbt) can yield false corruption events; if repair is running, this results in incorrect repairs. Fix this by adding to the perag structure the count of active intents and make scrub wait until there aren't any to continue. This is a little stupid since transactions can queue intents without taking buffer locks, but we'll also wait for those transactions. XXX: should have instead a per-ag rwsem that gets taken as soon as the AG[IF] are locked and stays held until the transaction commits or moves on to the next AG? would we rather have a six lock so that intents can take an ix lock, and not have to upgrade to x until we actually want to make changes to that ag? is that how those even work?? Signed-off-by: Darrick J. Wong <djwong@kernel.org>
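The intent-counting idea described above can be illustrated with a minimal userspace sketch. This is not the kernel implementation: `struct perag_sketch` and the three helper functions are hypothetical names, C11 atomics stand in for whatever counter the patch actually uses, and a real scrub would sleep until the count reaches zero rather than polling a predicate.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-AG structure; not the kernel's xfs_perag. */
struct perag_sketch {
	atomic_int intent_count;	/* intents queued against this AG */
};

/* Called when a transaction queues an intent that targets this AG. */
static void intent_queue(struct perag_sketch *pag)
{
	atomic_fetch_add(&pag->intent_count, 1);
}

/* Called when the queued intent has been fully processed. */
static void intent_finish(struct perag_sketch *pag)
{
	atomic_fetch_sub(&pag->intent_count, 1);
}

/* Scrub may cross-reference AG metadata only once no intents are pending. */
static bool scrub_may_proceed(struct perag_sketch *pag)
{
	return atomic_load(&pag->intent_count) == 0;
}
```

In this sketch scrub would call `scrub_may_proceed()` after taking the AG header locks and back off while it returns false.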
2021-12-15 | xfs: move files to orphanage instead of letting nlinks drop to zero | Darrick J. Wong
If we encounter an inode with a nonzero link count but zero observed links, move it to the orphanage. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
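A rough sketch of the decision this describes, assuming the repair code has both the link count recorded in the inode and the count observed by scanning directories. The enum and function names are hypothetical, and the "fix the count" branch is an assumption about how the remaining mismatches are handled.

```c
#include <stdint.h>

/* Hypothetical outcomes of comparing recorded and observed link counts. */
enum nlink_action {
	NLINK_OK,			/* counts agree, nothing to do */
	NLINK_FIX_COUNT,		/* assumed: rewrite the recorded count */
	NLINK_MOVE_TO_ORPHANAGE,	/* nonzero recorded count, no links found */
};

static enum nlink_action nlink_decide(uint32_t recorded, uint32_t observed)
{
	if (recorded == observed)
		return NLINK_OK;
	/* The inode claims to be linked but nothing points at it. */
	if (recorded != 0 && observed == 0)
		return NLINK_MOVE_TO_ORPHANAGE;
	return NLINK_FIX_COUNT;
}
```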
2021-12-15 | xfs: teach repair to fix file nlinks | [scrub-nlinks_2021-12-15] | Darrick J. Wong
Fix the nlinks now too. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: teach scrub to check file nlinks | Darrick J. Wong
Copy-pasta the online quotacheck code to check inode link counts too. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: report health of inode link counts | Darrick J. Wong
Report on the health of the inode link counts. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: experiment with dontcache when scanning inodes | [vectorized-scrub_2021-12-15] | Darrick J. Wong
Add some experimental flags to drop inodes from the cache after a scan. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: introduce vectored scrub mode | Darrick J. Wong
Introduce a variant on XFS_SCRUB_METADATA that allows for vectored mode. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: whine to dmesg when we encounter errors | Darrick J. Wong
Forward everything scrub whines about to dmesg. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: track deferred ops statistics | Darrick J. Wong
Track some basic statistics on how hard we're pushing the defer ops. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: Don't free EOF blocks on close when extent size hints are set | [reduce-eofblocks-gc-on-close_2021-12-15] | Dave Chinner
When we have a workload that does open/write/close on files with extent size hints set in parallel with other allocation, the file becomes rapidly fragmented. This is due to close() calling xfs_release() and removing the preallocated extent beyond EOF. This occurs for both buffered and direct writes that append to files with extent size hints.

The existing open/write/close heuristic in xfs_release() does not catch this because writes to files using extent size hints do not use delayed allocation and hence do not leave delayed allocation blocks allocated on the inode that can be detected in xfs_release(). Hence XFS_IDIRTY_RELEASE never gets set.

In xfs_file_release(), we can tell whether the inode has extent size hints set and skip EOF block truncation. We add this check to xfs_can_free_eofblocks() so that post-EOF preallocated extents are treated like intentional preallocation and persist unless directly removed by userspace.

Before:

Test 2: Extent size hint fragmentation counts
/mnt/scratch/file.0: 1002
/mnt/scratch/file.1: 1002
/mnt/scratch/file.2: 1002
/mnt/scratch/file.3: 1002
/mnt/scratch/file.4: 1002
/mnt/scratch/file.5: 1002
/mnt/scratch/file.6: 1002
/mnt/scratch/file.7: 1002

After:

Test 2: Extent size hint fragmentation counts
/mnt/scratch/file.0: 4
/mnt/scratch/file.1: 4
/mnt/scratch/file.2: 4
/mnt/scratch/file.3: 4
/mnt/scratch/file.4: 4
/mnt/scratch/file.5: 4
/mnt/scratch/file.6: 4
/mnt/scratch/file.7: 4

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
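The check added to xfs_can_free_eofblocks() might look roughly like the following userspace sketch. The struct, its fields, and the `user_requested` parameter are hypothetical simplifications; the real function considers considerably more inode state than is modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified view of the inode state involved. */
struct eof_inode_sketch {
	uint32_t extsize_hint;		/* 0 means no extent size hint is set */
	bool has_posteof_blocks;	/* blocks are mapped beyond EOF */
};

/*
 * Decide whether post-EOF blocks may be trimmed.  With an extent size hint
 * set, post-EOF blocks are treated like intentional preallocation and are
 * kept unless userspace explicitly asked for their removal.
 */
static bool can_free_eofblocks_sketch(const struct eof_inode_sketch *ip,
				      bool user_requested)
{
	if (!ip->has_posteof_blocks)
		return false;
	if (ip->extsize_hint != 0 && !user_requested)
		return false;
	return true;
}
```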
2021-12-15 | xfs: don't free EOF blocks on read close | Dave Chinner
When we have a workload that does open/read/close in parallel with other allocation, the file becomes rapidly fragmented. This is due to close() calling xfs_release() and removing the speculative preallocation beyond EOF.

The existing open/*/close heuristic in xfs_release() does not catch this as a sync writer does not leave delayed allocation blocks allocated on the inode for later writeback that can be detected in xfs_release() and hence XFS_IDIRTY_RELEASE never gets set.

In xfs_file_release(), we know more about the released file context, and so we need to communicate some of the details to xfs_release() so it can do the right thing here and skip EOF block truncation. This defers the EOF block cleanup for synchronous write contexts to the background EOF block cleaner which will clean up within a few minutes.

Before:

Test 1: sync write fragmentation counts
/mnt/scratch/file.0: 919
/mnt/scratch/file.1: 916
/mnt/scratch/file.2: 919
/mnt/scratch/file.3: 920
/mnt/scratch/file.4: 920
/mnt/scratch/file.5: 921
/mnt/scratch/file.6: 916
/mnt/scratch/file.7: 918

After:

Test 1: sync write fragmentation counts
/mnt/scratch/file.0: 24
/mnt/scratch/file.1: 24
/mnt/scratch/file.2: 11
/mnt/scratch/file.3: 24
/mnt/scratch/file.4: 3
/mnt/scratch/file.5: 24
/mnt/scratch/file.6: 24
/mnt/scratch/file.7: 23

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[darrick: wordsmithing, fix commit message]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: only free posteof blocks on first close | Darrick J. Wong
Certain workloads fragment files on XFS very badly, such as a software package that creates a number of threads, each of which repeatedly run the sequence: open a file, perform a synchronous write, and close the file, which defeats the speculative preallocation mechanism. We work around this problem by only deleting posteof blocks the /first/ time a file is closed to preserve the behavior that unpacking a tarball lays out files one after the other with no gaps. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
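The "first close only" rule can be sketched as a per-inode flag that is set the first time the file is released. This is an illustrative model, not the patch itself: the struct and function names are hypothetical, and locking around the flag is omitted.

```c
#include <stdbool.h>

/* Hypothetical per-inode state; the kernel tracks this with inode flags. */
struct release_inode_sketch {
	bool closed_before;	/* set once the file has been closed at least once */
};

/*
 * Returns true if this close should trim post-EOF blocks.  Only the first
 * close trims, so a tarball unpack still packs files tightly, while a
 * repeated open/write/close loop keeps its speculative preallocation.
 */
static bool release_should_trim(struct release_inode_sketch *ip)
{
	if (ip->closed_before)
		return false;
	ip->closed_before = true;
	return true;
}
```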
2021-12-15 | xfs: enable realtime quota again | [realtime-quotas_2021-12-15] | Darrick J. Wong
Enable quotas for the realtime device. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: fix rt growfs quota accounting | Darrick J. Wong
When growing the realtime bitmap or summary inodes, use xfs_trans_alloc_inode to reserve quota for the blocks that could be allocated to the file. Although we never enforce limits against the root dquot, making a reservation means that the bmap code will update the quota block count, which is necessary for correct accounting. Found by running xfs/521. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: fix chown with rt quota | Darrick J. Wong
Make chown's quota adjustments work with realtime files. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: support realtime reflink with an extent size that isn't a power of 2 | Darrick J. Wong
Add the necessary alignment checking code to the reflink remap code to ensure that remap requests are aligned to rt extent boundaries if the realtime extent size isn't a power of two. The VFS helpers assume that they can use the usual (blocksize - 1) masking to avoid slow 64-bit division, but since XFS is special we won't make everyone pay that cost for our weird edge case. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
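To see why the usual masking trick breaks down, consider this small standalone sketch. The function names are hypothetical; the point is only that `pos & (unit - 1)` tests alignment correctly when `unit` is a power of two, and that other sizes need a real 64-bit modulo.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask test: gives the right answer only when unit is a power of two. */
static bool aligned_pow2(uint64_t pos, uint64_t unit)
{
	return (pos & (unit - 1)) == 0;
}

/* Modulo test: correct for any nonzero unit, at the cost of a 64-bit division. */
static bool aligned_any(uint64_t pos, uint64_t unit)
{
	return pos % unit == 0;
}

static bool is_pow2(uint64_t unit)
{
	return unit != 0 && (unit & (unit - 1)) == 0;
}

/* Hypothetical remap check: take the slow path only when it is needed. */
static bool remap_range_aligned(uint64_t pos, uint64_t len, uint64_t unit)
{
	if (is_pow2(unit))
		return aligned_pow2(pos, unit) && aligned_pow2(len, unit);
	return aligned_any(pos, unit) && aligned_any(len, unit);
}
```

With a 12288-byte extent (three 4096-byte blocks), the mask test wrongly reports offset 4096 as aligned and offset 12288 as misaligned, which is why the power-of-two shortcut cannot be reused for such sizes.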
2021-12-15 | xfs: allow reflink on the rt volume when extent size is larger than 1 rt block | [realtime-reflink-extsize_2021-12-15] | Darrick J. Wong
Make the necessary tweaks to the reflink remapping code to support remapping on the realtime volume when the rt extent size is larger than a single rt block. We need to check that the remap arguments from userspace are aligned to a rt extent boundary, and that the length is always aligned, even if the kernel tried to round it up to EOF for us. XFS can only map and remap full rt extents, so we have to be a little more strict about the alignment there. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: enable extent size hints for CoW when rtextsize > 1 | Darrick J. Wong
CoW extent size hints are not allowed on filesystems that have large realtime extents because we only want to perform the minimum required amount of write-around (aka write amplification) for shared extents. On filesystems where rtextsize > 1, allocations can only be done in units of full rt extents, which means that we can only map an entire rt extent's worth of blocks into the data fork. Hole punch requests become conversions to unwritten if the request isn't aligned properly. Because a copy-write fundamentally requires remapping, this means that we also can only do copy-writes of a full rt extent. This is too expensive for large hint sizes, since it's all or nothing. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: extend writeback requests to handle rt cow correctly | Darrick J. Wong
If we have shared realtime files and the rt extent size is larger than a single fs block, we need to extend writeback requests to be aligned to rt extent size granularity because we cannot share partial rt extents. The front end should have set us up for this by dirtying the relevant ranges. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
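Extending a request to rt extent granularity amounts to rounding the start down and the end up to the nearest extent boundary. The sketch below assumes a half-open byte range and an extent size given in bytes; `struct byte_range` and `expand_to_rtext()` are hypothetical names, and modulo arithmetic is used so that extent sizes that are not powers of two also work.

```c
#include <stdint.h>

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/* Widen a range so that both ends fall on rt extent boundaries. */
static struct byte_range expand_to_rtext(uint64_t start, uint64_t end,
					 uint64_t rtext_bytes)
{
	struct byte_range r;
	uint64_t rem = end % rtext_bytes;

	r.start = start - (start % rtext_bytes);		/* round down */
	r.end = rem ? end + (rtext_bytes - rem) : end;	/* round up */
	return r;
}
```

For example, with a 12288-byte rt extent, a request covering bytes 5000 to 13000 widens to 0 to 24576, while a request that is already aligned is returned unchanged.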
2021-12-15 | xfs: forcibly convert unwritten blocks within an rt extent before sharing | Darrick J. Wong
As noted in the previous patch, XFS can only unmap and map full rt extents. This means that we cannot stop mid-extent for any reason, including stepping around unwritten/written extents. Second, the reflink and CoW mechanisms were not designed to handle shared unwritten extents, so we have to do something to get rid of them. If the user asks us to remap two files, we must scan both ranges beforehand to convert any unwritten extents that are not aligned to rt extent boundaries into zeroed written extents before sharing. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: enable CoW when rt extent size is larger than 1 block | Darrick J. Wong
Copy on write encounters a major plot twist when the file being CoW'd lives on the realtime volume and the realtime extent size is larger than a single filesystem block. XFS can only unmap and remap full rt extents, which means that allocations are always done in units of full rt extents, and a request to unmap less than one extent is treated as a request to convert an extent to unwritten status. This behavioral quirk is not compatible with the existing CoW mechanism, so we have to intercept every path through which files can be modified to ensure that we dirty an entire rt extent at once so that we can remap a full rt extent. Use the existing VFS unshare functions to dirty the page cache to set that up. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | iomap: set up for COWing around pages | Darrick J. Wong
In anticipation of enabling reflink on the realtime volume where the allocation unit is larger than a page, create an iomap function to dirty arbitrary parts of a file's page cache so that when we dirty part of a file that could undergo a COW extent, we can dirty an entire allocation unit's worth of pages. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | vfs: explicitly pass the block size to the remap prep function | Darrick J. Wong
Make it so that filesystems can pass an explicit blocksize to the remap prep function. This enables filesystems whose fundamental allocation units are /not/ the same as the blocksize to ensure that the remapping checks are aligned properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: enable realtime reflink | [realtime-reflink_2021-12-15] | Darrick J. Wong
Enable reflink for realtime devices, sort of. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: repair inodes that have a refcount btree in the data fork | Darrick J. Wong
Plumb knowledge of refcount btrees into the inode core repair code. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: online repair of the realtime refcount btree | Darrick J. Wong
Port the data device's refcount btree repair code to the realtime refcount btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: capture realtime CoW staging extents when rebuilding rt rmapbt | Darrick J. Wong
Walk the realtime refcount btree to find the CoW staging extents when we're rebuilding the realtime rmap btree. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: walk the rt reference count tree when rebuilding rmap | Darrick J. Wong
When we're rebuilding the data device rmap, if we encounter a "refcount" format fork, we have to walk the (realtime) refcount btree inode to build the appropriate mappings. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: check new rtbitmap records against rt refcount btree | Darrick J. Wong
When we're rebuilding the realtime bitmap, check the proposed free extents against the rt refcount btree to make sure we don't commit any grievous errors. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-15 | xfs: detect and repair misaligned rtinherit directory cowextsize hints | Darrick J. Wong
If we encounter a directory that has been configured to pass on a CoW extent size hint to a new realtime file and the hint isn't an integer multiple of the rt extent size, we should flag the hint for administrative review and/or turn it off because that is a misconfiguration. Signed-off-by: Darrick J. Wong <djwong@kernel.org>