summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2022-01-11xfs: take the ILOCK when readdir inspects directory mapping dataxfs-5.17-merge_2022-01-11Darrick J. Wong
I was poking around in the directory code while diagnosing online fsck bugs, and noticed that xfs_readdir doesn't actually take the directory ILOCK when it calls xfs_dir2_isblock. xfs_dir_open most probably loaded the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so we're protected against writer threads, but we really need to follow the locking model like we do in other places. To avoid unnecessarily cycling the ILOCK for fairly small directories, change the block/leaf _getdents functions to consume the ILOCK hold that the parent readdir function took to decide on a _getdents implementation. It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK in the appropriate mode during lookups and writes, and we don't want to be holding the ILOCK when we copy directory entries to userspace in case there's a page fault. We really only need it to protect against data fork lookups, like we do for other files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: warn about inodes with project id of -1xfs-5.17-merge-2Darrick J. Wong
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but the kernel hasn't always validated FSSETXATTR correctly. Flag this as something for the sysadmin to check out. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: hold quota inode ILOCK_EXCL until the end of dqallocDarrick J. Wong
Online fsck depends on callers holding ILOCK_EXCL from the time they decide to update a block mapping until after they've updated the reverse mapping records to guarantee the stability of both mapping records. Unfortunately, the quota code drops ILOCK_EXCL at the first transaction roll in the dquot allocation process, which breaks that assertion. This leads to sporadic failures in the online rmap repair code if the repair code grabs the AGF after bmapi_write maps a new block into the quota file's data fork but before it can finish the deferred rmap update. Fix this by rewriting the function to hold the ILOCK until after the transaction commit like all other bmap updates do, and get rid of the dqread wrapper that does nothing but complicate the codebase. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: Remove redundant assignment of mpJiapeng Chong
mp is being initialized to log->l_mp but this is never read as record is overwritten later on. Remove the redundant assignment. Cleans up the following clang-analyzer warning: fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06xfs: reduce kvmalloc overhead for CIL shadow buffersDave Chinner
Oh, let me count the ways that the kvmalloc API sucks dog eggs. The problem is when we are logging lots of large objects, we hit kvmalloc really damn hard with costly order allocations, and behaviour utterly sucks: - 49.73% xlog_cil_commit - 31.62% kvmalloc_node - 29.96% __kmalloc_node - 29.38% kmalloc_large_node - 29.33% __alloc_pages - 24.33% __alloc_pages_slowpath.constprop.0 - 18.35% __alloc_pages_direct_compact - 17.39% try_to_compact_pages - compact_zone_order - 15.26% compact_zone 5.29% __pageblock_pfn_to_page 3.71% PageHuge - 1.44% isolate_migratepages_block 0.71% set_pfnblock_flags_mask 1.11% get_pfnblock_flags_mask - 0.81% get_page_from_freelist - 0.59% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 3.24% try_to_free_pages - 3.14% shrink_node - 2.94% shrink_slab.constprop.0 - 0.89% super_cache_count - 0.66% xfs_fs_nr_cached_objects - 0.65% xfs_reclaim_inodes_count 0.55% xfs_perag_get_tag 0.58% kfree_rcu_shrink_count - 2.09% get_page_from_freelist - 1.03% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 4.88% get_page_from_freelist - 3.66% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 1.63% __vmalloc_node - __vmalloc_node_range - 1.10% __alloc_pages_bulk - 0.93% __alloc_pages - 0.92% get_page_from_freelist - 0.89% rmqueue_bulk - 0.69% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath 13.73% memcpy_erms - 2.22% kvfree On this workload, that's almost a dozen CPUs all trying to compact and reclaim memory inside kvmalloc_node at the same time. Yet it is regularly falling back to vmalloc despite all that compaction, page and shrinker reclaim that direct reclaim is doing. Copying all the metadata is taking far less CPU time than allocating the storage! Direct reclaim should be considered extremely harmful. This is a high frequency, high throughput, CPU usage and latency sensitive allocation. We've got memory there, and we're using kvmalloc to allow memory allocation to avoid doing lots of work to try to do contiguous allocations. Except it still does *lots of costly work* that is unnecessary. Worse: the only way to avoid the slowpath page allocation trying to do compaction on costly allocations is to turn off direct reclaim (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags). Unfortunately, the stupid kvmalloc API then says "oh, this isn't a GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off the vmalloc fallback, and this leads to almost instant OOM problems which ends up in filesystems deadlocks, shutdowns and/or kernel crashes. I want some basic kvmalloc behaviour: - kmalloc for a contiguous range with fail fast semantics - no compaction direct reclaim if the allocation enters the slow path. - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails The really, really stupid part about this is these kvmalloc() calls are run under memalloc_nofs task context, so all the allocations are always reduced to GFP_NOFS regardless of the fact that kvmalloc requires GFP_KERNEL to be passed in. IOWs, we're already telling kvmalloc to behave differently to the gfp flags we pass in, but it still won't allow vmalloc to be run with anything other than GFP_KERNEL. So, this patch open codes the kvmalloc() in the commit path to have the above described behaviour. The result is we more than halve the CPU time spend doing kvmalloc() in this path and transaction commits with 64kB objects in them more than doubles. i.e. we get ~5x reduction in CPU usage per costly-sized kvmalloc() invocation and the profile looks like this: - 37.60% xlog_cil_commit 16.01% memcpy_erms - 8.45% __kmalloc - 8.04% kmalloc_order_trace - 8.03% kmalloc_order - 7.93% alloc_pages - 7.90% __alloc_pages - 4.05% __alloc_pages_slowpath.constprop.0 - 2.18% get_page_from_freelist - 1.77% wake_all_kswapds .... - __wake_up_common_lock - 0.94% _raw_spin_lock_irqsave - 3.72% get_page_from_freelist - 2.43% _raw_spin_lock_irqsave - 5.72% vmalloc - 5.72% __vmalloc_node_range - 4.81% __get_vm_area_node.constprop.0 - 3.26% alloc_vmap_area - 2.52% _raw_spin_lock - 1.46% _raw_spin_lock 0.56% __alloc_pages_bulk - 4.66% kvfree - 3.25% vfree - __vfree - 3.23% __vunmap - 1.95% remove_vm_area - 1.06% free_vmap_area_noflush - 0.82% _raw_spin_lock - 0.68% _raw_spin_lock - 0.92% _raw_spin_lock - 1.40% kfree - 1.36% __free_pages - 1.35% __free_pages_ok - 1.02% _raw_spin_lock_irqsave It's worth noting that over 50% of the CPU time spent allocating these shadow buffers is now spent on spinlocks. So the shadow buffer allocation overhead is greatly reduced by getting rid of direct reclaim from kmalloc, and could probably be made even less costly if vmalloc() didn't use global spinlocks to protect it's structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06xfs: sysfs: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the xfs sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: linux-xfs@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-22xfs: prevent UAF in xfs_log_item_in_current_chkptDarrick J. Wong
While I was running with KASAN and lockdep enabled, I stumbled upon an KASAN report about a UAF to a freed CIL checkpoint. Looking at the comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me that the original patch to xfs_defer_finish_noroll should have done something to lock the CIL to prevent it from switching the CIL contexts while the predicate runs. For upper level code that needs to know if a given log item is new enough not to need relogging, add a new wrapper that takes the CIL context lock long enough to sample the current CIL context. This is kind of racy in that the CIL can switch the contexts immediately after sampling, but that's ok because the consequence is that the defer ops code is a little slow to relog items. ================================================================== BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs] Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999 CPU: 1 PID: 527999 Comm: fsstress Tainted: G D 5.16.0-rc4-xfsx #rc4 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x140 kasan_report.cold+0x83/0xdf xfs_log_item_in_current_chkpt+0x139/0x160 xfs_defer_finish_noroll+0x3bb/0x1e30 __xfs_trans_commit+0x6c8/0xcf0 xfs_reflink_remap_extent+0x66f/0x10e0 xfs_reflink_remap_blocks+0x2dd/0xa90 xfs_file_remap_range+0x27b/0xc30 vfs_dedupe_file_range_one+0x368/0x420 vfs_dedupe_file_range+0x37c/0x5d0 do_vfs_ioctl+0x308/0x1260 __x64_sys_ioctl+0xa1/0x170 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2c71a2950b Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004 RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20 </TASK> Allocated by task 464064: kasan_save_stack+0x1e/0x50 __kasan_kmalloc+0x81/0xa0 kmem_alloc+0xcd/0x2c0 [xfs] xlog_cil_ctx_alloc+0x17/0x1e0 [xfs] xlog_cil_push_work+0x141/0x13d0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Freed by task 51: kasan_save_stack+0x1e/0x50 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xed/0x130 slab_free_freelist_hook+0x7f/0x160 kfree+0xde/0x340 xlog_cil_committed+0xbfd/0xfe0 [xfs] xlog_cil_process_committed+0x103/0x1c0 [xfs] xlog_state_do_callback+0x45d/0xbd0 [xfs] xlog_ioend_work+0x116/0x1c0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Last potentially related work creation: kasan_save_stack+0x1e/0x50 __kasan_record_aux_stack+0xb7/0xc0 insert_work+0x48/0x2e0 __queue_work+0x4e7/0xda0 queue_work_on+0x69/0x80 xlog_cil_push_now.isra.0+0x16b/0x210 [xfs] xlog_cil_force_seq+0x1b7/0x850 [xfs] xfs_log_force_seq+0x1c7/0x670 [xfs] xfs_file_fsync+0x7c1/0xa60 [xfs] __x64_sys_fsync+0x52/0x80 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88804ea5f600 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 8 bytes inside of 256-byte region [ffff88804ea5f600, ffff88804ea5f700) The buggy address belongs to the page: page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e head:ffffea00013a9780 order:1 compound_mapcount:0 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff) raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40 raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()xfs-5.17-merge-1Dan Carpenter
The "bufsize" comes from the root user. If "bufsize" is negative then, because of type promotion, neither of the validation checks at the start of the function are able to catch it: if (bufsize < sizeof(struct xfs_attrlist) || bufsize > XFS_XATTR_LIST_MAX) return -EINVAL; This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in kvmalloc_node(). Fix this by changing the type from int to size_t. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: Fix comments mentioning xfs_iallocYang Xu
Since kernel commit 1abcf261016e ("xfs: move on-disk inode allocation out of xfs_ialloc()"), xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments. Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: check sb_meta_uuid for dabuf buffer recoveryDave Chinner
Got a report that a repeated crash test of a container host would eventually fail with a log recovery error preventing the system from mounting the root filesystem. It manifested as a directory leaf node corruption on writeback like so: XFS (loop0): Mounting V5 Filesystem XFS (loop0): Starting recovery (logdev: internal) XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b ........=....... 00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc .......X...).... 00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23 ..x..~J}.S...G.# 00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00 .........C...... 00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a ................ 00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50 .5y....0.......P 00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4 .@.......A...... 00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c .b.......P!A.... XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed Tracing indicated that we were recovering changes from a transaction at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57. That is, log recovery was overwriting a buffer with newer changes on disk than was in the transaction. Tracing indicated that we were hitting the "recovery immediately" case in xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the buffer. The code was extracting the LSN correctly, then ignoring it because the UUID in the buffer did not match the superblock UUID. The problem arises because the UUID check uses the wrong UUID - it should be checking the sb_meta_uuid, not sb_uuid. This filesystem has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the correct matching sb_meta_uuid in it, it's just the code checked it against the wrong superblock uuid. The is no corruption in the filesystem, and failing to recover the buffer due to a write verifier failure means the recovery bug did not propagate the corruption to disk. Hence there is no corruption before or after this bug has manifested, the impact is limited simply to an unmountable filesystem.... This was missed back in 2015 during an audit of incorrect sb_uuid usage that resulted in commit fcfbe2c4ef42 ("xfs: log recovery needs to validate against sb_meta_uuid") that fixed the magic32 buffers to validate against sb_meta_uuid instead of sb_uuid. It missed the magicda buffers.... Fixes: ce748eaa65f2 ("xfs: create new metadata UUID field and incompat flag") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: fix a bug in the online fsck directory leaf1 bestcount checkDarrick J. Wong
When xfs_scrub encounters a directory with a leaf1 block, it tries to validate that the leaf1 block's bestcount (aka the best free count of each directory data block) is the correct size. Previously, this author believed that comparing bestcount to the directory isize (since directory data blocks are under isize, and leaf/bestfree blocks are above it) was sufficient. Unfortunately during testing of online repair, it was discovered that it is possible to create a directory with a hole between the last directory block and isize. The directory code seems to handle this situation just fine and xfs_repair doesn't complain, which effectively makes this quirk part of the disk format. Fix the check to work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: only run COW extent recovery when there are no live extentsDarrick J. Wong
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. In the first patch, we fixed the race condition in (2) so that (A) will always flush the COW fork. In this second patch, we move the _recover_cow call to the initial mount call in (0) for safety. As mentioned previously, xfs_reflink_recover_cow walks the refcount btree looking for COW staging extents, and frees them. This was intended to be run at mount time (when we know there are no live inodes) to clean up any leftover staging events that may have been left behind during an unclean shutdown. As a time "optimization" for readonly mounts, we deferred this to the ro->rw transition, not realizing that any failure to clean all COW forks during a rw->ro transition would result in catastrophic corruption. Therefore, remove this optimization and only run the recovery routine when we're guaranteed not to have any COW staging extents anywhere, which means we always run this at mount time. While we're at it, move the callsite to xfs_log_mount_finish because any refcount btree expansion (however unlikely given that we're removing records from the right side of the index) must be fed by a per-AG reservation, which doesn't exist in its current location. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: don't expose internal symlink metadata buffers to the vfsDarrick J. Wong
Ian Kent reported that for inline symlinks, it's possible for vfs_readlink to hang on to the target buffer returned by _vn_get_link_inline long after it's been freed by xfs inode reclaim. This is a layering violation -- we should never expose XFS internals to the VFS. When the symlink has a remote target, we allocate a separate buffer, copy the internal information, and let the VFS manage the new buffer's lifetime. Let's adapt the inline code paths to do this too. It's less efficient, but fixes the layering violation and avoids the need to adapt the if_data lifetime to rcu rules. Clearly I don't care about readlink benchmarks. As a side note, this fixes the minor locking violation where we can access the inode data fork without taking any locks; proper locking (and eliminating the possibility of having to switch inode_operations on a live inode) is essential to online repair coordinating repairs correctly. Reported-by: Ian Kent <raven@themaw.net> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: fix quotaoff mutex usage now that we don't support disabling itDarrick J. Wong
Prior to commit 40b52225e58c ("xfs: remove support for disabling quota accounting on a mounted file system"), we used the quotaoff mutex to protect dquot operations against quotaoff trying to pull down dquots as part of disabling quota. Now that we only support turning off quota enforcement, the quotaoff mutex only protects changes in m_qflags/sb_qflags. We don't need it to protect dquots, which means we can remove it from setqlimits and the dquot scrub code. While we're at it, fix the function that forces quotacheck, since it should have been taking the quotaoff mutex. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: shut down filesystem if we xfs_trans_cancel with deferred work itemsDarrick J. Wong
While debugging some very strange rmap corruption reports in connection with the online directory repair code. I root-caused the error to the following incorrect sequence: <start repair transaction> <expand directory, causing a deferred rmap to be queued> <roll transaction> <cancel transaction> Obviously, we should have committed the transaction instead of cancelling it. Thinking more broadly, however, xfs_trans_cancel should have warned us that we were throwing away work item that we already committed to performing. This is not correct, and we need to shut down the filesystem. Change xfs_trans_cancel to complain in the loudest manner if we're cancelling any transaction with deferred work items attached. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-07xfs: remove all COW fork extents when remounting readonlyxfs-5.16-fixes-3Darrick J. Wong
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. Solve this race by forcing the xfs_blockgc_free_space to run synchronously, which causes xfs_icwalk to return to inodes that were skipped because the blockgc code couldn't take the IOLOCK. This is safe to do here because the VFS has already prohibited new writer threads. Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-12-01xfs: remove incorrect ASSERT in xfs_renamexfs-fixes-5.16_2021-12-08xfs-fixes-5.16_2021-12-07xfs-5.16-fixes-2Eric Sandeen
This ASSERT in xfs_rename is a) incorrect, because (RENAME_WHITEOUT|RENAME_NOREPLACE) is a valid combination, and b) unnecessary, because actual invalid flag combinations are already handled at the vfs level in do_renameat2() before we get called. So, remove it. Reported-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24xfs: remove xfs_inew_waitxfs-5.16-fixes-1Christoph Hellwig
With the remove of xfs_dqrele_all_inodes, xfs_inew_wait and all the infrastructure used to wake the XFS_INEW bit waitqueue is unused. Reported-by: kernel test robot <lkp@intel.com> Fixes: 777eb1fa857e ("xfs: remove xfs_dqrele_all_inodes") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-24xfs: Fix the free logic of state in xfs_attr_node_hasnameYang Xu
When testing xfstests xfs/126 on lastest upstream kernel, it will hang on some machine. Adding a getxattr operation after xattr corrupted, I can reproduce it 100%. The deadlock as below: [983.923403] task:setfattr state:D stack: 0 pid:17639 ppid: 14687 flags:0x00000080 [ 983.923405] Call Trace: [ 983.923410] __schedule+0x2c4/0x700 [ 983.923412] schedule+0x37/0xa0 [ 983.923414] schedule_timeout+0x274/0x300 [ 983.923416] __down+0x9b/0xf0 [ 983.923451] ? xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs] [ 983.923453] down+0x3b/0x50 [ 983.923471] xfs_buf_lock+0x33/0xf0 [xfs] [ 983.923490] xfs_buf_find.isra.29+0x3c8/0x5f0 [xfs] [ 983.923508] xfs_buf_get_map+0x4c/0x320 [xfs] [ 983.923525] xfs_buf_read_map+0x53/0x310 [xfs] [ 983.923541] ? xfs_da_read_buf+0xcf/0x120 [xfs] [ 983.923560] xfs_trans_read_buf_map+0x1cf/0x360 [xfs] [ 983.923575] ? xfs_da_read_buf+0xcf/0x120 [xfs] [ 983.923590] xfs_da_read_buf+0xcf/0x120 [xfs] [ 983.923606] xfs_da3_node_read+0x1f/0x40 [xfs] [ 983.923621] xfs_da3_node_lookup_int+0x69/0x4a0 [xfs] [ 983.923624] ? kmem_cache_alloc+0x12e/0x270 [ 983.923637] xfs_attr_node_hasname+0x6e/0xa0 [xfs] [ 983.923651] xfs_has_attr+0x6e/0xd0 [xfs] [ 983.923664] xfs_attr_set+0x273/0x320 [xfs] [ 983.923683] xfs_xattr_set+0x87/0xd0 [xfs] [ 983.923686] __vfs_removexattr+0x4d/0x60 [ 983.923688] __vfs_removexattr_locked+0xac/0x130 [ 983.923689] vfs_removexattr+0x4e/0xf0 [ 983.923690] removexattr+0x4d/0x80 [ 983.923693] ? __check_object_size+0xa8/0x16b [ 983.923695] ? strncpy_from_user+0x47/0x1a0 [ 983.923696] ? getname_flags+0x6a/0x1e0 [ 983.923697] ? _cond_resched+0x15/0x30 [ 983.923699] ? __sb_start_write+0x1e/0x70 [ 983.923700] ? mnt_want_write+0x28/0x50 [ 983.923701] path_removexattr+0x9b/0xb0 [ 983.923702] __x64_sys_removexattr+0x17/0x20 [ 983.923704] do_syscall_64+0x5b/0x1a0 [ 983.923705] entry_SYSCALL_64_after_hwframe+0x65/0xca [ 983.923707] RIP: 0033:0x7f080f10ee1b When getxattr calls xfs_attr_node_get function, xfs_da3_node_lookup_int fails with EFSCORRUPTED in xfs_attr_node_hasname because we have use blocktrash to random it in xfs/126. So it free state in internal and xfs_attr_node_get doesn't do xfs_buf_trans release job. Then subsequent removexattr will hang because of it. This bug was introduced by kernel commit 07120f1abdff ("xfs: Add xfs_has_attr and subroutines"). It adds xfs_attr_node_hasname helper and said caller will be responsible for freeing the state in this case. But xfs_attr_node_hasname will free state itself instead of caller if xfs_da3_node_lookup_int fails. Fix this bug by moving the step of free state into caller. Also, use "goto error/out" instead of returning error directly in xfs_attr_node_addname_find_attr and xfs_attr_node_removename_setup function because we should free state ourselves. Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines") Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-14Merge tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs cleanups from Darrick Wong: "The most 'exciting' aspect of this branch is that the xfsprogs maintainer and I have worked through the last of the code discrepancies between kernel and userspace libxfs such that there are no code differences between the two except for #includes. IOWs, diff suffices to demonstrate that the userspace tools behave the same as the kernel, and kernel-only bits are clearly marked in the /kernel/ source code instead of just the userspace source. Summary: - Clean up open-coded swap() calls. - A little bit of #ifdef golf to complete the reunification of the kernel and userspace libxfs source code" * tag 'xfs-5.16-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: sync xfs_btree_split macros with userspace libxfs xfs: #ifdef out perag code for userspace xfs: use swap() to make dabtree code cleaner
2021-11-11xfs: sync xfs_btree_split macros with userspace libxfsxfs-5.16-merge-5Darrick J. Wong
Sync this one last bit of discrepancy between kernel and userspace libxfs. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2021-11-10xfs: #ifdef out perag code for userspaceEric Sandeen
The xfs_perag structure and initialization is unused in userspace, so #ifdef it out with __KERNEL__ to facilitate the xfsprogs sync and build. Signed-off-by: Eric Sandeen <esandeen@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-08xfs: use swap() to make dabtree code cleanerYang Guang
Use the macro 'swap()' defined in 'include/linux/minmax.h' to avoid opencoding it. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Yang Guang <yang.guang5@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-11-02Merge tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "This cycle we've worked on fixing bugs and improving XFS' memory footprint. The most notable fixes include: fixing a corruption warning (and free space accounting skew) if copy on write fails; fixing slab cache misuse if SLOB is enabled, which apparently was broken for years without anybody noticing; and fixing a potential race with online shrinkfs. Otherwise, the bulk of the changes here involve setting up separate slab caches for frequently used items such as btree cursors and log intent items, and compacting the structures to reduce memory usage of those items substantially. This also sets us up to support larger btrees in future kernels. We also switch parts of online fsck to allocate scrub context information from the heap instead of using stack space. Summary: - Bug fixes and cleanups for kernel memory allocation usage, this time without touching the mm code. - Refactor the log recovery mechanism that preserves held resources across a transaction roll so that it uses the exact same mechanism that we use for that during regular runtime. - Fix bugs and tighten checking around btree heights. - Remove more old typedefs. - Fix perag reference leaks when racing with growfs. - Remove unused fields from xfs_btree_cur. - Allocate various scrub structures on the heap to reduce stack usage. - Pack xfs_btree_cur fields and rearrange to support arbitrary heights. - Compute maximum possible heights for each btree height, and use that to set up slab caches for each btree type. - Finally remove kmem_zone_t, since these have always been struct kmem_cache on Linux. - Compact the structures used to coordinate work intent items. - Set up slab caches for each work intent item type. - Rename the "bmap_add_free" function to "free_extent_later", which more accurately describes what it does. - Fix corruption warning on unmount when a CoW preallocation covers a data fork delalloc reservation but then the CoW fails. - Add some more minor code improvements" * tag 'xfs-5.16-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (45 commits) xfs: use swap() to make code cleaner xfs: Remove duplicated include in xfs_super xfs: punch out data fork delalloc blocks on COW writeback failure xfs: remove unused parameter from refcount code xfs: reduce the size of struct xfs_extent_free_item xfs: rename xfs_bmap_add_free to xfs_free_extent_later xfs: create slab caches for frequently-used deferred items xfs: compact deferred intent item structures xfs: rename _zone variables to _cache xfs: remove kmem_zone typedef xfs: use separate btree cursor cache for each btree type xfs: compute absolute maximum nlevels for each btree type xfs: kill XFS_BTREE_MAXLEVELS xfs: compute the maximum height of the rmap btree when reflink enabled xfs: clean up xfs_btree_{calc_size,compute_maxlevels} xfs: compute maximum AG btree height for critical reservation calculation xfs: rename m_ag_maxlevels to m_allocbt_maxlevels xfs: dynamically allocate cursors based on maxlevels xfs: encode the max btree height in the cursor xfs: refactor btree cursor allocation function ...
2021-11-02Merge tag 'gfs2-v5.15-rc5-mmap-fault' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 mmap + page fault deadlocks fixes from Andreas Gruenbacher: "Functions gfs2_file_read_iter and gfs2_file_write_iter are both accessing the user buffer to write to or read from while holding the inode glock. In the most basic deadlock scenario, that buffer will not be resident and it will be mapped to the same file. Accessing the buffer will trigger a page fault, and gfs2 will deadlock trying to take the same inode glock again while trying to handle that fault. Fix that and similar, more complex scenarios by disabling page faults while accessing user buffers. To make this work, introduce a small amount of new infrastructure and fix some bugs that didn't trigger so far, with page faults enabled" * tag 'gfs2-v5.15-rc5-mmap-fault' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: Fix mmap + page fault deadlocks for direct I/O iov_iter: Introduce nofault flag to disable page faults gup: Introduce FOLL_NOFAULT flag to disable page faults iomap: Add done_before argument to iomap_dio_rw iomap: Support partial direct I/O on user copy failures iomap: Fix iomap_dio_rw return value for user copies gfs2: Fix mmap + page fault deadlocks for buffered I/O gfs2: Eliminate ip->i_gh gfs2: Move the inode glock locking to gfs2_file_buffered_write gfs2: Introduce flag for glock holder auto-demotion gfs2: Clean up function may_grant gfs2: Add wrapper for iomap_file_buffered_write iov_iter: Introduce fault_in_iov_iter_writeable iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} powerpc/kvm: Fix kvm_use_magic_page iov_iter: Fix iov_iter_get_pages{,_alloc} page fault return value
2021-11-01Merge tag 'kspp-misc-fixes-5.16-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull hardening fixes and cleanups from Gustavo A. R. Silva: "Various hardening fixes and cleanups that I've been collecting during the last development cycle: Fix -Wcast-function-type error: - firewire: Remove function callback casts (Oscar Carter) Fix application of sizeof operator: - firmware/psci: fix application of sizeof to pointer (jing yangyang) Replace open coded instances with size_t saturating arithmetic helpers: - assoc_array: Avoid open coded arithmetic in allocator arguments (Len Baker) - writeback: prefer struct_size over open coded arithmetic (Len Baker) - aio: Prefer struct_size over open coded arithmetic (Len Baker) - dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic (Len Baker) Flexible array transformation: - KVM: PPC: Replace zero-length array with flexible array member (Len Baker) Use 2-factor argument multiplication form: - nouveau/svm: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva) - xfs: Use kvcalloc() instead of kvzalloc() (Gustavo A. R. Silva)" * tag 'kspp-misc-fixes-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: firewire: Remove function callback casts nouveau/svm: Use kvcalloc() instead of kvzalloc() firmware/psci: fix application of sizeof to pointer dmaengine: pxa_dma: Prefer struct_size over open coded arithmetic KVM: PPC: Replace zero-length array with flexible array member aio: Prefer struct_size over open coded arithmetic writeback: prefer struct_size over open coded arithmetic xfs: Use kvcalloc() instead of kvzalloc() assoc_array: Avoid open coded arithmetic in allocator arguments
2021-10-30xfs: use swap() to make code cleanerxfs-5.16-merge-4Changcheng Deng
Use swap() in order to make code cleaner. Issue found by coccinelle. Reported-by: Zeal Robot <zealci@zte.com.cn> Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-30xfs: Remove duplicated include in xfs_superWan Jiabing
Fix following checkincludes.pl warning: ./fs/xfs/xfs_super.c: xfs_btree.h is included more than once. The include is in line 15. Remove the duplicated here. Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-24iomap: Add done_before argument to iomap_dio_rwAndreas Gruenbacher
Add a done_before argument to iomap_dio_rw that indicates how much of the request has already been transferred. When the request succeeds, we report that done_before additional bytes were tranferred. This is useful for finishing a request asynchronously when part of the request has already been completed synchronously. We'll use that to allow iomap_dio_rw to be used with page faults disabled: when a page fault occurs while submitting a request, we synchronously complete the part of the request that has already been submitted. The caller can then take care of the page fault and call iomap_dio_rw again for the rest of the request, passing in the number of bytes already tranferred. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-10-22xfs: punch out data fork delalloc blocks on COW writeback failurexfs-5.16-merge-3Brian Foster
If writeback I/O to a COW extent fails, the COW fork blocks are punched out and the data fork blocks left alone. It is possible for COW fork blocks to overlap non-shared data fork blocks (due to cowextsz hint prealloc), however, and writeback unconditionally maps to the COW fork whenever blocks exist at the corresponding offset of the page undergoing writeback. This means it's quite possible for a COW fork extent to overlap delalloc data fork blocks, writeback to convert and map to the COW fork blocks, writeback to fail, and finally for ioend completion to cancel the COW fork blocks and leave stale data fork delalloc blocks around in the inode. The blocks are effectively stale because writeback failure also discards dirty page state. If this occurs, it is likely to trigger assert failures, free space accounting corruption and failures in unrelated file operations. For example, a subsequent reflink attempt of the affected file to a new target file will trip over the stale delalloc in the source file and fail. Several of these issues are occasionally reproduced by generic/648, but are reproducible on demand with the right sequence of operations and timely I/O error injection. To fix this problem, update the ioend failure path to also punch out underlying data fork delalloc blocks on I/O error. This is analogous to the writeback submission failure path in xfs_discard_page() where we might fail to map data fork delalloc blocks and consistent with the successful COW writeback completion path, which is responsible for unmapping from the data fork and remapping in COW fork blocks. Fixes: 787eb485509f ("xfs: fix and streamline error handling in xfs_end_io") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-10-22xfs: remove unused parameter from refcount codeDarrick J. Wong
The owner info parameter is always NULL, so get rid of the parameter. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: reduce the size of struct xfs_extent_free_itemDarrick J. Wong
We only use EFIs to free metadata blocks -- not regular data/attr fork extents. Remove all the fields that we never use, for a net reduction of 16 bytes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: rename xfs_bmap_add_free to xfs_free_extent_laterDarrick J. Wong
xfs_bmap_add_free isn't a block mapping function; it schedules deferred freeing operations for a later point in a compound transaction chain. While it's primarily used by bunmapi, its use has expanded beyond that. Move it to xfs_alloc.c and rename the function since it's now general freeing functionality. Bring the slab cache bits in line with the way we handle the other intent items. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: create slab caches for frequently-used deferred itemsDarrick J. Wong
Create slab caches for the high-level structures that coordinate deferred intent items, since they're used fairly heavily. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: compact deferred intent item structuresDarrick J. Wong
Rearrange these structs to reduce the amount of unused padding bytes. This saves eight bytes for each of the three structs changed here, which means they're now all (rmap/bmap are 64 bytes, refc is 32 bytes) even powers of two. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: rename _zone variables to _cacheDarrick J. Wong
Now that we've gotten rid of the kmem_zone_t typedef, rename the variables to _cache since that's what they are. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-22xfs: remove kmem_zone typedefDarrick J. Wong
Remove these typedefs by referencing kmem_cache directly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-10-20xfs: Use kvcalloc() instead of kvzalloc()Gustavo A. R. Silva
Use 2-factor argument multiplication form kvcalloc() instead of kvzalloc(). Link: https://github.com/KSPP/linux/issues/162 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2021-10-19xfs: use separate btree cursor cache for each btree typeDarrick J. Wong
Now that we have the infrastructure to track the max possible height of each btree type, we can create a separate slab cache for cursors of each type of btree. For smaller indices like the free space btrees, this means that we can pack more cursors into a slab page, improving slab utilization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: compute absolute maximum nlevels for each btree typeDarrick J. Wong
Add code for all five btree types so that we can compute the absolute maximum possible btree height for each btree type. This is a setup for the next patch, which makes every btree type have its own cursor cache. The functions are exported so that we can have xfs_db report the absolute maximum btree heights for each btree type, rather than making everyone run their own ad-hoc computations. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: kill XFS_BTREE_MAXLEVELSDarrick J. Wong
Nobody uses this symbol anymore, so kill it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: compute the maximum height of the rmap btree when reflink enabledDarrick J. Wong
Instead of assuming that the hardcoded XFS_BTREE_MAXLEVELS value is big enough to handle the maximally tall rmap btree when all blocks are in use and maximally shared, let's compute the maximum height assuming the rmapbt consumes as many blocks as possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: clean up xfs_btree_{calc_size,compute_maxlevels}Darrick J. Wong
During review of the next patch, Dave remarked that he found these two btree geometry calculation functions lacking in documentation and that they performed more work than was really necessary. These functions take the same parameters and have nearly the same logic; the only real difference is in the return values. Reword the function comment to make it clearer what each function does, and move them to be adjacent to reinforce their relation. Clean up both of them to stop opencoding the howmany functions, stop using the uint typedefs, and make them both support computations for more than 2^32 leaf records, since we're going to need all of the above for files with large data forks and large rmap btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: compute maximum AG btree height for critical reservation calculationDarrick J. Wong
Compute the actual maximum AG btree height for deciding if a per-AG block reservation is critically low. This only affects the sanity check condition, since we /generally/ will trigger on the 10% threshold. This is a long-winded way of saying that we're removing one more usage of XFS_BTREE_MAXLEVELS. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: rename m_ag_maxlevels to m_allocbt_maxlevelsDarrick J. Wong
Years ago when XFS was thought to be much more simple, we introduced m_ag_maxlevels to specify the maximum btree height of per-AG btrees for a given filesystem mount. Then we observed that inode btrees don't actually have the same height and split that off; and now we have rmap and refcount btrees with much different geometries and separate maxlevels variables. The 'ag' part of the name doesn't make much sense anymore, so rename this to m_alloc_maxlevels to reinforce that this is the maximum height of the *free space* btrees. This sets us up for the next patch, which will add a variable to track the maximum height of all AG btrees. (Also take the opportunity to improve adjacent comments and fix minor style problems.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: dynamically allocate cursors based on maxlevelsDarrick J. Wong
To support future btree code, we need to be able to size btree cursors dynamically for very large btrees. Switch the maxlevels computation to use the precomputed values in the superblock, and create cursors that can handle a certain height. For now, we retain the btree cursor cache that can handle up to 9-level btrees, though a subsequent patch introduces separate caches for each btree type, where each cache's objects will be exactly tall enough to handle the specific btree type. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: encode the max btree height in the cursorDarrick J. Wong
Encode the maximum btree height in the cursor, since we're soon going to allow smaller cursors for AG btrees and larger cursors for file btrees. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: refactor btree cursor allocation functionDarrick J. Wong
Refactor btree allocation to a common helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: rearrange xfs_btree_cur fields for better packingDarrick J. Wong
Reduce the size of the btree cursor structure some more by rearranging fields to eliminate unused space. While we're at it, fix the ragged indentation and a spelling error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: prepare xfs_btree_cur for dynamic cursor heightsDarrick J. Wong
Split out the btree level information into a separate struct and put it at the end of the cursor structure as a VLA. Files with huge data forks (and in the future, the realtime rmap btree) will require the ability to support many more levels than a per-AG btree cursor, which means that we're going to create per-btree type cursor caches to conserve memory for the more common case. Note that a subsequent patch actually introduces dynamic cursor heights. This one merely rearranges the structure to prepare for that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>