summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-01-17xfs: remove the XFS_IOC_FSSETDM definitionsxfs-5.17-merge-4Darrick J. Wong
Remove the definitions for these ioctls, since the functionality (and, weirdly, the 32-bit compat ioctl definitions) were removed from the kernel in November 2019. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-12xfs: fix online fsck handling of v5 feature bits on secondary supersxfs-5.17-merge-3Darrick J. Wong
While I was auditing the code in xfs_repair that adds feature bits to existing V5 filesystems, I decided to have a look at how online fsck handles feature bits, and I found a few problems: 1) ATTR2 is added to the primary super when an xattr is set to a file, but that isn't consistently propagated to secondary supers. This isn't a corruption, merely a discrepancy that repair will fix if it ever has to restore the primary from a secondary. Hence, if we find a mismatch on a secondary, this is a preen condition, not a corruption. 2) There are more compat and ro_compat features now than there used to be, but we mask off the newer features from testing. This means we ignore inconsistencies in the INOBTCOUNT and BIGTIME features, which is wrong. Get rid of the masking and compare directly. 3) NEEDSREPAIR, when set on a secondary, is ignored by everyone. Hence a mismatch here should also be flagged for preening, and online repair should clear the flag. Right now we ignore it due to (2). 4) log_incompat features are ephemeral, since we can clear the feature bit as soon as the log no longer contains live records for a particular log feature. As such, the only copy we care about is the one in the primary super. If we find any bits set in the secondary super, we should flag that for preening, and clear the bits if the user elects to repair it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-11xfs: take the ILOCK when readdir inspects directory mapping dataxfs-5.17-merge_2022-01-11Darrick J. Wong
I was poking around in the directory code while diagnosing online fsck bugs, and noticed that xfs_readdir doesn't actually take the directory ILOCK when it calls xfs_dir2_isblock. xfs_dir_open most probably loaded the data fork mappings and the VFS took i_rwsem (aka IOLOCK_SHARED) so we're protected against writer threads, but we really need to follow the locking model like we do in other places. To avoid unnecessarily cycling the ILOCK for fairly small directories, change the block/leaf _getdents functions to consume the ILOCK hold that the parent readdir function took to decide on a _getdents implementation. It is ok to cycle the ILOCK in readdir because the VFS takes the IOLOCK in the appropriate mode during lookups and writes, and we don't want to be holding the ILOCK when we copy directory entries to userspace in case there's a page fault. We really only need it to protect against data fork lookups, like we do for other files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: warn about inodes with project id of -1xfs-5.17-merge-2Darrick J. Wong
Inodes aren't supposed to have a project id of -1U (aka 4294967295) but the kernel hasn't always validated FSSETXATTR correctly. Flag this as something for the sysadmin to check out. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: hold quota inode ILOCK_EXCL until the end of dqallocDarrick J. Wong
Online fsck depends on callers holding ILOCK_EXCL from the time they decide to update a block mapping until after they've updated the reverse mapping records to guarantee the stability of both mapping records. Unfortunately, the quota code drops ILOCK_EXCL at the first transaction roll in the dquot allocation process, which breaks that assertion. This leads to sporadic failures in the online rmap repair code if the repair code grabs the AGF after bmapi_write maps a new block into the quota file's data fork but before it can finish the deferred rmap update. Fix this by rewriting the function to hold the ILOCK until after the transaction commit like all other bmap updates do, and get rid of the dqread wrapper that does nothing but complicate the codebase. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-01-06xfs: Remove redundant assignment of mpJiapeng Chong
mp is being initialized to log->l_mp but this is never read as record is overwritten later on. Remove the redundant assignment. Cleans up the following clang-analyzer warning: fs/xfs/xfs_log_recover.c:3543:20: warning: Value stored to 'mp' during its initialization is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06xfs: reduce kvmalloc overhead for CIL shadow buffersDave Chinner
Oh, let me count the ways that the kvmalloc API sucks dog eggs. The problem is when we are logging lots of large objects, we hit kvmalloc really damn hard with costly order allocations, and behaviour utterly sucks: - 49.73% xlog_cil_commit - 31.62% kvmalloc_node - 29.96% __kmalloc_node - 29.38% kmalloc_large_node - 29.33% __alloc_pages - 24.33% __alloc_pages_slowpath.constprop.0 - 18.35% __alloc_pages_direct_compact - 17.39% try_to_compact_pages - compact_zone_order - 15.26% compact_zone 5.29% __pageblock_pfn_to_page 3.71% PageHuge - 1.44% isolate_migratepages_block 0.71% set_pfnblock_flags_mask 1.11% get_pfnblock_flags_mask - 0.81% get_page_from_freelist - 0.59% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 3.24% try_to_free_pages - 3.14% shrink_node - 2.94% shrink_slab.constprop.0 - 0.89% super_cache_count - 0.66% xfs_fs_nr_cached_objects - 0.65% xfs_reclaim_inodes_count 0.55% xfs_perag_get_tag 0.58% kfree_rcu_shrink_count - 2.09% get_page_from_freelist - 1.03% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 4.88% get_page_from_freelist - 3.66% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 1.63% __vmalloc_node - __vmalloc_node_range - 1.10% __alloc_pages_bulk - 0.93% __alloc_pages - 0.92% get_page_from_freelist - 0.89% rmqueue_bulk - 0.69% _raw_spin_lock - do_raw_spin_lock __pv_queued_spin_lock_slowpath 13.73% memcpy_erms - 2.22% kvfree On this workload, that's almost a dozen CPUs all trying to compact and reclaim memory inside kvmalloc_node at the same time. Yet it is regularly falling back to vmalloc despite all that compaction, page and shrinker reclaim that direct reclaim is doing. Copying all the metadata is taking far less CPU time than allocating the storage! Direct reclaim should be considered extremely harmful. This is a high frequency, high throughput, CPU usage and latency sensitive allocation. We've got memory there, and we're using kvmalloc to allow memory allocation to avoid doing lots of work to try to do contiguous allocations. Except it still does *lots of costly work* that is unnecessary. Worse: the only way to avoid the slowpath page allocation trying to do compaction on costly allocations is to turn off direct reclaim (i.e. remove __GFP_RECLAIM_DIRECT from the gfp flags). Unfortunately, the stupid kvmalloc API then says "oh, this isn't a GFP_KERNEL allocation context, so you only get kmalloc!". This cuts off the vmalloc fallback, and this leads to almost instant OOM problems which ends up in filesystems deadlocks, shutdowns and/or kernel crashes. I want some basic kvmalloc behaviour: - kmalloc for a contiguous range with fail fast semantics - no compaction direct reclaim if the allocation enters the slow path. - run normal vmalloc (i.e. GFP_KERNEL) if kmalloc fails The really, really stupid part about this is these kvmalloc() calls are run under memalloc_nofs task context, so all the allocations are always reduced to GFP_NOFS regardless of the fact that kvmalloc requires GFP_KERNEL to be passed in. IOWs, we're already telling kvmalloc to behave differently to the gfp flags we pass in, but it still won't allow vmalloc to be run with anything other than GFP_KERNEL. So, this patch open codes the kvmalloc() in the commit path to have the above described behaviour. The result is we more than halve the CPU time spend doing kvmalloc() in this path and transaction commits with 64kB objects in them more than doubles. i.e. we get ~5x reduction in CPU usage per costly-sized kvmalloc() invocation and the profile looks like this: - 37.60% xlog_cil_commit 16.01% memcpy_erms - 8.45% __kmalloc - 8.04% kmalloc_order_trace - 8.03% kmalloc_order - 7.93% alloc_pages - 7.90% __alloc_pages - 4.05% __alloc_pages_slowpath.constprop.0 - 2.18% get_page_from_freelist - 1.77% wake_all_kswapds .... - __wake_up_common_lock - 0.94% _raw_spin_lock_irqsave - 3.72% get_page_from_freelist - 2.43% _raw_spin_lock_irqsave - 5.72% vmalloc - 5.72% __vmalloc_node_range - 4.81% __get_vm_area_node.constprop.0 - 3.26% alloc_vmap_area - 2.52% _raw_spin_lock - 1.46% _raw_spin_lock 0.56% __alloc_pages_bulk - 4.66% kvfree - 3.25% vfree - __vfree - 3.23% __vunmap - 1.95% remove_vm_area - 1.06% free_vmap_area_noflush - 0.82% _raw_spin_lock - 0.68% _raw_spin_lock - 0.92% _raw_spin_lock - 1.40% kfree - 1.36% __free_pages - 1.35% __free_pages_ok - 1.02% _raw_spin_lock_irqsave It's worth noting that over 50% of the CPU time spent allocating these shadow buffers is now spent on spinlocks. So the shadow buffer allocation overhead is greatly reduced by getting rid of direct reclaim from kmalloc, and could probably be made even less costly if vmalloc() didn't use global spinlocks to protect it's structures. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-01-06xfs: sysfs: use default_groups in kobj_typeGreg Kroah-Hartman
There are currently 2 ways to create a set of sysfs files for a kobj_type, through the default_attrs field, and the default_groups field. Move the xfs sysfs code to use default_groups field which has been the preferred way since aa30f47cf666 ("kobject: Add support for default attribute groups to kobj_type") so that we can soon get rid of the obsolete default_attrs field. Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: linux-xfs@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-22xfs: prevent UAF in xfs_log_item_in_current_chkptDarrick J. Wong
While I was running with KASAN and lockdep enabled, I stumbled upon an KASAN report about a UAF to a freed CIL checkpoint. Looking at the comment for xfs_log_item_in_current_chkpt, it seems pretty obvious to me that the original patch to xfs_defer_finish_noroll should have done something to lock the CIL to prevent it from switching the CIL contexts while the predicate runs. For upper level code that needs to know if a given log item is new enough not to need relogging, add a new wrapper that takes the CIL context lock long enough to sample the current CIL context. This is kind of racy in that the CIL can switch the contexts immediately after sampling, but that's ok because the consequence is that the defer ops code is a little slow to relog items. ================================================================== BUG: KASAN: use-after-free in xfs_log_item_in_current_chkpt+0x139/0x160 [xfs] Read of size 8 at addr ffff88804ea5f608 by task fsstress/527999 CPU: 1 PID: 527999 Comm: fsstress Tainted: G D 5.16.0-rc4-xfsx #rc4 Call Trace: <TASK> dump_stack_lvl+0x45/0x59 print_address_description.constprop.0+0x1f/0x140 kasan_report.cold+0x83/0xdf xfs_log_item_in_current_chkpt+0x139/0x160 xfs_defer_finish_noroll+0x3bb/0x1e30 __xfs_trans_commit+0x6c8/0xcf0 xfs_reflink_remap_extent+0x66f/0x10e0 xfs_reflink_remap_blocks+0x2dd/0xa90 xfs_file_remap_range+0x27b/0xc30 vfs_dedupe_file_range_one+0x368/0x420 vfs_dedupe_file_range+0x37c/0x5d0 do_vfs_ioctl+0x308/0x1260 __x64_sys_ioctl+0xa1/0x170 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f2c71a2950b Code: 0f 1e fa 48 8b 05 85 39 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 55 39 0d 00 f7 d8 64 89 01 48 RSP: 002b:00007ffe8c0e03c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00005600862a8740 RCX: 00007f2c71a2950b RDX: 00005600862a7be0 RSI: 00000000c0189436 RDI: 0000000000000004 RBP: 000000000000000b R08: 0000000000000027 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000005a R13: 00005600862804a8 R14: 0000000000016000 R15: 00005600862a8a20 </TASK> Allocated by task 464064: kasan_save_stack+0x1e/0x50 __kasan_kmalloc+0x81/0xa0 kmem_alloc+0xcd/0x2c0 [xfs] xlog_cil_ctx_alloc+0x17/0x1e0 [xfs] xlog_cil_push_work+0x141/0x13d0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Freed by task 51: kasan_save_stack+0x1e/0x50 kasan_set_track+0x21/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0xed/0x130 slab_free_freelist_hook+0x7f/0x160 kfree+0xde/0x340 xlog_cil_committed+0xbfd/0xfe0 [xfs] xlog_cil_process_committed+0x103/0x1c0 [xfs] xlog_state_do_callback+0x45d/0xbd0 [xfs] xlog_ioend_work+0x116/0x1c0 [xfs] process_one_work+0x7f6/0x1380 worker_thread+0x59d/0x1040 kthread+0x3b0/0x490 ret_from_fork+0x1f/0x30 Last potentially related work creation: kasan_save_stack+0x1e/0x50 __kasan_record_aux_stack+0xb7/0xc0 insert_work+0x48/0x2e0 __queue_work+0x4e7/0xda0 queue_work_on+0x69/0x80 xlog_cil_push_now.isra.0+0x16b/0x210 [xfs] xlog_cil_force_seq+0x1b7/0x850 [xfs] xfs_log_force_seq+0x1c7/0x670 [xfs] xfs_file_fsync+0x7c1/0xa60 [xfs] __x64_sys_fsync+0x52/0x80 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff88804ea5f600 which belongs to the cache kmalloc-256 of size 256 The buggy address is located 8 bytes inside of 256-byte region [ffff88804ea5f600, ffff88804ea5f700) The buggy address belongs to the page: page:ffffea00013a9780 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804ea5ea00 pfn:0x4ea5e head:ffffea00013a9780 order:1 compound_mapcount:0 flags: 0x4fff80000010200(slab|head|node=1|zone=1|lastcpupid=0xfff) raw: 04fff80000010200 ffffea0001245908 ffffea00011bd388 ffff888004c42b40 raw: ffff88804ea5ea00 0000000000100009 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88804ea5f500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88804ea5f580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88804ea5f600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88804ea5f680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88804ea5f700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ================================================================== Fixes: 4e919af7827a ("xfs: periodically relog deferred intent items") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: prevent a WARN_ONCE() in xfs_ioc_attr_list()xfs-5.17-merge-1Dan Carpenter
The "bufsize" comes from the root user. If "bufsize" is negative then, because of type promotion, neither of the validation checks at the start of the function are able to catch it: if (bufsize < sizeof(struct xfs_attrlist) || bufsize > XFS_XATTR_LIST_MAX) return -EINVAL; This means "bufsize" will trigger (WARN_ON_ONCE(size > INT_MAX)) in kvmalloc_node(). Fix this by changing the type from int to size_t. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: Fix comments mentioning xfs_iallocYang Xu
Since kernel commit 1abcf261016e ("xfs: move on-disk inode allocation out of xfs_ialloc()"), xfs_ialloc has been renamed to xfs_init_new_inode. So update this in comments. Signed-off-by: Yang Xu <xuyang2018.jy@fujitsu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: check sb_meta_uuid for dabuf buffer recoveryDave Chinner
Got a report that a repeated crash test of a container host would eventually fail with a log recovery error preventing the system from mounting the root filesystem. It manifested as a directory leaf node corruption on writeback like so: XFS (loop0): Mounting V5 Filesystem XFS (loop0): Starting recovery (logdev: internal) XFS (loop0): Metadata corruption detected at xfs_dir3_leaf_check_int+0x99/0xf0, xfs_dir3_leaf1 block 0x12faa158 XFS (loop0): Unmount and run xfs_repair XFS (loop0): First 128 bytes of corrupted metadata buffer: 00000000: 00 00 00 00 00 00 00 00 3d f1 00 00 e1 9e d5 8b ........=....... 00000010: 00 00 00 00 12 fa a1 58 00 00 00 29 00 00 1b cc .......X...).... 00000020: 91 06 78 ff f7 7e 4a 7d 8d 53 86 f2 ac 47 a8 23 ..x..~J}.S...G.# 00000030: 00 00 00 00 17 e0 00 80 00 43 00 00 00 00 00 00 .........C...... 00000040: 00 00 00 2e 00 00 00 08 00 00 17 2e 00 00 00 0a ................ 00000050: 02 35 79 83 00 00 00 30 04 d3 b4 80 00 00 01 50 .5y....0.......P 00000060: 08 40 95 7f 00 00 02 98 08 41 fe b7 00 00 02 d4 .@.......A...... 00000070: 0d 62 ef a7 00 00 01 f2 14 50 21 41 00 00 00 0c .b.......P!A.... XFS (loop0): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_buf.c:1514). Shutting down. XFS (loop0): Please unmount the filesystem and rectify the problem(s) XFS (loop0): log mount/recovery failed: error -117 XFS (loop0): log mount failed Tracing indicated that we were recovering changes from a transaction at LSN 0x29/0x1c16 into a buffer that had an LSN of 0x29/0x1d57. That is, log recovery was overwriting a buffer with newer changes on disk than was in the transaction. Tracing indicated that we were hitting the "recovery immediately" case in xfs_buf_log_recovery_lsn(), and hence it was ignoring the LSN in the buffer. The code was extracting the LSN correctly, then ignoring it because the UUID in the buffer did not match the superblock UUID. The problem arises because the UUID check uses the wrong UUID - it should be checking the sb_meta_uuid, not sb_uuid. This filesystem has sb_uuid != sb_meta_uuid (which is fine), and the buffer has the correct matching sb_meta_uuid in it, it's just the code checked it against the wrong superblock uuid. The is no corruption in the filesystem, and failing to recover the buffer due to a write verifier failure means the recovery bug did not propagate the corruption to disk. Hence there is no corruption before or after this bug has manifested, the impact is limited simply to an unmountable filesystem.... This was missed back in 2015 during an audit of incorrect sb_uuid usage that resulted in commit fcfbe2c4ef42 ("xfs: log recovery needs to validate against sb_meta_uuid") that fixed the magic32 buffers to validate against sb_meta_uuid instead of sb_uuid. It missed the magicda buffers.... Fixes: ce748eaa65f2 ("xfs: create new metadata UUID field and incompat flag") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-12-21xfs: fix a bug in the online fsck directory leaf1 bestcount checkDarrick J. Wong
When xfs_scrub encounters a directory with a leaf1 block, it tries to validate that the leaf1 block's bestcount (aka the best free count of each directory data block) is the correct size. Previously, this author believed that comparing bestcount to the directory isize (since directory data blocks are under isize, and leaf/bestfree blocks are above it) was sufficient. Unfortunately during testing of online repair, it was discovered that it is possible to create a directory with a hole between the last directory block and isize. The directory code seems to handle this situation just fine and xfs_repair doesn't complain, which effectively makes this quirk part of the disk format. Fix the check to work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: only run COW extent recovery when there are no live extentsDarrick J. Wong
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. In the first patch, we fixed the race condition in (2) so that (A) will always flush the COW fork. In this second patch, we move the _recover_cow call to the initial mount call in (0) for safety. As mentioned previously, xfs_reflink_recover_cow walks the refcount btree looking for COW staging extents, and frees them. This was intended to be run at mount time (when we know there are no live inodes) to clean up any leftover staging events that may have been left behind during an unclean shutdown. As a time "optimization" for readonly mounts, we deferred this to the ro->rw transition, not realizing that any failure to clean all COW forks during a rw->ro transition would result in catastrophic corruption. Therefore, remove this optimization and only run the recovery routine when we're guaranteed not to have any COW staging extents anywhere, which means we always run this at mount time. While we're at it, move the callsite to xfs_log_mount_finish because any refcount btree expansion (however unlikely given that we're removing records from the right side of the index) must be fed by a per-AG reservation, which doesn't exist in its current location. Fixes: 174edb0e46e5 ("xfs: store in-progress CoW allocations in the refcount btree") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: don't expose internal symlink metadata buffers to the vfsDarrick J. Wong
Ian Kent reported that for inline symlinks, it's possible for vfs_readlink to hang on to the target buffer returned by _vn_get_link_inline long after it's been freed by xfs inode reclaim. This is a layering violation -- we should never expose XFS internals to the VFS. When the symlink has a remote target, we allocate a separate buffer, copy the internal information, and let the VFS manage the new buffer's lifetime. Let's adapt the inline code paths to do this too. It's less efficient, but fixes the layering violation and avoids the need to adapt the if_data lifetime to rcu rules. Clearly I don't care about readlink benchmarks. As a side note, this fixes the minor locking violation where we can access the inode data fork without taking any locks; proper locking (and eliminating the possibility of having to switch inode_operations on a live inode) is essential to online repair coordinating repairs correctly. Reported-by: Ian Kent <raven@themaw.net> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: fix quotaoff mutex usage now that we don't support disabling itDarrick J. Wong
Prior to commit 40b52225e58c ("xfs: remove support for disabling quota accounting on a mounted file system"), we used the quotaoff mutex to protect dquot operations against quotaoff trying to pull down dquots as part of disabling quota. Now that we only support turning off quota enforcement, the quotaoff mutex only protects changes in m_qflags/sb_qflags. We don't need it to protect dquots, which means we can remove it from setqlimits and the dquot scrub code. While we're at it, fix the function that forces quotacheck, since it should have been taking the quotaoff mutex. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-21xfs: shut down filesystem if we xfs_trans_cancel with deferred work itemsDarrick J. Wong
While debugging some very strange rmap corruption reports in connection with the online directory repair code. I root-caused the error to the following incorrect sequence: <start repair transaction> <expand directory, causing a deferred rmap to be queued> <roll transaction> <cancel transaction> Obviously, we should have committed the transaction instead of cancelling it. Thinking more broadly, however, xfs_trans_cancel should have warned us that we were throwing away work item that we already committed to performing. This is not correct, and we need to shut down the filesystem. Change xfs_trans_cancel to complain in the loudest manner if we're cancelling any transaction with deferred work items attached. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-12-11Merge tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "This fixes a race between a readonly remount process and other processes that hold a file IOLOCK on files that previously experienced copy on write, that could result in severe filesystem corruption if the filesystem is then remounted rw. I think this is fairly rare (since the only reliable reproducer I have that fits the second criteria is the experimental xfs_scrub program), but the race is clear, so we still need to fix this. Summary: - Fix a data corruption vector that can result from the ro remount process failing to clear all speculative preallocations from files and the rw remount process not noticing the incomplete cleanup" * tag 'xfs-5.16-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove all COW fork extents when remounting readonly
2021-12-11Merge tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "A few fixes that are all bound for stable: - Two syzbot reports for io-wq that turned out to be separate fixes, but ultimately very closely related - io_uring task_work running on cancelations" * tag 'io_uring-5.16-2021-12-10' of git://git.kernel.dk/linux-block: io-wq: check for wq exit after adding new worker task_work io_uring: ensure task_work gets run as part of cancelations io-wq: remove spurious bit clear on task_work addition
2021-12-10Merge tag 'for-5.16-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more regression fixes and stable patches, mostly one-liners. Regression fixes: - fix pointer/ERR_PTR mismatch returned from memdup_user - reset dedicated zoned mode relocation block group to avoid using it and filling it without any recourse Fixes: - handle a case to FITRIM range (also to make fstests/generic/260 work) - fix warning when extent buffer state and pages get out of sync after an IO error - fix transaction abort when syncing due to missing mapping error set on metadata inode after inlining a compressed file - fix transaction abort due to tree-log and zoned mode interacting in an unexpected way - fix memory leak of additional extent data when qgroup reservation fails - do proper handling of slot search call when deleting root refs" * tag 'for-5.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling btrfs: zoned: clear data relocation bg on zone finish btrfs: free exchange changeset on failures btrfs: fix re-dirty process of tree-log nodes btrfs: call mapping_set_error() on btree inode with a write error btrfs: clear extent buffer uptodate when we fail to write it btrfs: fail if fstrim_range->start == U64_MAX btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()
2021-12-10Merge tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Two cifs/smb3 fixes - one for stable, the other fixes a recently reported NTLMSSP auth problem" * tag '5.16-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix ntlmssp auth when there is no key exchange cifs: Fix crash on unload of cifs_arc4.ko
2021-12-10Merge tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Fix a race on startup and another in the delegation code. The latter has been around for years, but I suspect recent changes may have widened the race window a little, so I'd like to go ahead and get it in" * tag 'nfsd-5.16-2' of git://linux-nfs.org/~bfields/linux: nfsd: fix use-after-free due to delegation race nfsd: Fix nsfd startup race (again)
2021-12-10Merge tag 'trace-v5.16-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: "Tracing, ftrace and tracefs fixes: - Have tracefs honor the gid mount option - Have new files in tracefs inherit the parent ownership - Have direct_ops unregister when it has no more functions - Properly clean up the ops when unregistering multi direct ops - Add a sample module to test the multiple direct ops - Fix memory leak in error path of __create_synth_event()" * tag 'trace-v5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix possible memory leak in __create_synth_event() error path ftrace/samples: Add module to test multi direct modify interface ftrace: Add cleanup to unregister_ftrace_direct_multi ftrace: Use direct_ops hash in unregister_ftrace_direct tracefs: Set all files to the same group ownership as the mount option tracefs: Have new files inherit the ownership of their parent
2021-12-10Merge tag 'aio-poll-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux Pull aio poll fixes from Eric Biggers: "Fix three bugs in aio poll, and one issue with POLLFREE more broadly: - aio poll didn't handle POLLFREE, causing a use-after-free. - aio poll could block while the file is ready. - aio poll called eventfd_signal() when it isn't allowed. - POLLFREE didn't handle multiple exclusive waiters correctly. This has been tested with the libaio test suite, as well as with test programs I wrote that reproduce the first two bugs. I am sending this pull request myself as no one seems to be maintaining this code" * tag 'aio-poll-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: aio: Fix incorrect usage of eventfd_signal_allowed() aio: fix use-after-free due to missing POLLFREE handling aio: keep poll requests on waitqueue until completed signalfd: use wake_up_pollfree() binder: use wake_up_pollfree() wait: add wake_up_pollfree()
2021-12-10io-wq: check for wq exit after adding new worker task_workio_uring-5.16-2021-12-10Jens Axboe
We check IO_WQ_BIT_EXIT before attempting to create a new worker, and wq exit cancels pending work if we have any. But it's possible to have a race between the two, where creation checks exit finding it not set, but we're in the process of exiting. The exit side will cancel pending creation task_work, but there's a gap where we add task_work after we've canceled existing creations at exit time. Fix this by checking the EXIT bit post adding the creation task_work. If it's set, run the same cancelation that exit does. Reported-and-tested-by: syzbot+b60c982cb0efc5e05a47@syzkaller.appspotmail.com Reviewed-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-10io_uring: ensure task_work gets run as part of cancelationsJens Axboe
If we successfully cancel a work item but that work item needs to be processed through task_work, then we can be sleeping uninterruptibly in io_uring_cancel_generic() and never process it. Hence we don't make forward progress and we end up with an uninterruptible sleep warning. While in there, correct a comment that should be IFF, not IIF. Reported-and-tested-by: syzbot+21e6887c0be14181206d@syzkaller.appspotmail.com Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-10nfsd: fix use-after-free due to delegation raceJ. Bruce Fields
A delegation break could arrive as soon as we've called vfs_setlease. A delegation break runs a callback which immediately (in nfsd4_cb_recall_prepare) adds the delegation to del_recall_lru. If we then exit nfs4_set_delegation without hashing the delegation, it will be freed as soon as the callback is done with it, without ever being removed from del_recall_lru. Symptoms show up later as use-after-free or list corruption warnings, usually in the laundromat thread. I suspect aba2072f4523 "nfsd: grant read delegations to clients holding writes" made this bug easier to hit, but I looked as far back as v3.0 and it looks to me it already had the same problem. So I'm not sure where the bug was introduced; it may have been there from the beginning. Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-12-10nfsd: Fix nsfd startup race (again)Alexander Sverdlin
Commit bd5ae9288d64 ("nfsd: register pernet ops last, unregister first") has re-opened rpc_pipefs_event() race against nfsd_net_id registration (register_pernet_subsys()) which has been fixed by commit bb7ffbf29e76 ("nfsd: fix nsfd startup race triggering BUG_ON"). Restore the order of register_pernet_subsys() vs register_cld_notifier(). Add WARN_ON() to prevent a future regression. Crash info: Unable to handle kernel NULL pointer dereference at virtual address 0000000000000012 CPU: 8 PID: 345 Comm: mount Not tainted 5.4.144-... #1 pc : rpc_pipefs_event+0x54/0x120 [nfsd] lr : rpc_pipefs_event+0x48/0x120 [nfsd] Call trace: rpc_pipefs_event+0x54/0x120 [nfsd] blocking_notifier_call_chain rpc_fill_super get_tree_keyed rpc_fs_get_tree vfs_get_tree do_mount ksys_mount __arm64_sys_mount el0_svc_handler el0_svc Fixes: bd5ae9288d64 ("nfsd: register pernet ops last, unregister first") Cc: stable@vger.kernel.org Signed-off-by: Alexander Sverdlin <alexander.sverdlin@nokia.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2021-12-09aio: Fix incorrect usage of eventfd_signal_allowed()Xie Yongji
We should defer eventfd_signal() to the workqueue when eventfd_signal_allowed() return false rather than return true. Fixes: b542e383d8c0 ("eventfd: Make signal recursion protection a task bit") Signed-off-by: Xie Yongji <xieyongji@bytedance.com> Link: https://lore.kernel.org/r/20210913111928.98-1-xieyongji@bytedance.com Reviewed-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09aio: fix use-after-free due to missing POLLFREE handlingEric Biggers
signalfd_poll() and binder_poll() are special in that they use a waitqueue whose lifetime is the current task, rather than the struct file as is normally the case. This is okay for blocking polls, since a blocking poll occurs within one task; however, non-blocking polls require another solution. This solution is for the queue to be cleared before it is freed, by sending a POLLFREE notification to all waiters. Unfortunately, only eventpoll handles POLLFREE. A second type of non-blocking poll, aio poll, was added in kernel v4.18, and it doesn't handle POLLFREE. This allows a use-after-free to occur if a signalfd or binder fd is polled with aio poll, and the waitqueue gets freed. Fix this by making aio poll handle POLLFREE. A patch by Ramji Jiyani <ramjiyani@google.com> (https://lore.kernel.org/r/20211027011834.2497484-1-ramjiyani@google.com) tried to do this by making aio_poll_wake() always complete the request inline if POLLFREE is seen. However, that solution had two bugs. First, it introduced a deadlock, as it unconditionally locked the aio context while holding the waitqueue lock, which inverts the normal locking order. Second, it didn't consider that POLLFREE notifications are missed while the request has been temporarily de-queued. The second problem was solved by my previous patch. This patch then properly fixes the use-after-free by handling POLLFREE in a deadlock-free way. It does this by taking advantage of the fact that freeing of the waitqueue is RCU-delayed, similar to what eventpoll does. Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL") Cc: <stable@vger.kernel.org> # v4.18+ Link: https://lore.kernel.org/r/20211209010455.42744-6-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09aio: keep poll requests on waitqueue until completedEric Biggers
Currently, aio_poll_wake() will always remove the poll request from the waitqueue. Then, if aio_poll_complete_work() sees that none of the polled events are ready and the request isn't cancelled, it re-adds the request to the waitqueue. (This can easily happen when polling a file that doesn't pass an event mask when waking up its waitqueue.) This is fundamentally broken for two reasons: 1. If a wakeup occurs between vfs_poll() and the request being re-added to the waitqueue, it will be missed because the request wasn't on the waitqueue at the time. Therefore, IOCB_CMD_POLL might never complete even if the polled file is ready. 2. When the request isn't on the waitqueue, there is no way to be notified that the waitqueue is being freed (which happens when its lifetime is shorter than the struct file's). This is supposed to happen via the waitqueue entries being woken up with POLLFREE. Therefore, leave the requests on the waitqueue until they are actually completed (or cancelled). To keep track of when aio_poll_complete_work needs to be scheduled, use new fields in struct poll_iocb. Remove the 'done' field which is now redundant. Note that this is consistent with how sys_poll() and eventpoll work; their wakeup functions do *not* remove the waitqueue entries. Fixes: 2c14fa838cbe ("aio: implement IOCB_CMD_POLL") Cc: <stable@vger.kernel.org> # v4.18+ Link: https://lore.kernel.org/r/20211209010455.42744-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-09signalfd: use wake_up_pollfree()Eric Biggers
wake_up_poll() uses nr_exclusive=1, so it's not guaranteed to wake up all exclusive waiters. Yet, POLLFREE *must* wake up all waiters. epoll and aio poll are fortunately not affected by this, but it's very fragile. Thus, the new function wake_up_pollfree() has been introduced. Convert signalfd to use wake_up_pollfree(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: d80e731ecab4 ("epoll: introduce POLLFREE to flush ->signalfd_wqh before kfree()") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211209010455.42744-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com>
2021-12-08cifs: fix ntlmssp auth when there is no key exchangePaulo Alcantara
Warn on the lack of key exchange during NTLMSSP authentication rather than aborting it as there are some servers that do not set it in CHALLENGE message. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-08btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handlingQu Wenruo
I hit the BUG_ON() with generic/475 test case, and to my surprise, all callers of btrfs_del_root_ref() are already aborting transaction, thus there is not need for such BUG_ON(), just go to @out label and caller will properly handle the error. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: zoned: clear data relocation bg on zone finishJohannes Thumshirn
When finishing a zone that is used by a dedicated data relocation block group, also remove its reference from fs_info, so we're not trying to use a full block group for allocations during data relocation, which will always fail. The result is we're not making any forward progress and end up in a deadlock situation. Fixes: c2707a255623 ("btrfs: zoned: add a dedicated data relocation block group") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: free exchange changeset on failuresJohannes Thumshirn
Fstests runs on my VMs have show several kmemleak reports like the following. unreferenced object 0xffff88811ae59080 (size 64): comm "xfs_io", pid 12124, jiffies 4294987392 (age 6.368s) hex dump (first 32 bytes): 00 c0 1c 00 00 00 00 00 ff cf 1c 00 00 00 00 00 ................ 90 97 e5 1a 81 88 ff ff 90 97 e5 1a 81 88 ff ff ................ backtrace: [<00000000ac0176d2>] ulist_add_merge+0x60/0x150 [btrfs] [<0000000076e9f312>] set_state_bits+0x86/0xc0 [btrfs] [<0000000014fe73d6>] set_extent_bit+0x270/0x690 [btrfs] [<000000004f675208>] set_record_extent_bits+0x19/0x20 [btrfs] [<00000000b96137b1>] qgroup_reserve_data+0x274/0x310 [btrfs] [<0000000057e9dcbb>] btrfs_check_data_free_space+0x5c/0xa0 [btrfs] [<0000000019c4511d>] btrfs_delalloc_reserve_space+0x1b/0xa0 [btrfs] [<000000006d37e007>] btrfs_dio_iomap_begin+0x415/0x970 [btrfs] [<00000000fb8a74b8>] iomap_iter+0x161/0x1e0 [<0000000071dff6ff>] __iomap_dio_rw+0x1df/0x700 [<000000002567ba53>] iomap_dio_rw+0x5/0x20 [<0000000072e555f8>] btrfs_file_write_iter+0x290/0x530 [btrfs] [<000000005eb3d845>] new_sync_write+0x106/0x180 [<000000003fb505bf>] vfs_write+0x24d/0x2f0 [<000000009bb57d37>] __x64_sys_pwrite64+0x69/0xa0 [<000000003eba3fdf>] do_syscall_64+0x43/0x90 In case brtfs_qgroup_reserve_data() or btrfs_delalloc_reserve_metadata() fail the allocated extent_changeset will not be freed. So in btrfs_check_data_free_space() and btrfs_delalloc_reserve_space() free the allocated extent_changeset to get rid of the allocated memory. The issue currently only happens in the direct IO write path, but only after 65b3c08606e5 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range"), and also at defrag_one_locked_target(). Every other place is always calling extent_changeset_free() even if its call to btrfs_delalloc_reserve_space() or btrfs_check_data_free_space() has failed. CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: fix re-dirty process of tree-log nodesNaohiro Aota
There is a report of a transaction abort of -EAGAIN with the following script. #!/bin/sh for d in sda sdb; do mkfs.btrfs -d single -m single -f /dev/\${d} done mount /dev/sda /mnt/test mount /dev/sdb /mnt/scratch for dir in test scratch; do echo 3 >/proc/sys/vm/drop_caches fio --directory=/mnt/\${dir} --name=fio.\${dir} --rw=read --size=50G --bs=64m \ --numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \ --group_reporting |& tee /dev/shm/fio.\${dir} echo 3 >/proc/sys/vm/drop_caches done for d in sda sdb; do umount /dev/\${d} done The stack trace is shown in below. [3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction) [3310.968060] BTRFS info (device sda): forced readonly [3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction. [3310.968065] ------------[ cut here ]------------ [3310.968066] BTRFS: Transaction aborted (error -11) [3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8 [3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1 [3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021 [3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8 [3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282 [3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027 [3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00 [3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48 [3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00 [3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58 [3310.968154] FS: 00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000 [3310.968157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0 [3310.968160] PKRU: 55555554 [3310.968161] Call Trace: [3310.968167] ? dput+0xd4/0x300 [3310.968174] btrfs_sync_file+0x3f1/0x490 [3310.968180] __x64_sys_fsync+0x33/0x60 [3310.968185] do_syscall_64+0x3b/0x90 [3310.968190] entry_SYSCALL_64_after_hwframe+0x44/0xae [3310.968194] RIP: 0033:0x7efe6557329b [3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b [3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006 [3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010 [3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980 [3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000 [3310.968212] ---[ end trace 1a346f4d3c0d96ba ]--- [3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown The abort occurs because of a write hole while writing out freeing tree nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree node to ensure btrfs can write the region and does not leave a hole on write on a zoned device. The current code fails to re-dirty a node when the tree-log tree's depth is greater or equal to 2. That leads to a transaction abort with -EAGAIN. Fix the issue by properly re-dirtying a node on walking up the tree. Fixes: d3575156f662 ("btrfs: zoned: redirty released extent buffers") CC: stable@vger.kernel.org # 5.12+ Link: https://github.com/kdave/btrfs-progs/issues/415 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: call mapping_set_error() on btree inode with a write errorJosef Bacik
generic/484 fails sometimes with compression on because the write ends up small enough that it goes into the btree. This means that we never call mapping_set_error() on the inode itself, because the page gets marked as fine when we inline it into the metadata. When the metadata writeback happens we see it and abort the transaction properly and mark the fs as readonly, however we don't do the mapping_set_error() on anything. In syncfs() we will simply return 0 if the sb is marked read-only, so we can't check for this in our syncfs callback. The only way the error gets returned if we called mapping_set_error() on something. Fix this by calling mapping_set_error() on the btree inode mapping. This allows us to properly return an error on syncfs and pass generic/484 with compression on. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: clear extent buffer uptodate when we fail to write itJosef Bacik
I got dmesg errors on generic/281 on our overnight fstests. Looking at the history this happens occasionally, with errors like this WARNING: CPU: 0 PID: 673217 at fs/btrfs/extent_io.c:6848 assert_eb_page_uptodate+0x3f/0x50 CPU: 0 PID: 673217 Comm: kworker/u4:13 Tainted: G W 5.16.0-rc2+ #469 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: btrfs-cache btrfs_work_helper RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 RSP: 0018:ffffae598230bc60 EFLAGS: 00010246 RAX: 0017ffffc0002112 RBX: ffffebaec4100900 RCX: 0000000000001000 RDX: ffffebaec45733c7 RSI: ffffebaec4100900 RDI: ffff9fd98919f340 RBP: 0000000000000d56 R08: ffff9fd98e300000 R09: 0000000000000000 R10: 0001207370a91c50 R11: 0000000000000000 R12: 00000000000007b0 R13: ffff9fd98919f340 R14: 0000000001500000 R15: 0000000001cb0000 FS: 0000000000000000(0000) GS:ffff9fd9fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f549fcf8940 CR3: 0000000114908004 CR4: 0000000000370ef0 Call Trace: extent_buffer_test_bit+0x3f/0x70 free_space_test_bit+0xa6/0xc0 load_free_space_tree+0x1d6/0x430 caching_thread+0x454/0x630 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? rcu_read_lock_sched_held+0x12/0x60 ? lock_release+0x1f0/0x2d0 btrfs_work_helper+0xf2/0x3e0 ? lock_release+0x1f0/0x2d0 ? finish_task_switch.isra.0+0xf9/0x3a0 process_one_work+0x270/0x5a0 worker_thread+0x55/0x3c0 ? process_one_work+0x5a0/0x5a0 kthread+0x174/0x1a0 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we're trying to read from a extent buffer page that is !PageUptodate. This happens because we will clear the page uptodate when we have an IO error, but we don't clear the extent buffer uptodate. If we do a read later and find this extent buffer we'll think its valid and not return an error, and then trip over this warning. Fix this by also clearing uptodate on the extent buffer when this happens, so that we get an error when we do a btrfs_search_slot() and find this block later. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: fail if fstrim_range->start == U64_MAXJosef Bacik
We've always been failing generic/260 because it's testing things we actually don't care about and thus won't fail for. However we probably should fail for fstrim_range->start == U64_MAX since we clearly can't trim anything past that. This in combination with an update to generic/260 will allow us to pass this test properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08btrfs: fix error pointer dereference in btrfs_ioctl_rm_dev_v2()Dan Carpenter
If memdup_user() fails the error handing will crash when it tries to kfree() an error pointer. Just return directly because there is no cleanup required. Fixes: 1a15eb724aae ("btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-12-08tracefs: Set all files to the same group ownership as the mount optionSteven Rostedt (VMware)
As people have been asking to allow non-root processes to have access to the tracefs directory, it was considered best to only allow groups to have access to the directory, where it is easier to just set the tracefs file system to a specific group (as other would be too dangerous), and that way the admins could pick which processes would have access to tracefs. Unfortunately, this broke tooling on Android that expected the other bit to be set. For some special cases, for non-root tools to trace the system, tracefs would be mounted and change the permissions of the top level directory which gave access to all running tasks permission to the tracing directory. Even though this would be dangerous to do in a production environment, for testing environments this can be useful. Now with the new changes to not allow other (which is still the proper thing to do), it breaks the testing tooling. Now more code needs to be loaded on the system to change ownership of the tracing directory. The real solution is to have tracefs honor the gid=xxx option when mounting. That is, (tracing group tracing has value 1003) mount -t tracefs -o gid=1003 tracefs /sys/kernel/tracing should have it that all files in the tracing directory should be of the given group. Copy the logic from d_walk() from dcache.c and simplify it for the mount case of tracefs if gid is set. All the files in tracefs will be walked and their group will be set to the value passed in. Link: https://lkml.kernel.org/r/20211207171729.2a54e1b3@gandalf.local.home Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-fsdevel@vger.kernel.org Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reported-by: Kalesh Singh <kaleshsingh@google.com> Reported-by: Yabin Cui <yabinc@google.com> Fixes: 49d67e445742 ("tracefs: Have tracefs directories not set OTH permission bits by default") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-08tracefs: Have new files inherit the ownership of their parentSteven Rostedt (VMware)
If directories in tracefs have their ownership changed, then any new files and directories that are created under those directories should inherit the ownership of the director they are created in. Link: https://lkml.kernel.org/r/20211208075720.4855d180@gandalf.local.home Cc: Kees Cook <keescook@chromium.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Yabin Cui <yabinc@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: stable@vger.kernel.org Fixes: 4282d60689d4f ("tracefs: Add new tracefs file system") Reported-by: Kalesh Singh <kaleshsingh@google.com> Reported: https://lore.kernel.org/all/CAC_TJve8MMAv+H_NdLSJXZUSoxOEq2zB_pVaJ9p=7H6Bu3X76g@mail.gmail.com/ Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2021-12-07cifs: Fix crash on unload of cifs_arc4.koVincent Whitchurch
The exit function is wrongly placed in the __init section and this leads to a crash when the module is unloaded. Just remove both the init and exit functions since this module does not need them. Fixes: 71c02863246167b3d ("cifs: fork arc4 and create a separate module...") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Cc: stable@vger.kernel.org # 5.15 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-12-07xfs: remove all COW fork extents when remounting readonlyxfs-5.16-fixes-3Darrick J. Wong
As part of multiple customer escalations due to file data corruption after copy on write operations, I wrote some fstests that use fsstress to hammer on COW to shake things loose. Regrettably, I caught some filesystem shutdowns due to incorrect rmap operations with the following loop: mount <filesystem> # (0) fsstress <run only readonly ops> & # (1) while true; do fsstress <run all ops> mount -o remount,ro # (2) fsstress <run only readonly ops> mount -o remount,rw # (3) done When (2) happens, notice that (1) is still running. xfs_remount_ro will call xfs_blockgc_stop to walk the inode cache to free all the COW extents, but the blockgc mechanism races with (1)'s reader threads to take IOLOCKs and loses, which means that it doesn't clean them all out. Call such a file (A). When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which walks the ondisk refcount btree and frees any COW extent that it finds. This function does not check the inode cache, which means that incore COW forks of inode (A) is now inconsistent with the ondisk metadata. If one of those former COW extents are allocated and mapped into another file (B) and someone triggers a COW to the stale reservation in (A), A's dirty data will be written into (B) and once that's done, those blocks will be transferred to (A)'s data fork without bumping the refcount. The results are catastrophic -- file (B) and the refcount btree are now corrupt. Solve this race by forcing the xfs_blockgc_free_space to run synchronously, which causes xfs_icwalk to return to inodes that were skipped because the blockgc code couldn't take the IOLOCK. This is safe to do here because the VFS has already prohibited new writer threads. Fixes: 10ddf64e420f ("xfs: remove leftover CoW reservations when remounting ro") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
2021-12-07netfs: fix parameter of cleanup()Jeffle Xu
The order of these two parameters is just reversed. gcc didn't warn on that, probably because 'void *' can be converted from or to other pointer types without warning. Cc: stable@vger.kernel.org Fixes: 3d3c95046742 ("netfs: Provide readahead and readpage netfs helpers") Fixes: e1b1240c1ff5 ("netfs: Add write_begin helper") Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> Link: https://lore.kernel.org/r/20211207031449.100510-1-jefflexu@linux.alibaba.com/ # v1
2021-12-07netfs: Fix lockdep warning from taking sb_writers whilst holding mmap_lockDavid Howells
Taking sb_writers whilst holding mmap_lock isn't allowed and will result in a lockdep warning like that below. The problem comes from cachefiles needing to take the sb_writers lock in order to do a write to the cache, but being asked to do this by netfslib called from readpage, readahead or write_begin[1]. Fix this by always offloading the write to the cache off to a worker thread. The main thread doesn't need to wait for it, so deadlock can be avoided. This can be tested by running the quick xfstests on something like afs or ceph with lockdep enabled. WARNING: possible circular locking dependency detected 5.15.0-rc1-build2+ #292 Not tainted ------------------------------------------------------ holetest/65517 is trying to acquire lock: ffff88810c81d730 (mapping.invalidate_lock#3){.+.+}-{3:3}, at: filemap_fault+0x276/0x7a5 but task is already holding lock: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&mm->mmap_lock#2){++++}-{3:3}: validate_chain+0x3c4/0x4a8 __lock_acquire+0x89d/0x949 lock_acquire+0x2dc/0x34b __might_fault+0x87/0xb1 strncpy_from_user+0x25/0x18c removexattr+0x7c/0xe5 __do_sys_fremovexattr+0x73/0x96 do_syscall_64+0x67/0x7a entry_SYSCALL_64_after_hwframe+0x44/0xae -> #1 (sb_writers#10){.+.+}-{0:0}: validate_chain+0x3c4/0x4a8 __lock_acquire+0x89d/0x949 lock_acquire+0x2dc/0x34b cachefiles_write+0x2b3/0x4bb netfs_rreq_do_write_to_cache+0x3b5/0x432 netfs_readpage+0x2de/0x39d filemap_read_page+0x51/0x94 filemap_get_pages+0x26f/0x413 filemap_read+0x182/0x427 new_sync_read+0xf0/0x161 vfs_read+0x118/0x16e ksys_read+0xb8/0x12e do_syscall_64+0x67/0x7a entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (mapping.invalidate_lock#3){.+.+}-{3:3}: check_noncircular+0xe4/0x129 check_prev_add+0x16b/0x3a4 validate_chain+0x3c4/0x4a8 __lock_acquire+0x89d/0x949 lock_acquire+0x2dc/0x34b down_read+0x40/0x4a filemap_fault+0x276/0x7a5 __do_fault+0x96/0xbf do_fault+0x262/0x35a __handle_mm_fault+0x171/0x1b5 handle_mm_fault+0x12a/0x233 do_user_addr_fault+0x3d2/0x59c exc_page_fault+0x85/0xa5 asm_exc_page_fault+0x1e/0x30 other info that might help us debug this: Chain exists of: mapping.invalidate_lock#3 --> sb_writers#10 --> &mm->mmap_lock#2 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mm->mmap_lock#2); lock(sb_writers#10); lock(&mm->mmap_lock#2); lock(mapping.invalidate_lock#3); *** DEADLOCK *** 1 lock held by holetest/65517: #0: ffff8881595b53e8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x28d/0x59c stack backtrace: CPU: 0 PID: 65517 Comm: holetest Not tainted 5.15.0-rc1-build2+ #292 Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014 Call Trace: dump_stack_lvl+0x45/0x59 check_noncircular+0xe4/0x129 ? print_circular_bug+0x207/0x207 ? validate_chain+0x461/0x4a8 ? add_chain_block+0x88/0xd9 ? hlist_add_head_rcu+0x49/0x53 check_prev_add+0x16b/0x3a4 validate_chain+0x3c4/0x4a8 ? check_prev_add+0x3a4/0x3a4 ? mark_lock+0xa5/0x1c6 __lock_acquire+0x89d/0x949 lock_acquire+0x2dc/0x34b ? filemap_fault+0x276/0x7a5 ? rcu_read_unlock+0x59/0x59 ? add_to_page_cache_lru+0x13c/0x13c ? lock_is_held_type+0x7b/0xd3 down_read+0x40/0x4a ? filemap_fault+0x276/0x7a5 filemap_fault+0x276/0x7a5 ? pagecache_get_page+0x2dd/0x2dd ? __lock_acquire+0x8bc/0x949 ? pte_offset_kernel.isra.0+0x6d/0xc3 __do_fault+0x96/0xbf ? do_fault+0x124/0x35a do_fault+0x262/0x35a ? handle_pte_fault+0x1c1/0x20d __handle_mm_fault+0x171/0x1b5 ? handle_pte_fault+0x20d/0x20d ? __lock_release+0x151/0x254 ? mark_held_locks+0x1f/0x78 ? rcu_read_unlock+0x3a/0x59 handle_mm_fault+0x12a/0x233 do_user_addr_fault+0x3d2/0x59c ? pgtable_bad+0x70/0x70 ? rcu_read_lock_bh_held+0xab/0xab exc_page_fault+0x85/0xa5 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x40192f Code: ff 48 89 c3 48 8b 05 50 28 00 00 48 85 ed 7e 23 31 d2 4b 8d 0c 2f eb 0a 0f 1f 00 48 8b 05 39 28 00 00 48 0f af c2 48 83 c2 01 <48> 89 1c 01 48 39 d5 7f e8 8b 0d f2 27 00 00 31 c0 85 c9 74 0e 8b RSP: 002b:00007f9931867eb0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 00007f9931868700 RCX: 00007f993206ac00 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00007ffc13e06ee0 RBP: 0000000000000100 R08: 0000000000000000 R09: 00007f9931868700 R10: 00007f99318689d0 R11: 0000000000000202 R12: 00007ffc13e06ee0 R13: 0000000000000c00 R14: 00007ffc13e06e00 R15: 00007f993206a000 Fixes: 726218fdc22c ("netfs: Define an interface to talk to a cache") Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jeff Layton <jlayton@kernel.org> cc: Jan Kara <jack@suse.cz> cc: linux-cachefs@redhat.com cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/20210922110420.GA21576@quack2.suse.cz/ [1] Link: https://lore.kernel.org/r/163887597541.1596626.2668163316598972956.stgit@warthog.procyon.org.uk/ # v1
2021-12-06io-wq: remove spurious bit clear on task_work additionJens Axboe
There's a small race here where the task_work could finish and drop the worker itself, so that by the time that task_work_add() returns with a successful addition we've already put the worker. The worker callbacks clear this bit themselves, so we don't actually need to manually clear it in the caller. Get rid of it. Reported-by: syzbot+b60c982cb0efc5e05a47@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-12-04Merge tag 'xfs-5.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "Remove an unnecessary (and backwards) rename flags check that duplicates a VFS level check" * tag 'xfs-5.16-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove incorrect ASSERT in xfs_rename
2021-12-04Merge tag '5.16-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Three SMB3 multichannel/fscache fixes and a DFS fix. In testing multichannel reconnect scenarios recently various problems with the cifs.ko implementation of fscache were found (e.g. incorrect initialization of fscache cookies in some cases)" * tag '5.16-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: avoid use of dstaddr as key for fscache client cookie cifs: add server conn_id to fscache client cookie cifs: wait for tcon resource_id before getting fscache super cifs: fix missed refcounting of ipc tcon