summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-03-10xfs: check owner of dir3 blocksverifier-fixes_2020-03-10Darrick J. Wong
Check the owner field of dir3 block headers. If it's corrupt, release the buffer and return EFSCORRUPTED. All callers handle this properly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: check owner of dir3 data blocksDarrick J. Wong
Check the owner field of dir3 data block headers. If it's corrupt, release the buffer and return EFSCORRUPTED. All callers handle this properly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: check owner of dir3 free blocksDarrick J. Wong
Check the owner field of dir3 free block headers and reject the metadata if there's something wrong with it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-03-10xfs: don't ever return a stale pointer from __xfs_dir3_free_readDarrick J. Wong
If we decide that a directory free block is corrupt, we must take care not to leak a buffer pointer to the caller. After xfs_trans_brelse returns, the buffer can be freed or reused, which means that we have to set *bpp back to NULL. Callers are supposed to notice the nonzero return value and not use the buffer pointer, but we should code more defensively, even if all current callers handle this situation correctly. Fixes: de14c5f541e7 ("xfs: verify free block header fields") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: fix buffer corruption reporting when xfs_dir3_free_header_check failsDarrick J. Wong
xfs_verifier_error is supposed to be called on a corrupt metadata buffer from within a buffer verifier function, whereas xfs_buf_corruption_error is the function to be called when a piece of code has read a buffer and catches something that a read verifier cannot. The first function sets b_error anticipating that the low level buffer handling code will see the nonzero b_error and clear XBF_DONE on the buffer, whereas the second function does not. Since xfs_dir3_free_header_check examines fields in the dir free block header that require more context than can be provided to read verifiers, we must call xfs_buf_corruption_error when it finds a problem. Switching the calls has a secondary effect that we no longer corrupt the buffer state by setting b_error and leaving XBF_DONE set. When /that/ happens, we'll trip over various state assertions (most commonly the b_error check in xfs_buf_reverify) on a subsequent attempt to read the buffer. Fixes: bc1a09b8e334bf5f ("xfs: refactor verifier callers to print address of failing check") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: xfs_buf_corruption_error should take __this_addressDarrick J. Wong
Add a xfs_failaddr_t parameter to this function so that callers can potentially pass in (and therefore report) the exact point in the code where we decided that a metadata buffer was corrupt. This enables us to wire it up to checking functions that have to run outside of verifiers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: add a function to deal with corrupt buffers post-verifiersDarrick J. Wong
Add a helper function to get rid of buffers that we have decided are corrupt after the verifiers have run. This function is intended to handle metadata checks that can't happen in the verifiers, such as inter-block relationship checking. Note that we now mark the buffer stale so that it will not end up on any LRU and will be purged on release. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: fix xfs_rmap_has_other_keys usage of ECANCELEDrandom-fixes_2020-03-10Darrick J. Wong
In e7ee96dfb8c26, we converted all ITER_ABORT users to use ECANCELED instead, but we forgot to teach xfs_rmap_has_other_keys not to return that magic value to callers. Fix it now. Fixes: e7ee96dfb8c26 ("xfs: remove all *_ITER_ABORT values") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: fix use-after-free when aborting corrupt attr inactivationDarrick J. Wong
Log the corrupt buffer before we release the buffer. Fixes: a5155b870d687 ("xfs: always log corruption errors") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: increase the default parallelism levels of pwork clientspwork-parallelism_2020-03-10Darrick J. Wong
Increase the default parallelism level for pwork clients so that we can take advantage of computers with a lot of CPUs and a lot of hardware. 8x raid0 spinning rust running quotacheck: 1 39s 2 29s 4 26s 8 24s 24 (nr_cpus) 24s 4x raid0 sata ssds running quotacheck: 1 12s 2 12s 4 12s 8 13s 24 (nr_cpus) 14s 4x raid0 nvme ssds running quotacheck: 1 18s 2 18s 4 19s 8 20s 20 (nr_cpus) 20s So, mixed results... Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: measure all contiguous previous extents for prealloc sizestale-exposure_2020-03-10Darrick J. Wong
When we're estimating a new speculative preallocation length for an extending write, we should walk backwards through the extent list to determine the number of number of blocks that are physically and logically contiguous with the write offset, and use that as an input to the preallocation size computation. This way, preallocation length is truly measured by the effectiveness of the allocator in giving us contiguous allocations without being influenced by the state of a given extent. This fixes both the problem where ZERO_RANGE within an EOF can reduce preallocation, and prevents the unnecessary shrinkage of preallocation when delalloc extents are turned into unwritten extents. This was found as a regression in xfs/014 after changing delalloc writes to create unwritten extents during writeback. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: don't fail unwritten extent conversion on writeback due to edquotDarrick J. Wong
During writeback, it's possible for the quota block reservation in xfs_iomap_write_unwritten to fail with EDQUOT because we hit the quota limit. This causes writeback errors for data that was already written to disk, when it's not even guaranteed that the bmbt will expand to exceed the quota limit. Irritatingly, this condition is reported to userspace as EIO by fsync, which is confusing. We wrote the data, so allow the reservation. That might put us slightly above the hard limit, but it's better than losing data after a write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-10xfs: force writes to delalloc regions to unwrittenDarrick J. Wong
When writing to a delalloc region in the data fork, commit the new allocations (of the da reservation) as unwritten so that the mappings are only marked written once writeback completes successfully. This fixes the problem of stale data exposure if the system goes down during targeted writeback of a specific region of a file, as tested by generic/042. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-03-08Merge tag 'driver-core-5.6-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core and debugfs fixes from Greg KH: "Here are four small driver core / debugfs patches for 5.6-rc3: - debugfs api cleanup now that all debugfs_create_regset32() callers have been fixed up. This was waiting until after the -rc1 merge as these fixes came in through different trees - driver core sync state fixes based on reports of minor issues found in the feature All of these have been in linux-next with no reported issues" * tag 'driver-core-5.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: driver core: Skip unnecessary work when device doesn't have sync_state() driver core: Add dev_has_sync_state() driver core: Call sync_state() even if supplier has no consumers debugfs: remove return value of debugfs_create_regset32()
2020-03-07Merge tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Here are a few io_uring fixes that should go into this release. This contains: - Removal of (now) unused io_wq_flush() and associated flag (Pavel) - Fix cancelation lockup with linked timeouts (Pavel) - Fix for potential use-after-free when freeing percpu ref for fixed file sets - io-wq cancelation fixups (Pavel)" * tag 'io_uring-5.6-2020-03-07' of git://git.kernel.dk/linux-block: io_uring: fix lockup with timeouts io_uring: free fixed_file_data after RCU grace period io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNAL io-wq: fix IO_WQ_WORK_NO_CANCEL cancellation
2020-03-07io_uring: fix lockup with timeoutsio_uring-5.6-2020-03-07Pavel Begunkov
There is a recipe to deadlock the kernel: submit a timeout sqe with a linked_timeout (e.g. test_single_link_timeout_ception() from liburing), and SIGKILL the process. Then, io_kill_timeouts() takes @ctx->completion_lock, but the timeout isn't flagged with REQ_F_COMP_LOCKED, and will try to double grab it during io_put_free() to cancel the linked timeout. Probably, the same can happen with another io_kill_timeout() call site, that is io_commit_cqring(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06Merge tag 'for-5.6-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One fixup for DIO when in use with the new checksums, a missed case where the checksum size was still assuming u32" * tag 'for-5.6-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix RAID direct I/O reads with alternate csums
2020-03-06Merge tag 'filelock-v5.6-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux Pull file locking fixes from Jeff Layton: "Just a couple of late-breaking patches for the file locking code. The second patch (from yangerkun) fixes a rather nasty looking potential use-after-free that should go to stable. The other patch could technically wait for 5.7, but it's fairly innocuous so I figured we might as well take it" * tag 'filelock-v5.6-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: locks: fix a potential use-after-free problem when wakeup a waiter fcntl: Distribute switch variables for initialization
2020-03-06io_uring: free fixed_file_data after RCU grace periodJens Axboe
The percpu refcount protects this structure, and we can have an atomic switch in progress when exiting. This makes it unsafe to just free the struct normally, and can trigger the following KASAN warning: BUG: KASAN: use-after-free in percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 Read of size 1 at addr ffff888181a19a30 by task swapper/0/0 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.6.0-rc4+ #5747 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 Call Trace: <IRQ> dump_stack+0x76/0xa0 print_address_description.constprop.0+0x3b/0x60 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 __kasan_report.cold+0x1a/0x3d ? percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 percpu_ref_switch_to_atomic_rcu+0xfa/0x1b0 rcu_core+0x370/0x830 ? percpu_ref_exit+0x50/0x50 ? rcu_note_context_switch+0x7b0/0x7b0 ? run_rebalance_domains+0x11d/0x140 __do_softirq+0x10a/0x3e9 irq_exit+0xd5/0xe0 smp_apic_timer_interrupt+0x86/0x200 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x26/0x1f0 Fix this by punting the final exit and free of the struct to RCU, then we know that it's safe to do so. Jann suggested the approach of using a double rcu callback to achieve this. It's important that we do a nested call_rcu() callback, as otherwise the free could be ordered before the atomic switch, even if the latter was already queued. Reported-by: syzbot+e017e49c39ab484ac87a@syzkaller.appspotmail.com Suggested-by: Jann Horn <jannh@google.com> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-06locks: fix a potential use-after-free problem when wakeup a waiteryangerkun
'16306a61d3b7 ("fs/locks: always delete_block after waiting.")' add the logic to check waiter->fl_blocker without blocked_lock_lock. And it will trigger a UAF when we try to wakeup some waiter: Thread 1 has create a write flock a on file, and now thread 2 try to unlock and delete flock a, thread 3 try to add flock b on the same file. Thread2 Thread3 flock syscall(create flock b) ...flock_lock_inode_wait flock_lock_inode(will insert our fl_blocked_member list to flock a's fl_blocked_requests) sleep flock syscall(unlock) ...flock_lock_inode_wait locks_delete_lock_ctx ...__locks_wake_up_blocks __locks_delete_blocks( b->fl_blocker = NULL) ... break by a signal locks_delete_block b->fl_blocker == NULL && list_empty(&b->fl_blocked_requests) success, return directly locks_free_lock b wake_up(&b->fl_waiter) trigger UAF Fix it by remove this logic, and this patch may also fix CVE-2019-19769. Cc: stable@vger.kernel.org Fixes: 16306a61d3b7 ("fs/locks: always delete_block after waiting.") Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2020-03-06fat: fix uninit-memory access for partial initialized inodeOGAWA Hirofumi
When get an error in the middle of reading an inode, some fields in the inode might be still not initialized. And then the evict_inode path may access those fields via iput(). To fix, this makes sure that inode fields are initialized. Reported-by: syzbot+9d82b8de2992579da5d0@syzkaller.appspotmail.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/871rqnreqx.fsf@mail.parknet.co.jp Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-03-03Merge tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Five small cifs/smb3 fixes, two for stable (one for a reconnect problem and the other fixes a use case when renaming an open file)" * tag '5.6-rc4-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: Use #define in cifs_dbg cifs: fix rename() by ensuring source handle opened with DELETE bit cifs: add missing mount option to /proc/mounts cifs: fix potential mismatch of UNC paths cifs: don't leak -EAGAIN for stat() during reconnect
2020-03-03fcntl: Distribute switch variables for initializationKees Cook
Variables declared in a switch statement before any case statements cannot be automatically initialized with compiler instrumentation (as they are not part of any execution flow). With GCC's proposed automatic stack variable initialization feature, this triggers a warning (and they don't get initialized). Clang's automatic stack variable initialization (via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't initialize such variables[1]. Note that these warnings (or silent skipping) happen before the dead-store elimination optimization phase, so even when the automatic initializations are later elided in favor of direct initializations, the warnings remain. To avoid these problems, move such variables into the "case" where they're used or lift them up into the main function body. fs/fcntl.c: In function ‘send_sigio_to_task’: fs/fcntl.c:738:20: warning: statement will never be executed [-Wswitch-unreachable] 738 | kernel_siginfo_t si; | ^~ [1] https://bugs.llvm.org/show_bug.cgi?id=44916 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Jeff Layton <jlayton@kernel.org>
2020-03-03btrfs: fix RAID direct I/O reads with alternate csumsOmar Sandoval
btrfs_lookup_and_bind_dio_csum() does pointer arithmetic which assumes 32-bit checksums. If using a larger checksum, this leads to spurious failures when a direct I/O read crosses a stripe. This is easy to reproduce: # mkfs.btrfs -f --checksum blake2 -d raid0 /dev/vdc /dev/vdd ... # mount /dev/vdc /mnt # cd /mnt # dd if=/dev/urandom of=foo bs=1M count=1 status=none # dd if=foo of=/dev/null bs=1M iflag=direct status=none dd: error reading 'foo': Input/output error # dmesg | tail -1 [ 135.821568] BTRFS warning (device vdc): csum failed root 5 ino 257 off 421888 ... Fix it by using the actual checksum size. Fixes: 1e25a2e3ca0d ("btrfs: don't assume ordered sums to be 4 bytes") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-03-02io-wq: remove io_wq_flush and IO_WQ_WORK_INTERNALPavel Begunkov
io_wq_flush() is buggy, during cancelation of a flush, the associated work may be passed to the caller's (i.e. io_uring) @match callback. That callback is expecting it to be embedded in struct io_kiocb. Cancelation of internal work probably doesn't make a lot of sense to begin with. As the flush helper is no longer used, just delete it and the associated work flag. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-02io-wq: fix IO_WQ_WORK_NO_CANCEL cancellationPavel Begunkov
To cancel a work, io-wq sets IO_WQ_WORK_CANCEL and executes the callback. However, IO_WQ_WORK_NO_CANCEL works will just execute and may return next work, which will be ignored and lost. Cancel the whole link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-01Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "Two more bug fixes (including a regression) for 5.6" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: potential crash on allocation error in ext4_alloc_flex_bg_array() jbd2: fix data races at struct journal_head
2020-02-29ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()Dan Carpenter
If sbi->s_flex_groups_allocated is zero and the first allocation fails then this code will crash. The problem is that "i--" will set "i" to -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1 is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX is more than zero, the condition is true so we call kvfree(new_groups[-1]). The loop will carry on freeing invalid memory until it crashes. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-29jbd2: fix data races at struct journal_headQian Cai
journal_head::b_transaction and journal_head::b_next_transaction could be accessed concurrently as noticed by KCSAN, LTP: starting fsync04 /dev/zero: Can't open blockdev EXT4-fs (loop0): mounting ext3 file system using the ext4 subsystem EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null) ================================================================== BUG: KCSAN: data-race in __jbd2_journal_refile_buffer [jbd2] / jbd2_write_access_granted [jbd2] write to 0xffff99f9b1bd0e30 of 8 bytes by task 25721 on cpu 70: __jbd2_journal_refile_buffer+0xdd/0x210 [jbd2] __jbd2_journal_refile_buffer at fs/jbd2/transaction.c:2569 jbd2_journal_commit_transaction+0x2d15/0x3f20 [jbd2] (inlined by) jbd2_journal_commit_transaction at fs/jbd2/commit.c:1034 kjournald2+0x13b/0x450 [jbd2] kthread+0x1cd/0x1f0 ret_from_fork+0x27/0x50 read to 0xffff99f9b1bd0e30 of 8 bytes by task 25724 on cpu 68: jbd2_write_access_granted+0x1b2/0x250 [jbd2] jbd2_write_access_granted at fs/jbd2/transaction.c:1155 jbd2_journal_get_write_access+0x2c/0x60 [jbd2] __ext4_journal_get_write_access+0x50/0x90 [ext4] ext4_mb_mark_diskspace_used+0x158/0x620 [ext4] ext4_mb_new_blocks+0x54f/0xca0 [ext4] ext4_ind_map_blocks+0xc79/0x1b40 [ext4] ext4_map_blocks+0x3b4/0x950 [ext4] _ext4_get_block+0xfc/0x270 [ext4] ext4_get_block+0x3b/0x50 [ext4] __block_write_begin_int+0x22e/0xae0 __block_write_begin+0x39/0x50 ext4_write_begin+0x388/0xb50 [ext4] generic_perform_write+0x15d/0x290 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe 5 locks held by fsync04/25724: #0: ffff99f9911093f8 (sb_writers#13){.+.+}, at: vfs_write+0x21c/0x260 #1: ffff99f9db4c0348 (&sb->s_type->i_mutex_key#15){+.+.}, at: ext4_buffered_write_iter+0x65/0x210 [ext4] #2: ffff99f5e7dfcf58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2] #3: ffff99f9db4c0168 (&ei->i_data_sem){++++}, at: ext4_map_blocks+0x176/0x950 [ext4] #4: ffffffff99086b40 (rcu_read_lock){....}, at: jbd2_write_access_granted+0x4e/0x250 [jbd2] irq event stamp: 1407125 hardirqs last enabled at (1407125): [<ffffffff980da9b7>] __find_get_block+0x107/0x790 hardirqs last disabled at (1407124): [<ffffffff980da8f9>] __find_get_block+0x49/0x790 softirqs last enabled at (1405528): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c softirqs last disabled at (1405521): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0 Reported by Kernel Concurrency Sanitizer on: CPU: 68 PID: 25724 Comm: fsync04 Tainted: G L 5.6.0-rc2-next-20200221+ #7 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 The plain reads are outside of jh->b_state_lock critical section which result in data races. Fix them by adding pairs of READ|WRITE_ONCE(). Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/20200222043111.2227-1-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-02-28Merge tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: - Fix for a race with IOPOLL used with SQPOLL (Xiaoguang) - Only show ->fdinfo if procfs is enabled (Tobias) - Fix for a chain with multiple personalities in the SQEs - Fix for a missing free of personality idr on exit - Removal of the spin-for-work optimization - Fix for next work lookup on request completion - Fix for non-vec read/write result progation in case of links - Fix for a fileset references on switch - Fix for a recvmsg/sendmsg 32-bit compatability mode * tag 'io_uring-5.6-2020-02-28' of git://git.kernel.dk/linux-block: io_uring: fix 32-bit compatability with sendmsg/recvmsg io_uring: define and set show_fdinfo only if procfs is enabled io_uring: drop file set ref put/get on switch io_uring: import_single_range() returns 0/-ERROR io_uring: pick up link work on submit reference drop io-wq: ensure work->task_pid is cleared on init io-wq: remove spin-for-work optimization io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLL io_uring: fix personality idr leak io_uring: handle multiple personalities in link chains
2020-02-28Merge tag 'zonefs-5.6-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs Pull zonefs fixes from Damien Le Moal: "Two fixes in here: - Revert the initial decision to silently ignore IOCB_NOWAIT for asynchronous direct IOs to sequential zone files. Instead, return an error to the user to signal that the feature is not supported (from Christoph) - A fix to zonefs Kconfig to select FS_IOMAP to avoid build failures if no other file system already selected this option (from Johannes)" * tag 'zonefs-5.6-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs: zonefs: select FS_IOMAP zonefs: fix IOCB_NOWAIT handling
2020-02-27io_uring: fix 32-bit compatability with sendmsg/recvmsgio_uring-5.6-2020-02-28Jens Axboe
We must set MSG_CMSG_COMPAT if we're in compatability mode, otherwise the iovec import for these commands will not do the right thing and fail the command with -EINVAL. Found by running the test suite compiled as 32-bit. Cc: stable@vger.kernel.org Fixes: aa1fa28fc73e ("io_uring: add support for recvmsg()") Fixes: 0fa03c624d8f ("io_uring: add support for sendmsg()") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-27io_uring: define and set show_fdinfo only if procfs is enabledTobias Klauser
Follow the pattern used with other *_show_fdinfo functions and only define and use io_uring_show_fdinfo and its helper functions if CONFIG_PROC_FS is set. Fixes: 87ce955b24c9 ("io_uring: add ->show_fdinfo() for the io_uring file descriptor") Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26io_uring: drop file set ref put/get on switchJens Axboe
Dan reports that he triggered a warning on ring exit doing some testing: percpu ref (io_file_data_ref_zero) <= 0 (0) after switching to atomic WARNING: CPU: 3 PID: 0 at lib/percpu-refcount.c:160 percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Modules linked in: CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.6.0-rc3+ #5648 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:percpu_ref_switch_to_atomic_rcu+0xe8/0xf0 Code: e7 ff 55 e8 eb d2 80 3d bd 02 d2 00 00 75 8b 48 8b 55 d8 48 c7 c7 e8 70 e6 81 c6 05 a9 02 d2 00 01 48 8b 75 e8 e8 3a d0 c5 ff <0f> 0b e9 69 ff ff ff 90 55 48 89 fd 53 48 89 f3 48 83 ec 28 48 83 RSP: 0018:ffffc90000110ef8 EFLAGS: 00010292 RAX: 0000000000000045 RBX: 7fffffffffffffff RCX: 0000000000000000 RDX: 0000000000000045 RSI: ffffffff825be7a5 RDI: ffffffff825bc32c RBP: ffff8881b75eac38 R08: 000000042364b941 R09: 0000000000000045 R10: ffffffff825beb40 R11: ffffffff825be78a R12: 0000607e46005aa0 R13: ffff888107dcdd00 R14: 0000000000000000 R15: 0000000000000009 FS: 0000000000000000(0000) GS:ffff8881b9d80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f49e6a5ea20 CR3: 00000001b747c004 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <IRQ> rcu_core+0x1e4/0x4d0 __do_softirq+0xdb/0x2f1 irq_exit+0xa0/0xb0 smp_apic_timer_interrupt+0x60/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:default_idle+0x23/0x170 Code: ff eb ab cc cc cc cc 0f 1f 44 00 00 41 54 55 53 65 8b 2d 10 96 92 7e 0f 1f 44 00 00 e9 07 00 00 00 0f 00 2d 21 d0 51 00 fb f4 <65> 8b 2d f6 95 92 7e 0f 1f 44 00 00 5b 5d 41 5c c3 65 8b 05 e5 95 Turns out that this is due to percpu_ref_switch_to_atomic() only grabbing a reference to the percpu refcount if it's not already in atomic mode. io_uring drops a ref and re-gets it when switching back to percpu mode. We attempt to protect against this with the FFD_F_ATOMIC bit, but that isn't reliable. We don't actually need to juggle these refcounts between atomic and percpu switch, we can just do them when we've switched to atomic mode. This removes the need for FFD_F_ATOMIC, which wasn't reliable. Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26io_uring: import_single_range() returns 0/-ERRORJens Axboe
Unlike the other core import helpers, import_single_range() returns 0 on success, not the length imported. This means that links that depend on the result of non-vec based IORING_OP_{READ,WRITE} that were added for 5.5 get errored when they should not be. Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26io_uring: pick up link work on submit reference dropJens Axboe
If work completes inline, then we should pick up a dependent link item in __io_queue_sqe() as well. If we don't do so, we're forced to go async with that item, which is suboptimal. This also fixes an issue with io_put_req_find_next(), which always looks up the next work item. That should only be done if we're dropping the last reference to the request, to prevent multiple lookups of the same work item. Outside of being a fix, this also enables a good cleanup series for 5.7, where we never have to pass 'nxt' around or into the work handlers. Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-26zonefs: select FS_IOMAPJohannes Thumshirn
Zonefs makes use of iomap internally, so it should also select iomap in Kconfig. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-26zonefs: fix IOCB_NOWAIT handlingChristoph Hellwig
IOCB_NOWAIT can't just be ignored as it breaks applications expecting it not to block. Just refuse the operation as applications must handle that (e.g. by falling back to a thread pool). Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
2020-02-25io-wq: ensure work->task_pid is cleared on initJens Axboe
We use ->task_pid for exit cancellation, but we need to ensure it's cleared to zero for io_req_work_grab_env() to do the right thing. Take a suggestion from Bart and clear the whole thing, just setting the function passed in. This makes it more future proof as well. Fixes: 36282881a795 ("io-wq: add io_wq_cancel_pid() to cancel based on a specific pid") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-25io-wq: remove spin-for-work optimizationJens Axboe
Andres reports that buffered IO seems to suck up more cycles than we would like, and he narrowed it down to the fact that the io-wq workers will briefly spin for more work on completion of a work item. This was a win on the networking side, but apparently some other cases take a hit because of it. Remove the optimization to avoid burning more CPU than we have to for disk IO. Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-25io_uring: fix poll_list race for SETUP_IOPOLL|SETUP_SQPOLLXiaoguang Wang
After making ext4 support iopoll method: let ext4_file_operations's iopoll method be iomap_dio_iopoll(), we found fio can easily hang in fio_ioring_getevents() with below fio job: rm -f testfile; sync; sudo fio -name=fiotest -filename=testfile -iodepth=128 -thread -rw=write -ioengine=io_uring -hipri=1 -sqthread_poll=1 -direct=1 -bs=4k -size=10G -numjobs=8 -runtime=2000 -group_reporting with IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL enabled. There are two issues that results in this hang, one reason is that when IORING_SETUP_SQPOLL and IORING_SETUP_IOPOLL are enabled, fio does not use io_uring_enter to get completed events, it relies on kernel io_sq_thread to poll for completed events. Another reason is that there is a race: when io_submit_sqes() in io_sq_thread() submits a batch of sqes, variable 'inflight' will record the number of submitted reqs, then io_sq_thread will poll for reqs which have been added to poll_list. But note, if some previous reqs have been punted to io worker, these reqs will won't be in poll_list timely. io_sq_thread() will only poll for a part of previous submitted reqs, and then find poll_list is empty, reset variable 'inflight' to be zero. If app just waits these deferred reqs and does not wake up io_sq_thread again, then hang happens. For app that entirely relies on io_sq_thread to poll completed requests, let io_iopoll_req_issued() wake up io_sq_thread properly when adding new element to poll_list, and when io_sq_thread prepares to sleep, check whether poll_list is empty again, if not empty, continue to poll. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-24cifs: Use #define in cifs_dbgJoe Perches
All other uses of cifs_dbg use defines so change this one. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2020-02-24cifs: fix rename() by ensuring source handle opened with DELETE bitAurelien Aptel
To rename a file in SMB2 we open it with the DELETE access and do a special SetInfo on it. If the handle is missing the DELETE bit the server will fail the SetInfo with STATUS_ACCESS_DENIED. We currently try to reuse any existing opened handle we have with cifs_get_writable_path(). That function looks for handles with WRITE access but doesn't check for DELETE, making rename() fail if it finds a handle to reuse. Simple reproducer below. To select handles with the DELETE bit, this patch adds a flag argument to cifs_get_writable_path() and find_writable_file() and the existing 'bool fsuid_only' argument is converted to a flag. The cifsFileInfo struct only stores the UNIX open mode but not the original SMB access flags. Since the DELETE bit is not mapped in that mode, this patch stores the access mask in cifs_fid on file open, which is accessible from cifsFileInfo. Simple reproducer: #include <stdio.h> #include <stdlib.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> #define E(s) perror(s), exit(1) int main(int argc, char *argv[]) { int fd, ret; if (argc != 3) { fprintf(stderr, "Usage: %s A B\n" "create&open A in write mode, " "rename A to B, close A\n", argv[0]); return 0; } fd = openat(AT_FDCWD, argv[1], O_WRONLY|O_CREAT|O_SYNC, 0666); if (fd == -1) E("openat()"); ret = rename(argv[1], argv[2]); if (ret) E("rename()"); ret = close(fd); if (ret) E("close()"); return ret; } $ gcc -o bugrename bugrename.c $ ./bugrename /mnt/a /mnt/b rename(): Permission denied Fixes: 8de9e86c67ba ("cifs: create a helper to find a writeable handle by path name") CC: Stable <stable@vger.kernel.org> Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
2020-02-24cifs: add missing mount option to /proc/mountsSteve French
We were not displaying the mount option "signloosely" in /proc/mounts for cifs mounts which some users found confusing recently Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com>
2020-02-24cifs: fix potential mismatch of UNC pathsPaulo Alcantara (SUSE)
Ensure that full_path is an UNC path that contains '\\' as delimiter, which is required by cifs_build_devname(). The build_path_from_dentry_optional_prefix() function may return a path with '/' as delimiter when using SMB1 UNIX extensions, for example. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
2020-02-24cifs: don't leak -EAGAIN for stat() during reconnectRonnie Sahlberg
If from cifs_revalidate_dentry_attr() the SMB2/QUERY_INFO call fails with an error, such as STATUS_SESSION_EXPIRED, causing the session to be reconnected it is possible we will leak -EAGAIN back to the application even for system calls such as stat() where this is not a valid error. Fix this by re-trying the operation from within cifs_revalidate_dentry_attr() if cifs_get_inode_info*() returns -EAGAIN. This fixes stat() and possibly also other system calls that uses cifs_revalidate_dentry*(). Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Reviewed-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org>
2020-02-24io_uring: fix personality idr leakJens Axboe
We somehow never free the idr, even though we init it for every ctx. Free it when the rest of the ring data is freed. Fixes: 071698e13ac6 ("io_uring: allow registering credentials") Reviewed-by: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-23io_uring: handle multiple personalities in link chainsJens Axboe
If we have a chain of requests and they don't all use the same credentials, then the head of the chain will be issued with the credentails of the tail of the chain. Ensure __io_queue_sqe() overrides the credentials, if they are different. Once we do that, we can clean up the creds handling as well, by only having io_submit_sqe() do the lookup of a personality. It doesn't need to assign it, since __io_queue_sqe() now always does the right thing. Fixes: 75c6a03904e0 ("io_uring: support using a registered personality for commands") Reported-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-23Merge tag 'for-5.6-rc2-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "These are fixes that were found during testing with help of error injection, plus some other stable material. There's a fixup to patch added to rc1 causing locking in wrong context warnings, tests found one more deadlock scenario. The patches are tagged for stable, two of them now in the queue but we'd like all three released at the same time. I'm not happy about fixes to fixes in such a fast succession during rcs, but I hope we found all the fallouts of commit 28553fa992cb ('Btrfs: fix race between shrinking truncate and fiemap')" * tag 'for-5.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof Btrfs: fix btrfs_wait_ordered_range() so that it waits for all ordered extents btrfs: fix bytes_may_use underflow in prealloc error condtition btrfs: handle logged extent failure properly btrfs: do not check delayed items are empty for single transaction cleanup btrfs: reset fs_root to NULL on error in open_ctree btrfs: destroy qgroup extent records on transaction abort
2020-02-23Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 fixes from Ted Ts'o: "More miscellaneous ext4 bug fixes (all stable fodder)" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: fix mount failure with quota configured as module jbd2: fix ocfs2 corrupt when clearing block group bits ext4: fix race between writepages and enabling EXT4_EXTENTS_FL ext4: rename s_journal_flag_rwsem to s_writepages_rwsem ext4: fix potential race between s_flex_groups online resizing and access ext4: fix potential race between s_group_info online resizing and access ext4: fix potential race between online resizing and write operations ext4: add cond_resched() to __ext4_find_entry() ext4: fix a data race in EXT4_I(inode)->i_disksize