summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-09-15fuse: flush extending writesMiklos Szeredi
commit 59bda8ecee2ffc6a602b7bf2b9e43ca669cdbdcd upstream. Callers of fuse_writeback_range() assume that the file is ready for modification by the server in the supplied byte range after the call returns. If there's a write that extends the file beyond the end of the supplied range, then the file needs to be extended to at least the end of the range, but currently that's not done. There are at least two cases where this can cause problems: - copy_file_range() will return short count if the file is not extended up to end of the source range. - FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE will not extend the file, hence the region may not be fully allocated. Fix by flushing writes from the start of the range up to the end of the file. This could be optimized if the writes are non-extending, etc, but it's probably not worth the trouble. Fixes: a2bc92362941 ("fuse: fix copy_file_range() in the writeback case") Fixes: 6b1bdb56b17c ("fuse: allow fallocate(FALLOC_FL_ZERO_RANGE)") Cc: <stable@vger.kernel.org> # v5.2 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15fuse: truncate pagecache on atomic_o_truncMiklos Szeredi
commit 76224355db7570cbe6b6f75c8929a1558828dd55 upstream. fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in fuse_launder_page() triggered by invalidate_inode_pages2(). Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE or fc->writeback cache, so do it unconditionally. Reported-by: Xie Yongji <xieyongji@bytedance.com> Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com Fixes: e4648309b85a ("fuse: truncate pending writes on O_TRUNC") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15io_uring: IORING_OP_WRITE needs hash_reg_file setJens Axboe
commit 7b3188e7ed54102a5dcc73d07727f41fb528f7c8 upstream. During some testing, it became evident that using IORING_OP_WRITE doesn't hash buffered writes like the other writes commands do. That's simply an oversight, and can cause performance regressions when doing buffered writes with this command. Correct that and add the flag, so that buffered writes are correctly hashed when using the non-iovec based write command. Cc: stable@vger.kernel.org Fixes: 3a6820f2bb8a ("io_uring: add non-vectored read/write commands") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15f2fs: guarantee to write dirty data when enabling checkpoint backJaegeuk Kim
commit dddd3d65293a52c2c3850c19b1e5115712e534d8 upstream. We must flush all the dirty data when enabling checkpoint back. Let's guarantee that first by adding a retry logic on sync_inodes_sb(). In addition to that, this patch adds to flush data in fsync when checkpoint is disabled, which can mitigate the sync_inodes_sb() failures in advance. Reviewed-by: Chao Yu <chao@kernel.org> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-15CIFS: Fix a potencially linear read overflowLen Baker
[ Upstream commit f980d055a0f858d73d9467bb0b570721bbfcdfb8 ] strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated. Also, the strnlen() call does not avoid the read overflow in the strlcpy function when a not NUL-terminated string is passed. So, replace this block by a call to kstrndup() that avoids this type of overflow and does the same. Fixes: 066ce6899484d ("cifs: rename cifs_strlcpy_to_host and make it use new functions") Signed-off-by: Len Baker <len.baker@gmx.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15gfs2: init system threads before freeze lockBob Peterson
[ Upstream commit a28dc123fa66ba7f3eca7cffc4b01d96bfd35c27 ] Patch 96b1454f2e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions") changed the gfs2 mount sequence so that it holds the freeze lock before calling gfs2_make_fs_rw. Before this patch, gfs2_make_fs_rw called init_threads to initialize the quotad and logd threads. That is a problem if the system needs to withdraw due to IO errors early in the mount sequence, for example, while initializing the system statfs inode: 1. An IO error causes the statfs glock to not sync properly after recovery, and leaves items on the ail list. 2. The leftover items on the ail list causes its do_xmote call to fail, which makes it want to withdraw. But since the glock code cannot withdraw (because the withdraw sequence uses glocks) it relies upon the logd daemon to initiate the withdraw. 3. The withdraw can never be performed by the logd daemon because all this takes place before the logd daemon is started. This patch moves function init_threads from super.c to ops_fstype.c and it changes gfs2_fill_super to start its threads before holding the freeze lock, and if there's an error, stop its threads after releasing it. This allows the logd to run unblocked by the freeze lock. Thus, the logd daemon can perform its withdraw sequence properly. Fixes: 96b1454f2e8e ("gfs2: move freeze glock outside the make_fs_rw and _ro functions") Signed-off-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15mm/swap: consider max pages in iomap_swapfile_add_extentXu Yu
[ Upstream commit 36ca7943ac18aebf8aad4c50829eb2ea5ec847df ] When the max pages (last_page in the swap header + 1) is smaller than the total pages (inode size) of the swapfile, iomap_swapfile_activate overwrites sis->max with total pages. However, frontswap_map is a swap page state bitmap allocated using the initial sis->max page count read from the swap header. If swapfile activation increases sis->max, it's possible for the frontswap code to walk off the end of the bitmap, thereby corrupting kernel memory. [djwong: modify the description a bit; the original paragraph reads: "However, frontswap_map is allocated using max pages. When test and clear the sis offset, which is larger than max pages, of frontswap_map in __frontswap_invalidate_page(), neighbors of frontswap_map may be overwritten, i.e., slab is polluted." Note also that this bug resulted in a behavioral change: activating a swap file that was formatted and later extended results in all pages being activated, not the number of pages recorded in the swap header.] This fixes the issue by considering the limitation of max pages of swap info in iomap_swapfile_add_extent(). To reproduce the case, compile kernel with slub RED ZONE, then run test: $ sudo stress-ng -a 1 -x softlockup,resources -t 72h --metrics --times \ --verify -v -Y /root/tmpdir/stress-ng/stress-statistic-12.yaml \ --log-file /root/tmpdir/stress-ng/stress-logfile-12.txt \ --temp-path /root/tmpdir/stress-ng/ We'll get the error log as below: [ 1151.015141] ============================================================================= [ 1151.016489] BUG kmalloc-16 (Not tainted): Right Redzone overwritten [ 1151.017486] ----------------------------------------------------------------------------- [ 1151.017486] [ 1151.018997] Disabling lock debugging due to kernel taint [ 1151.019873] INFO: 0x0000000084e43932-0x0000000098d17cae @offset=7392. First byte 0x0 instead of 0xcc [ 1151.021303] INFO: Allocated in __do_sys_swapon+0xcf6/0x1170 age=43417 cpu=9 pid=3816 [ 1151.022538] __slab_alloc+0xe/0x20 [ 1151.023069] __kmalloc_node+0xfd/0x4b0 [ 1151.023704] __do_sys_swapon+0xcf6/0x1170 [ 1151.024346] do_syscall_64+0x33/0x40 [ 1151.024925] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1151.025749] INFO: Freed in put_cred_rcu+0xa1/0xc0 age=43424 cpu=3 pid=2041 [ 1151.026889] kfree+0x276/0x2b0 [ 1151.027405] put_cred_rcu+0xa1/0xc0 [ 1151.027949] rcu_do_batch+0x17d/0x410 [ 1151.028566] rcu_core+0x14e/0x2b0 [ 1151.029084] __do_softirq+0x101/0x29e [ 1151.029645] asm_call_irq_on_stack+0x12/0x20 [ 1151.030381] do_softirq_own_stack+0x37/0x40 [ 1151.031037] do_softirq.part.15+0x2b/0x30 [ 1151.031710] __local_bh_enable_ip+0x4b/0x50 [ 1151.032412] copy_fpstate_to_sigframe+0x111/0x360 [ 1151.033197] __setup_rt_frame+0xce/0x480 [ 1151.033809] arch_do_signal+0x1a3/0x250 [ 1151.034463] exit_to_user_mode_prepare+0xcf/0x110 [ 1151.035242] syscall_exit_to_user_mode+0x27/0x190 [ 1151.035970] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1151.036795] INFO: Slab 0x000000003b9de4dc objects=44 used=9 fp=0x00000000539e349e flags=0xfffffc0010201 [ 1151.038323] INFO: Object 0x000000004855ba01 @offset=7376 fp=0x0000000000000000 [ 1151.038323] [ 1151.039683] Redzone 000000008d0afd3d: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc ................ [ 1151.041180] Object 000000004855ba01: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ [ 1151.042714] Redzone 0000000084e43932: 00 00 00 c0 cc cc cc cc ........ [ 1151.044120] Padding 000000000864c042: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZZZZZZZZZ [ 1151.045615] CPU: 5 PID: 3816 Comm: stress-ng Tainted: G B 5.10.50+ #7 [ 1151.046846] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 1151.048633] Call Trace: [ 1151.049072] dump_stack+0x57/0x6a [ 1151.049585] check_bytes_and_report+0xed/0x110 [ 1151.050320] check_object+0x1eb/0x290 [ 1151.050924] ? __x64_sys_swapoff+0x39a/0x540 [ 1151.051646] free_debug_processing+0x151/0x350 [ 1151.052333] __slab_free+0x21a/0x3a0 [ 1151.052938] ? _cond_resched+0x2d/0x40 [ 1151.053529] ? __vunmap+0x1de/0x220 [ 1151.054139] ? __x64_sys_swapoff+0x39a/0x540 [ 1151.054796] ? kfree+0x276/0x2b0 [ 1151.055307] kfree+0x276/0x2b0 [ 1151.055832] __x64_sys_swapoff+0x39a/0x540 [ 1151.056466] do_syscall_64+0x33/0x40 [ 1151.057084] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1151.057866] RIP: 0033:0x150340b0ffb7 [ 1151.058481] Code: Unable to access opcode bytes at RIP 0x150340b0ff8d. [ 1151.059537] RSP: 002b:00007fff7f4ee238 EFLAGS: 00000246 ORIG_RAX: 00000000000000a8 [ 1151.060768] RAX: ffffffffffffffda RBX: 00007fff7f4ee66c RCX: 0000150340b0ffb7 [ 1151.061904] RDX: 000000000000000a RSI: 0000000000018094 RDI: 00007fff7f4ee860 [ 1151.063033] RBP: 00007fff7f4ef980 R08: 0000000000000000 R09: 0000150340a672bd [ 1151.064135] R10: 00007fff7f4edca0 R11: 0000000000000246 R12: 0000000000018094 [ 1151.065253] R13: 0000000000000005 R14: 000000000160d930 R15: 00007fff7f4ee66c [ 1151.066413] FIX kmalloc-16: Restoring 0x0000000084e43932-0x0000000098d17cae=0xcc [ 1151.066413] [ 1151.067890] FIX kmalloc-16: Object at 0x000000004855ba01 not freed Fixes: 67482129cdab ("iomap: add a swapfile activation function") Fixes: a45c0eccc564 ("iomap: move the swapfile code into a separate file") Signed-off-by: Gang Deng <gavin.dg@linux.alibaba.com> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15nfsd4: Fix forced-expiry lockingJ. Bruce Fields
[ Upstream commit f7104cc1a9159cd0d3e8526cb638ae0301de4b61 ] This should use the network-namespace-wide client_lock, not the per-client cl_lock. You shouldn't see any bugs unless you're actually using the forced-expiry interface introduced by 89c905beccbb. Fixes: 89c905beccbb "nfsd: allow forced expiration of NFSv4 clients" Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15lockd: Fix invalid lockowner cast after vfs_test_lockBenjamin Coddington
[ Upstream commit cd2d644ddba183ec7b451b7c20d5c7cc06fcf0d7 ] After calling vfs_test_lock() the pointer to a conflicting lock can be returned, and that lock is not guarunteed to be owned by nlm. In that case, we cannot cast it to struct nlm_lockowner. Instead return the pid of that conflicting lock. Fixes: 646d73e91b42 ("lockd: Show pid of lockd for remote locks") Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15debugfs: Return error during {full/open}_proxy_open() on rmmodSven Eckelmann
[ Upstream commit 112cedc8e600b668688eb809bf11817adec58ddc ] If a kernel module gets unloaded then it printed report about a leak before commit 275678e7a9be ("debugfs: Check module state before warning in {full/open}_proxy_open()"). An additional check was added in this commit to avoid this printing. But it was forgotten that the function must return an error in this case because it was not actually opened. As result, the systems started to crash or to hang when a module was unloaded while something was trying to open a file. Fixes: 275678e7a9be ("debugfs: Check module state before warning in {full/open}_proxy_open()") Cc: Taehee Yoo <ap420073@gmail.com> Reported-by: Mário Lopes <ml@simonwunderlich.de> Signed-off-by: Sven Eckelmann <sven@narfation.org> Link: https://lore.kernel.org/r/20210802162444.7848-1-sven@narfation.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15gfs2: Fix memory leak of object lsi on error return pathColin Ian King
[ Upstream commit a6579cbfd7216b071008db13360c322a6b21400b ] In the case where IS_ERR(lsi->si_sc_inode) is true the error exit path to free_local does not kfree the allocated object lsi leading to a memory leak. Fix this by kfree'ing lst before taking the error exit path. Addresses-Coverity: ("Resource leak") Fixes: 97fd734ba17e ("gfs2: lookup local statfs inodes prior to journal recovery") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15udf_get_extendedattr() had no boundary checks.Stian Skjelstad
[ Upstream commit 58bc6d1be2f3b0ceecb6027dfa17513ec6aa2abb ] When parsing the ExtendedAttr data, malicous or corrupt attribute length could cause kernel hangs and buffer overruns in some special cases. Link: https://lore.kernel.org/r/20210822093332.25234-1-stian.skjelstad@gmail.com Signed-off-by: Stian Skjelstad <stian.skjelstad@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15fcntl: fix potential deadlock for &fasync_struct.fa_lockDesmond Cheong Zhi Xi
[ Upstream commit 2f488f698fda820f8e6fa0407630154eceb145d6 ] There is an existing lock hierarchy of &dev->event_lock --> &fasync_struct.fa_lock --> &f->f_owner.lock from the following call chain: input_inject_event(): spin_lock_irqsave(&dev->event_lock,...); input_handle_event(): input_pass_values(): input_to_handler(): evdev_events(): evdev_pass_values(): spin_lock(&client->buffer_lock); __pass_event(): kill_fasync(): kill_fasync_rcu(): read_lock(&fa->fa_lock); send_sigio(): read_lock_irqsave(&fown->lock,...); &dev->event_lock is HARDIRQ-safe, so interrupts have to be disabled while grabbing &fasync_struct.fa_lock, otherwise we invert the lock hierarchy. However, since kill_fasync which calls kill_fasync_rcu is an exported symbol, it may not necessarily be called with interrupts disabled. As kill_fasync_rcu may be called with interrupts disabled (for example, in the call chain above), we replace calls to read_lock/read_unlock on &fasync_struct.fa_lock in kill_fasync_rcu with read_lock_irqsave/read_unlock_irqrestore. Signed-off-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15isofs: joliet: Fix iocharset=utf8 mount optionPali Rohár
[ Upstream commit 28ce50f8d96ec9035f60c9348294ea26b94db944 ] Currently iocharset=utf8 mount option is broken. To use UTF-8 as iocharset, it is required to use utf8 mount option. Fix iocharset=utf8 mount option to use be equivalent to the utf8 mount option. If UTF-8 as iocharset is used then s_nls_iocharset is set to NULL. So simplify code around, remove s_utf8 field as to distinguish between UTF-8 and non-UTF-8 it is needed just to check if s_nls_iocharset is set to NULL or not. Link: https://lore.kernel.org/r/20210808162453.1653-5-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15udf: Fix iocharset=utf8 mount optionPali Rohár
[ Upstream commit b645333443712d2613e4e863f81090d5dc509657 ] Currently iocharset=utf8 mount option is broken. To use UTF-8 as iocharset, it is required to use utf8 mount option. Fix iocharset=utf8 mount option to use be equivalent to the utf8 mount option. If UTF-8 as iocharset is used then s_nls_map is set to NULL. So simplify code around, remove UDF_FLAG_NLS_MAP and UDF_FLAG_UTF8 flags as to distinguish between UTF-8 and non-UTF-8 it is needed just to check if s_nls_map set to NULL or not. Link: https://lore.kernel.org/r/20210808162453.1653-4-pali@kernel.org Signed-off-by: Pali Rohár <pali@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-15udf: Check LVID earlierJan Kara
[ Upstream commit 781d2a9a2fc7d0be53a072794dc03ef6de770f3d ] We were checking validity of LVID entries only when getting implementation use information from LVID in udf_sb_lvidiu(). However if the LVID is suitably corrupted, it can cause problems also to code such as udf_count_free() which doesn't use udf_sb_lvidiu(). So check validity of LVID already when loading it from the disk and just disable LVID altogether when it is not valid. Reported-by: syzbot+7fbfe5fed73ebb675748@syzkaller.appspotmail.com Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-08fuse: fix illegal access to inode with reused nodeidAmir Goldstein
commit 15db16837a35d8007cb8563358787412213db25e upstream. Server responds to LOOKUP and other ops (READDIRPLUS/CREATE/MKNOD/...) with ourarg containing nodeid and generation. If a fuse inode is found in inode cache with the same nodeid but different generation, the existing fuse inode should be unhashed and marked "bad" and a new inode with the new generation should be hashed instead. This can happen, for example, with passhrough fuse filesystem that returns the real filesystem ino/generation on lookup and where real inode numbers can get recycled due to real files being unlinked not via the fuse passthrough filesystem. With current code, this situation will not be detected and an old fuse dentry that used to point to an older generation real inode, can be used to access a completely new inode, which should be accessed only via the new dentry. Note that because the FORGET message carries the nodeid w/o generation, the server should wait to get FORGET counts for the nlookup counts of the old and reused inodes combined, before it can free the resources associated to that nodeid. Stable backport notes: * This is not a regression. The bug has been in fuse forever, but only a certain class of low level fuse filesystems can trigger this bug * Because there is no way to check if this fix is applied in runtime, libfuse test_examples.py tests this fix with hardcoded check for kernel version >= 5.14 * After backport to stable kernel(s), the libfuse test can be updated to also check minimal stable kernel version(s) * Depends on "fuse: fix bad inode" which is already applied to stable kernels v5.4.y and v5.10.y * Required backporting helper inode_wrong_type() Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-fsdevel/CAOQ4uxi8DymG=JO_sAU+wS8akFdzh+PuXwW3Ebgahd2Nwnh7zA@mail.gmail.com/ Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08new helper: inode_wrong_type()Al Viro
commit 6e3e2c4362e41a2f18e3f7a5ad81bd2f49a47b85 upstream. inode_wrong_type(inode, mode) returns true if setting inode->i_mode to given value would've changed the inode type. We have enough of those checks open-coded to make a helper worthwhile. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08ceph: fix possible null-pointer dereference in ceph_mdsmap_decode()Tuo Li
[ Upstream commit a9e6ffbc5b7324b6639ee89028908b1e91ceed51 ] kcalloc() is called to allocate memory for m->m_info, and if it fails, ceph_mdsmap_destroy() behind the label out_err will be called: ceph_mdsmap_destroy(m); In ceph_mdsmap_destroy(), m->m_info is dereferenced through: kfree(m->m_info[i].export_targets); To fix this possible null-pointer dereference, check m->m_info before the for loop to free m->m_info[i].export_targets. [ jlayton: fix up whitespace damage only kfree(m->m_info) if it's non-NULL ] Reported-by: TOTE Robot <oslab@tsinghua.edu.cn> Signed-off-by: Tuo Li <islituo@gmail.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-09-08Revert "Add a reference to ucounts for each cred"Greg Kroah-Hartman
This reverts commit b2c4d9a33cc2dec7466f97eba2c4dd571ad798a5 which is commit 905ae01c4ae2ae3df05bb141801b1db4b7d83c61 upstream. This commit should not have been applied to the 5.10.y stable tree, so revert it. Reported-by: "Eric W. Biederman" <ebiederm@xmission.com> Link: https://lore.kernel.org/r/87v93k4bl6.fsf@disp2133 Cc: Alexey Gladkov <legion@kernel.org> Cc: Sasha Levin <sashal@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08ubifs: report correct st_size for encrypted symlinksEric Biggers
commit 064c734986011390b4d111f1a99372b7f26c3850 upstream. The stat() family of syscalls report the wrong size for encrypted symlinks, which has caused breakage in several userspace programs. Fix this by calling fscrypt_symlink_getattr() after ubifs_getattr() for encrypted symlinks. This function computes the correct size by reading and decrypting the symlink target (if it's not already cached). For more details, see the commit which added fscrypt_symlink_getattr(). Fixes: ca7f85be8d6c ("ubifs: Add support for encrypted symlinks") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210702065350.209646-5-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08f2fs: report correct st_size for encrypted symlinksEric Biggers
commit 461b43a8f92e68e96c4424b31e15f2b35f1bbfa9 upstream. The stat() family of syscalls report the wrong size for encrypted symlinks, which has caused breakage in several userspace programs. Fix this by calling fscrypt_symlink_getattr() after f2fs_getattr() for encrypted symlinks. This function computes the correct size by reading and decrypting the symlink target (if it's not already cached). For more details, see the commit which added fscrypt_symlink_getattr(). Fixes: cbaf042a3cc6 ("f2fs crypto: add symlink encryption") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210702065350.209646-4-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08ext4: report correct st_size for encrypted symlinksEric Biggers
commit 8c4bca10ceafc43b1ca0a9fab5fa27e13cbce99e upstream. The stat() family of syscalls report the wrong size for encrypted symlinks, which has caused breakage in several userspace programs. Fix this by calling fscrypt_symlink_getattr() after ext4_getattr() for encrypted symlinks. This function computes the correct size by reading and decrypting the symlink target (if it's not already cached). For more details, see the commit which added fscrypt_symlink_getattr(). Fixes: f348c252320b ("ext4 crypto: add symlink encryption") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210702065350.209646-3-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08fscrypt: add fscrypt_symlink_getattr() for computing st_sizeEric Biggers
commit d18760560593e5af921f51a8c9b64b6109d634c2 upstream. Add a helper function fscrypt_symlink_getattr() which will be called from the various filesystems' ->getattr() methods to read and decrypt the target of encrypted symlinks in order to report the correct st_size. Detailed explanation: As required by POSIX and as documented in various man pages, st_size for a symlink is supposed to be the length of the symlink target. Unfortunately, st_size has always been wrong for encrypted symlinks because st_size is populated from i_size from disk, which intentionally contains the length of the encrypted symlink target. That's slightly greater than the length of the decrypted symlink target (which is the symlink target that userspace usually sees), and usually won't match the length of the no-key encoded symlink target either. This hadn't been fixed yet because reporting the correct st_size would require reading the symlink target from disk and decrypting or encoding it, which historically has been considered too heavyweight to do in ->getattr(). Also historically, the wrong st_size had only broken a test (LTP lstat03) and there were no known complaints from real users. (This is probably because the st_size of symlinks isn't used too often, and when it is, typically it's for a hint for what buffer size to pass to readlink() -- which a slightly-too-large size still works for.) However, a couple things have changed now. First, there have recently been complaints about the current behavior from real users: - Breakage in rpmbuild: https://github.com/rpm-software-management/rpm/issues/1682 https://github.com/google/fscrypt/issues/305 - Breakage in toybox cpio: https://www.mail-archive.com/toybox@lists.landley.net/msg07193.html - Breakage in libgit2: https://issuetracker.google.com/issues/189629152 (on Android public issue tracker, requires login) Second, we now cache decrypted symlink targets in ->i_link. Therefore, taking the performance hit of reading and decrypting the symlink target in ->getattr() wouldn't be as big a deal as it used to be, since usually it will just save having to do the same thing later. Also note that eCryptfs ended up having to read and decrypt symlink targets in ->getattr() as well, to fix this same issue; see commit 3a60a1686f0d ("eCryptfs: Decrypt symlink target for stat size"). So, let's just bite the bullet, and read and decrypt the symlink target in ->getattr() in order to report the correct st_size. Add a function fscrypt_symlink_getattr() which the filesystems will call to do this. (Alternatively, we could store the decrypted size of symlinks on-disk. But there isn't a great place to do so, and encryption is meant to hide the original size to some extent; that property would be lost.) Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210702065350.209646-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-08ext4: fix race writing to an inline_data file while its xattrs are changingTheodore Ts'o
commit a54c4613dac1500b40e4ab55199f7c51f028e848 upstream. The location of the system.data extended attribute can change whenever xattr_sem is not taken. So we need to recalculate the i_inline_off field since it mgiht have changed between ext4_write_begin() and ext4_write_end(). This means that caching i_inline_off is probably not helpful, so in the long run we should probably get rid of it and shrink the in-memory ext4 inode slightly, but let's fix the race the simple way for now. Cc: stable@kernel.org Fixes: f19d5870cbf72 ("ext4: add normal write support for inline data") Reported-by: syzbot+13146364637c7363a7de@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03btrfs: fix NULL pointer dereference when deleting device by invalid idQu Wenruo
commit e4571b8c5e9ffa1e85c0c671995bd4dcc5c75091 upstream. [BUG] It's easy to trigger NULL pointer dereference, just by removing a non-existing device id: # mkfs.btrfs -f -m single -d single /dev/test/scratch1 \ /dev/test/scratch2 # mount /dev/test/scratch1 /mnt/btrfs # btrfs device remove 3 /mnt/btrfs Then we have the following kernel NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 649 Comm: btrfs Not tainted 5.14.0-rc3-custom+ #35 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_rm_device+0x4de/0x6b0 [btrfs] btrfs_ioctl+0x18bb/0x3190 [btrfs] ? lock_is_held_type+0xa5/0x120 ? find_held_lock.constprop.0+0x2b/0x80 ? do_user_addr_fault+0x201/0x6a0 ? lock_release+0xd2/0x2d0 ? __x64_sys_ioctl+0x83/0xb0 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae [CAUSE] Commit a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") moves the "missing" device path check into btrfs_rm_device(). But btrfs_rm_device() itself can have case where it only receives @devid, with NULL as @device_path. In that case, calling strcmp() on NULL will trigger the NULL pointer dereference. Before that commit, we handle the "missing" case inside btrfs_find_device_by_devspec(), which will not check @device_path at all if @devid is provided, thus no way to trigger the bug. [FIX] Before calling strcmp(), also make sure @device_path is not NULL. Fixes: a27a94c2b0c7 ("btrfs: Make btrfs_find_device_by_devspec return btrfs_device directly") CC: stable@vger.kernel.org # 5.4+ Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03pipe: do FASYNC notifications for every pipe IO, not just state changesLinus Torvalds
commit fe67f4dd8daa252eb9aa7acb61555f3cc3c1ce4c upstream. It turns out that the SIGIO/FASYNC situation is almost exactly the same as the EPOLLET case was: user space really wants to be notified after every operation. Now, in a perfect world it should be sufficient to only notify user space on "state transitions" when the IO state changes (ie when a pipe goes from unreadable to readable, or from unwritable to writable). User space should then do as much as possible - fully emptying the buffer or what not - and we'll notify it again the next time the state changes. But as with EPOLLET, we have at least one case (stress-ng) where the kernel sent SIGIO due to the pipe being marked for asynchronous notification, but the user space signal handler then didn't actually necessarily read it all before returning (it read more than what was written, but since there could be multiple writes, it could leave data pending). The user space code then expected to get another SIGIO for subsequent writes - even though the pipe had been readable the whole time - and would only then read more. This is arguably a user space bug - and Colin King already fixed the stress-ng code in question - but the kernel regression rules are clear: it doesn't matter if kernel people think that user space did something silly and wrong. What matters is that it used to work. So if user space depends on specific historical kernel behavior, it's a regression when that behavior changes. It's on us: we were silly to have that non-optimal historical behavior, and our old kernel behavior was what user space was tested against. Because of how the FASYNC notification was tied to wakeup behavior, this was first broken by commits f467a6a66419 and 1b6b26ae7053 ("pipe: fix and clarify pipe read/write wakeup logic"), but at the time it seems nobody noticed. Probably because the stress-ng problem case ends up being timing-dependent too. It was then unwittingly fixed by commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") only to be broken again when by commit 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads"). And at that point the kernel test robot noticed the performance refression in the stress-ng.sigio.ops_per_sec case. So the "Fixes" tag below is somewhat ad hoc, but it matches when the issue was noticed. Fix it for good (knock wood) by simply making the kill_fasync() case separate from the wakeup case. FASYNC is quite rare, and we clearly shouldn't even try to use the "avoid unnecessary wakeups" logic for it. Link: https://lore.kernel.org/lkml/20210824151337.GC27667@xsang-OptiPlex-9020/ Fixes: 3b844826b6c6 ("pipe: avoid unnecessary EPOLLET wakeups under normal loads") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Oliver Sang <oliver.sang@intel.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Colin Ian King <colin.king@canonical.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03pipe: avoid unnecessary EPOLLET wakeups under normal loadsLinus Torvalds
commit 3b844826b6c6affa80755254da322b017358a2f4 upstream. I had forgotten just how sensitive hackbench is to extra pipe wakeups, and commit 3a34b13a88ca ("pipe: make pipe writes always wake up readers") ended up causing a quite noticeable regression on larger machines. Now, hackbench isn't necessarily a hugely meaningful benchmark, and it's not clear that this matters in real life all that much, but as Mel points out, it's used often enough when comparing kernels and so the performance regression shows up like a sore thumb. It's easy enough to fix at least for the common cases where pipes are used purely for data transfer, and you never have any exciting poll usage at all. So set a special 'poll_usage' flag when there is polling activity, and make the ugly "EPOLLET has crazy legacy expectations" semantics explicit to only that case. I would love to limit it to just the broken EPOLLET case, but the pipe code can't see the difference between epoll and regular select/poll, so any non-read/write waiting will trigger the extra wakeup behavior. That is sufficient for at least the hackbench case. Apart from making the odd extra wakeup cases more explicitly about EPOLLET, this also makes the extra wakeup be at the _end_ of the pipe write, not at the first write chunk. That is actually much saner semantics (as much as you can call any of the legacy edge-triggered expectations for EPOLLET "sane") since it means that you know the wakeup will happen once the write is done, rather than possibly in the middle of one. [ For stable people: I'm putting a "Fixes" tag on this, but I leave it up to you to decide whether you actually want to backport it or not. It likely has no impact outside of synthetic benchmarks - Linus ] Link: https://lore.kernel.org/lkml/20210802024945.GA8372@xsang-OptiPlex-9020/ Fixes: 3a34b13a88ca ("pipe: make pipe writes always wake up readers") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: Sandeep Patil <sspatil@android.com> Tested-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03btrfs: fix race between marking inode needs to be logged and log syncingFilipe Manana
commit bc0939fcfab0d7efb2ed12896b1af3d819954a14 upstream. We have a race between marking that an inode needs to be logged, either at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between btrfs_sync_log(). The following steps describe how the race happens. 1) We are at transaction N; 2) Inode I was previously fsynced in the current transaction so it has: inode->logged_trans set to N; 3) The inode's root currently has: root->log_transid set to 1 root->last_log_commit set to 0 Which means only one log transaction was committed to far, log transaction 0. When a log tree is created we set ->log_transid and ->last_log_commit of its parent root to 0 (at btrfs_add_log_tree()); 4) One more range of pages is dirtied in inode I; 5) Some task A starts an fsync against some other inode J (same root), and so it joins log transaction 1. Before task A calls btrfs_sync_log()... 6) Task B starts an fsync against inode I, which currently has the full sync flag set, so it starts delalloc and waits for the ordered extent to complete before calling btrfs_inode_in_log() at btrfs_sync_file(); 7) During ordered extent completion we have btrfs_update_inode() called against inode I, which in turn calls btrfs_set_inode_last_trans(), which does the following: spin_lock(&inode->lock); inode->last_trans = trans->transaction->transid; inode->last_sub_trans = inode->root->log_transid; inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); So ->last_trans is set to N and ->last_sub_trans set to 1. But before setting ->last_log_commit... 8) Task A is at btrfs_sync_log(): - it increments root->log_transid to 2 - starts writeback for all log tree extent buffers - waits for the writeback to complete - writes the super blocks - updates root->last_log_commit to 1 It's a lot of slow steps between updating root->log_transid and root->last_log_commit; 9) The task doing the ordered extent completion, currently at btrfs_set_inode_last_trans(), then finally runs: inode->last_log_commit = inode->root->last_log_commit; spin_unlock(&inode->lock); Which results in inode->last_log_commit being set to 1. The ordered extent completes; 10) Task B is resumed, and it calls btrfs_inode_in_log() which returns true because we have all the following conditions met: inode->logged_trans == N which matches fs_info->generation && inode->last_subtrans (1) <= inode->last_log_commit (1) && inode->last_subtrans (1) <= root->last_log_commit (1) && list inode->extent_tree.modified_extents is empty And as a consequence we return without logging the inode, so the existing logged version of the inode does not point to the extent that was written after the previous fsync. It should be impossible in practice for one task be able to do so much progress in btrfs_sync_log() while another task is at btrfs_set_inode_last_trans() right after it reads root->log_transid and before it reads root->last_log_commit. Even if kernel preemption is enabled we know the task at btrfs_set_inode_last_trans() can not be preempted because it is holding the inode's spinlock. However there is another place where we do the same without holding the spinlock, which is in the memory mapped write path at: vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf) { (...) BTRFS_I(inode)->last_trans = fs_info->generation; BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid; BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit; (...) So with preemption happening after setting ->last_sub_trans and before setting ->last_log_commit, it is less of a stretch to have another task do enough progress at btrfs_sync_log() such that the task doing the memory mapped write ends up with ->last_sub_trans and ->last_log_commit set to the same value. It is still a big stretch to get there, as the task doing btrfs_sync_log() has to start writeback, wait for its completion and write the super blocks. So fix this in two different ways: 1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the value of ->last_sub_trans minus 1; 2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just like we do for buffered and direct writes at btrfs_file_write_iter(), which is all we need to make sure multiple writes and fsyncs to an inode in the same transaction never result in an fsync missing that the inode changed and needs to be logged. Turn this into a helper function and use it both at btrfs_page_mkwrite() and at btrfs_file_write_iter() - this also fixes the problem that at btrfs_page_mkwrite() we were setting those fields without the protection of the inode's spinlock. This is an extremely unlikely race to happen in practice. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03Revert "btrfs: compression: don't try to compress if we don't have enough pages"Qu Wenruo
commit 4e9655763b82a91e4c341835bb504a2b1590f984 upstream. This reverts commit f2165627319ffd33a6217275e5690b1ab5c45763. [BUG] It's no longer possible to create compressed inline extent after commit f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages"). [CAUSE] For compression code, there are several possible reasons we have a range that needs to be compressed while it's no more than one page. - Compressed inline write The data is always smaller than one sector and the test lacks the condition to properly recognize a non-inline extent. - Compressed subpage write For the incoming subpage compressed write support, we require page alignment of the delalloc range. And for 64K page size, we can compress just one page into smaller sectors. For those reasons, the requirement for the data to be more than one page is not correct, and is already causing regression for compressed inline data writeback. The idea of skipping one page to avoid wasting CPU time could be revisited in the future. [FIX] Fix it by reverting the offending commit. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Link: https://lore.kernel.org/linux-btrfs/afa2742.c084f5d6.17b6b08dffc@tnonline.net Fixes: f2165627319f ("btrfs: compression: don't try to compress if we don't have enough pages") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03ceph: correctly handle releasing an embedded cap flushXiubo Li
commit b2f9fa1f3bd8846f50b355fc2168236975c4d264 upstream. The ceph_cap_flush structures are usually dynamically allocated, but the ceph_cap_snap has an embedded one. When force umounting, the client will try to remove all the session caps. During this, it will free them, but that should not be done with the ones embedded in a capsnap. Fix this by adding a new boolean that indicates that the cap flush is embedded in a capsnap, and skip freeing it if that's set. At the same time, switch to using list_del_init() when detaching the i_list and g_list heads. It's possible for a forced umount to remove these objects but then handle_cap_flushsnap_ack() races in and does the list_del_init() again, corrupting memory. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/52283 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-09-03ovl: fix uninitialized pointer read in ovl_lookup_real_one()Miklos Szeredi
[ Upstream commit 580c610429b3994e8db24418927747cf28443cde ] One error path can result in release_dentry_name_snapshot() being called before "name" was initialized by take_dentry_name_snapshot(). Fix by moving the release_dentry_name_snapshot() to immediately after the only use. Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26io_uring: only assign io_uring_enter() SQPOLL error in actual error caseJens Axboe
[ upstream commit 21f965221e7c42609521342403e8fb91b8b3e76e ] If an SQPOLL based ring is newly created and an application issues an io_uring_enter(2) system call on it, then we can return a spurious -EOWNERDEAD error. This happens because there's nothing to submit, and if the caller doesn't specify any other action, the initial error assignment of -EOWNERDEAD never gets overwritten. This causes us to return it directly, even if it isn't valid. Move the error assignment into the actual failure case instead. Cc: stable@vger.kernel.org Fixes: d9d05217cb69 ("io_uring: stop SQPOLL submit on creator's death") Reported-by: Sherlock Holo sherlockya@gmail.com Link: https://github.com/axboe/liburing/issues/413 Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26io_uring: fix xa_alloc_cycle() error return value checkJens Axboe
[ upstream commit a30f895ad3239f45012e860d4f94c1a388b36d14 ] We currently check for ret != 0 to indicate error, but '1' is a valid return and just indicates that the allocation succeeded with a wrap. Correct the check to be for < 0, like it was before the xarray conversion. Cc: stable@vger.kernel.org Fixes: 61cf93700fe6 ("io_uring: Convert personality_idr to XArray") Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26fs: warn about impending deprecation of mandatory locksJeff Layton
[ Upstream commit fdd92b64d15bc4aec973caa25899afd782402e68 ] We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros have disabled it. Warn the stragglers that still use "-o mand" that we'll be dropping support for that mount option. Cc: stable@vger.kernel.org Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-26btrfs: prevent rename2 from exchanging a subvol with a directory from ↵NeilBrown
different parents [ Upstream commit 3f79f6f6247c83f448c8026c3ee16d4636ef8d4f ] Cross-rename lacks a check when that would prevent exchanging a directory and subvolume from different parent subvolume. This causes data inconsistencies and is caught before commit by tree-checker, turning the filesystem to read-only. Calling the renameat2 with RENAME_EXCHANGE flags like renameat2(AT_FDCWD, namesrc, AT_FDCWD, namedest, (1 << 1)) on two paths: namesrc = dir1/subvol1/dir2 namedest = subvol2/subvol3 will cause key order problem with following write time tree-checker report: [1194842.307890] BTRFS critical (device loop1): corrupt leaf: root=5 block=27574272 slot=10 ino=258, invalid previous key objectid, have 257 expect 258 [1194842.322221] BTRFS info (device loop1): leaf 27574272 gen 8 total ptrs 11 free space 15444 owner 5 [1194842.331562] BTRFS info (device loop1): refs 2 lock_owner 0 current 26561 [1194842.338772] item 0 key (256 1 0) itemoff 16123 itemsize 160 [1194842.338793] inode generation 3 size 16 mode 40755 [1194842.338801] item 1 key (256 12 256) itemoff 16111 itemsize 12 [1194842.338809] item 2 key (256 84 2248503653) itemoff 16077 itemsize 34 [1194842.338817] dir oid 258 type 2 [1194842.338823] item 3 key (256 84 2363071922) itemoff 16043 itemsize 34 [1194842.338830] dir oid 257 type 2 [1194842.338836] item 4 key (256 96 2) itemoff 16009 itemsize 34 [1194842.338843] item 5 key (256 96 3) itemoff 15975 itemsize 34 [1194842.338852] item 6 key (257 1 0) itemoff 15815 itemsize 160 [1194842.338863] inode generation 6 size 8 mode 40755 [1194842.338869] item 7 key (257 12 256) itemoff 15801 itemsize 14 [1194842.338876] item 8 key (257 84 2505409169) itemoff 15767 itemsize 34 [1194842.338883] dir oid 256 type 2 [1194842.338888] item 9 key (257 96 2) itemoff 15733 itemsize 34 [1194842.338895] item 10 key (258 12 256) itemoff 15719 itemsize 14 [1194842.339163] BTRFS error (device loop1): block=27574272 write time tree block corruption detected [1194842.339245] ------------[ cut here ]------------ [1194842.443422] WARNING: CPU: 6 PID: 26561 at fs/btrfs/disk-io.c:449 csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.511863] CPU: 6 PID: 26561 Comm: kworker/u17:2 Not tainted 5.14.0-rc3-git+ #793 [1194842.511870] Hardware name: empty empty/S3993, BIOS PAQEX0-3 02/24/2008 [1194842.511876] Workqueue: btrfs-worker-high btrfs_work_helper [btrfs] [1194842.511976] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs] [1194842.512068] RSP: 0018:ffffa2c284d77da0 EFLAGS: 00010282 [1194842.512074] RAX: 0000000000000000 RBX: 0000000000001000 RCX: ffff928867bd9978 [1194842.512078] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff928867bd9970 [1194842.512081] RBP: ffff92876b958000 R08: 0000000000000001 R09: 00000000000c0003 [1194842.512085] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000 [1194842.512088] R13: ffff92875f989f98 R14: 0000000000000000 R15: 0000000000000000 [1194842.512092] FS: 0000000000000000(0000) GS:ffff928867a00000(0000) knlGS:0000000000000000 [1194842.512095] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1194842.512099] CR2: 000055f5384da1f0 CR3: 0000000102fe4000 CR4: 00000000000006e0 [1194842.512103] Call Trace: [1194842.512128] ? run_one_async_free+0x10/0x10 [btrfs] [1194842.631729] btree_csum_one_bio+0x1ac/0x1d0 [btrfs] [1194842.631837] run_one_async_start+0x18/0x30 [btrfs] [1194842.631938] btrfs_work_helper+0xd5/0x1d0 [btrfs] [1194842.647482] process_one_work+0x262/0x5e0 [1194842.647520] worker_thread+0x4c/0x320 [1194842.655935] ? process_one_work+0x5e0/0x5e0 [1194842.655946] kthread+0x135/0x160 [1194842.655953] ? set_kthread_struct+0x40/0x40 [1194842.655965] ret_from_fork+0x1f/0x30 [1194842.672465] irq event stamp: 1729 [1194842.672469] hardirqs last enabled at (1735): [<ffffffffbd1104f5>] console_trylock_spinning+0x185/0x1a0 [1194842.672477] hardirqs last disabled at (1740): [<ffffffffbd1104cc>] console_trylock_spinning+0x15c/0x1a0 [1194842.672482] softirqs last enabled at (1666): [<ffffffffbdc002e1>] __do_softirq+0x2e1/0x50a [1194842.672491] softirqs last disabled at (1651): [<ffffffffbd08aab7>] __irq_exit_rcu+0xa7/0xd0 The corrupted data will not be written, and filesystem can be unmounted and mounted again (all changes since the last commit will be lost). Add the missing check for new_ino so that all non-subvolumes must reside under the same parent subvolume. There's an exception allowing to exchange two subvolumes from any parents as the directory representing a subvolume is only a logical link and does not have any other structures related to the parent subvolume, unlike files, directories etc, that are always in the inode namespace of the parent subvolume. Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT") CC: stable@vger.kernel.org # 4.7+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-18ceph: take snap_empty_lock atomically with snaprealm refcount changeJeff Layton
commit 8434ffe71c874b9c4e184b88d25de98c2bf5fe3f upstream. There is a race in ceph_put_snap_realm. The change to the nref and the spinlock acquisition are not done atomically, so you could decrement nref, and before you take the spinlock, the nref is incremented again. At that point, you end up putting it on the empty list when it shouldn't be there. Eventually __cleanup_empty_realms runs and frees it when it's still in-use. Fix this by protecting the 1->0 transition with atomic_dec_and_lock, and just drop the spinlock if we can get the rwsem. Because these objects can also undergo a 0->1 refcount transition, we must protect that change as well with the spinlock. Increment locklessly unless the value is at 0, in which case we take the spinlock, increment and then take it off the empty list if it did the 0->1 transition. With these changes, I'm removing the dout() messages from these functions, as well as in __put_snap_realm. They've always been racy, and it's better to not print values that may be misleading. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46419 Reported-by: Mark Nelson <mnelson@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Luis Henriques <lhenriques@suse.de> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: clean up locking annotation for ceph_get_snap_realm and ↵Jeff Layton
__lookup_snap_realm commit df2c0cb7f8e8c83e495260ad86df8c5da947f2a7 upstream. They both say that the snap_rwsem must be held for write, but I don't see any real reason for it, and it's not currently always called that way. The lookup is just walking the rbtree, so holding it for read should be fine there. The "get" is bumping the refcount and (possibly) removing it from the empty list. I see no need to hold the snap_rwsem for write for that. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: add some lockdep assertions around snaprealm handlingJeff Layton
commit a6862e6708c15995bc10614b2ef34ca35b4b9078 upstream. Turn some comments into lockdep asserts. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18vboxsf: Add support for the atomic_open directory-inode opHans de Goede
commit 52dfd86aa568e433b24357bb5fc725560f1e22d8 upstream. Opening a new file is done in 2 steps on regular filesystems: 1. Call the create inode-op on the parent-dir to create an inode to hold the meta-data related to the file. 2. Call the open file-op to get a handle for the file. vboxsf however does not really use disk-backed inodes because it is based on passing through file-related system-calls through to the hypervisor. So both steps translate to an open(2) call being passed through to the hypervisor. With the handle returned by the first call immediately being closed again. Making 2 open calls for a single open(..., O_CREATE, ...) calls has 2 problems: a) It is not really efficient. b) It actually breaks some apps. An example of b) is doing a git clone inside a vboxsf mount. When git clone tries to create a tempfile to store the pak files which is downloading the following happens: 1. vboxsf_dir_mkfile() gets called with a mode of 0444 and succeeds. 2. vboxsf_file_open() gets called with file->f_flags containing O_RDWR. When the host is a Linux machine this fails because doing a open(..., O_RDWR) on a file which exists and has mode 0444 results in an -EPERM error. Other network-filesystems and fuse avoid the problem of needing to pass 2 open() calls to the other side by using the atomic_open directory-inode op. This commit fixes git clone not working inside a vboxsf mount, by adding support for the atomic_open directory-inode op. As an added bonus this should also make opening new files faster. The atomic_open implementation is modelled after the atomic_open implementations from the 9p and fuse code. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Reported-by: Ludovic Pouzenc <bugreports@pouzenc.fr> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18vboxsf: Add vboxsf_[create|release]_sf_handle() helpersHans de Goede
commit 02f840f90764f22f5c898901849bdbf0cee752ba upstream. Factor out the code to create / release a struct vboxsf_handle into 2 new helper functions. This is a preparation patch for adding atomic_open support. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18ceph: reduce contention in ceph_check_delayed_caps()Luis Henriques
commit bf2ba432213fade50dd39f2e348085b758c0726e upstream. Function ceph_check_delayed_caps() is called from the mdsc->delayed_work workqueue and it can be kept looping for quite some time if caps keep being added back to the mdsc->cap_delay_list. This may result in the watchdog tainting the kernel with the softlockup flag. This patch breaks this loop if the caps have been recently (i.e. during the loop execution). Any new caps added to the list will be handled in the next run. Also, allow schedule_delayed() callers to explicitly set the delay value instead of defaulting to 5s, so we can ensure that it runs soon afterward if it looks like there is more work. Cc: stable@vger.kernel.org URL: https://tracker.ceph.com/issues/46284 Signed-off-by: Luis Henriques <lhenriques@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-18cifs: create sd context must be a multiple of 8Shyam Prasad N
commit 7d3fc01796fc895e5fcce45c994c5a8db8120a8d upstream. We used to follow the rule earlier that the create SD context always be a multiple of 8. However, with the change: cifs: refactor create_sd_buf() and and avoid corrupting the buffer ...we recompute the length, and we failed that rule. Fixing that with this change. Cc: <stable@vger.kernel.org> # v5.10+ Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-15ovl: prevent private clone if bind mount is not allowedMiklos Szeredi
commit 427215d85e8d1476da1a86b8d67aceb485eb3631 upstream. Add the following checks from __do_loopback() to clone_private_mount() as well: - verify that the mount is in the current namespace - verify that there are no locked children Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de> Fixes: c771d683a62e ("vfs: introduce clone_private_mount()") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-15vboxsf: Make vboxsf_dir_create() return the handle for the created fileHans de Goede
commit ab0c29687bc7a890d1a86ac376b0b0fd78b2d9b6 upstream Make vboxsf_dir_create() optionally return the vboxsf-handle for the created file. This is a preparation patch for adding atomic_open support. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-15vboxsf: Honor excl flag to the dir-inode create opHans de Goede
commit cc3ddee97cff034cea4d095de4a484c92a219bf5 upstream Honor the excl flag to the dir-inode create op, instead of behaving as if it is always set. Note the old behavior still worked most of the time since a non-exclusive open only calls the create op, if there is a race and the file is created between the dentry lookup and the calling of the create call. While at it change the type of the is_dir parameter to the vboxsf_dir_create() helper from an int to a bool, to be consistent with the use of bool for the excl parameter. Fixes: 0fd169576648 ("fs: Add VirtualBox guest shared folder (vboxsf) support") Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2021-08-12smb3: rc uninitialized in one fallocate pathSteve French
[ Upstream commit 5ad4df56cd2158965f73416d41fce37906724822 ] Clang detected a problem with rc possibly being unitialized (when length is zero) in a recently added fallocate code path. Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-12reiserfs: check directory items on read from diskShreyansh Chouhan
[ Upstream commit 13d257503c0930010ef9eed78b689cec417ab741 ] While verifying the leaf item that we read from the disk, reiserfs doesn't check the directory items, this could cause a crash when we read a directory item from the disk that has an invalid deh_location. This patch adds a check to the directory items read from the disk that does a bounds check on deh_location for the directory entries. Any directory entry header with a directory entry offset greater than the item length is considered invalid. Link: https://lore.kernel.org/r/20210709152929.766363-1-chouhan.shreyansh630@gmail.com Reported-by: syzbot+c31a48e6702ccb3d64c9@syzkaller.appspotmail.com Signed-off-by: Shreyansh Chouhan <chouhan.shreyansh630@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-12reiserfs: add check for root_inode in reiserfs_fill_superYu Kuai
[ Upstream commit 2acf15b94d5b8ea8392c4b6753a6ffac3135cd78 ] Our syzcaller report a NULL pointer dereference: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 116e95067 P4D 116e95067 PUD 1080b5067 PMD 0 Oops: 0010 [#1] SMP KASAN CPU: 7 PID: 592 Comm: a.out Not tainted 5.13.0-next-20210629-dirty #67 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-p4 RIP: 0010:0x0 Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. RSP: 0018:ffff888114e779b8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 1ffff110229cef39 RCX: ffffffffaa67e1aa RDX: 0000000000000000 RSI: ffff88810a58ee00 RDI: ffff8881233180b0 RBP: ffffffffac38e9c0 R08: ffffffffaa67e17e R09: 0000000000000001 R10: ffffffffb91c5557 R11: fffffbfff7238aaa R12: ffff88810a58ee00 R13: ffff888114e77aa0 R14: 0000000000000000 R15: ffff8881233180b0 FS: 00007f946163c480(0000) GS:ffff88839f1c0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffffffffd6 CR3: 00000001099c1000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __lookup_slow+0x116/0x2d0 ? page_put_link+0x120/0x120 ? __d_lookup+0xfc/0x320 ? d_lookup+0x49/0x90 lookup_one_len+0x13c/0x170 ? __lookup_slow+0x2d0/0x2d0 ? reiserfs_schedule_old_flush+0x31/0x130 reiserfs_lookup_privroot+0x64/0x150 reiserfs_fill_super+0x158c/0x1b90 ? finish_unfinished+0xb10/0xb10 ? bprintf+0xe0/0xe0 ? __mutex_lock_slowpath+0x30/0x30 ? __kasan_check_write+0x20/0x30 ? up_write+0x51/0xb0 ? set_blocksize+0x9f/0x1f0 mount_bdev+0x27c/0x2d0 ? finish_unfinished+0xb10/0xb10 ? reiserfs_kill_sb+0x120/0x120 get_super_block+0x19/0x30 legacy_get_tree+0x76/0xf0 vfs_get_tree+0x49/0x160 ? capable+0x1d/0x30 path_mount+0xacc/0x1380 ? putname+0x97/0xd0 ? finish_automount+0x450/0x450 ? kmem_cache_free+0xf8/0x5a0 ? putname+0x97/0xd0 do_mount+0xe2/0x110 ? path_mount+0x1380/0x1380 ? copy_mount_options+0x69/0x140 __x64_sys_mount+0xf0/0x190 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae This is because 'root_inode' is initialized with wrong mode, and it's i_op is set to 'reiserfs_special_inode_operations'. Thus add check for 'root_inode' to fix the problem. Link: https://lore.kernel.org/r/20210702040743.1918552-1-yukuai3@huawei.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2021-08-12ext4: fix potential htree corruption when growing large_dir directoriesTheodore Ts'o
commit 877ba3f729fd3d8ef0e29bc2a55e57cfa54b2e43 upstream. Commit b5776e7524af ("ext4: fix potential htree index checksum corruption) removed a required restart when multiple levels of index nodes need to be split. Fix this to avoid directory htree corruptions when using the large_dir feature. Cc: stable@kernel.org # v5.11 Cc: Благодаренко Артём <artem.blagodarenko@gmail.com> Fixes: b5776e7524af ("ext4: fix potential htree index checksum corruption) Reported-by: Denis <denis@voxelsoft.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>