summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-02-28ext4: add cond_resched() to __ext4_find_entry()Shijie Luo
commit 9424ef56e13a1f14c57ea161eed3ecfdc7b2770e upstream. We tested a soft lockup problem in linux 4.19 which could also be found in linux 5.x. When dir inode takes up a large number of blocks, and if the directory is growing when we are searching, it's possible the restart branch could be called many times, and the do while loop could hold cpu a long time. Here is the call trace in linux 4.19. [ 473.756186] Call trace: [ 473.756196] dump_backtrace+0x0/0x198 [ 473.756199] show_stack+0x24/0x30 [ 473.756205] dump_stack+0xa4/0xcc [ 473.756210] watchdog_timer_fn+0x300/0x3e8 [ 473.756215] __hrtimer_run_queues+0x114/0x358 [ 473.756217] hrtimer_interrupt+0x104/0x2d8 [ 473.756222] arch_timer_handler_virt+0x38/0x58 [ 473.756226] handle_percpu_devid_irq+0x90/0x248 [ 473.756231] generic_handle_irq+0x34/0x50 [ 473.756234] __handle_domain_irq+0x68/0xc0 [ 473.756236] gic_handle_irq+0x6c/0x150 [ 473.756238] el1_irq+0xb8/0x140 [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4] [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4] [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4] [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4] [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4] [ 473.756402] ext4_lookup+0x8c/0x258 [ext4] [ 473.756407] __lookup_hash+0x8c/0xe8 [ 473.756411] filename_create+0xa0/0x170 [ 473.756413] do_mkdirat+0x6c/0x140 [ 473.756415] __arm64_sys_mkdirat+0x28/0x38 [ 473.756419] el0_svc_common+0x78/0x130 [ 473.756421] el0_svc_handler+0x38/0x78 [ 473.756423] el0_svc+0x8/0xc [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149] Add cond_resched() to avoid soft lockup and to provide a better system responding. Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix a data race in EXT4_I(inode)->i_disksizeQian Cai
commit 35df4299a6487f323b0aca120ea3f485dfee2ae3 upstream. EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4] write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127: ext4_write_end+0x4e3/0x750 [ext4] ext4_update_i_disksize at fs/ext4/ext4.h:3032 (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046 (inlined by) ext4_write_end at fs/ext4/inode.c:1287 generic_perform_write+0x208/0x2a0 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37: ext4_writepages+0x10ac/0x1d00 [ext4] mpage_map_and_submit_extent at fs/ext4/inode.c:2468 (inlined by) ext4_writepages at fs/ext4/inode.c:2772 do_writepages+0x5e/0x130 __writeback_single_inode+0xeb/0xb20 writeback_sb_inodes+0x429/0x900 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x4bd/0x870 wb_workfn+0x6b4/0x960 process_one_work+0x54c/0xbe0 worker_thread+0x80/0x650 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 Reported by Kernel Concurrency Sanitizer on: CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: writeback wb_workfn (flush-7:0) Since only the read is operating as lockless (outside of the "i_data_sem"), load tearing could introduce a logic bug. Fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the write. Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28jbd2: fix ocfs2 corrupt when clearing block group bitswangyan
commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6 upstream. I found a NULL pointer dereference in ocfs2_block_group_clear_bits(). The running environment: kernel version: 4.19 A cluster with two nodes, 5 luns mounted on two nodes, and do some file operations like dd/fallocate/truncate/rm on every lun with storage network disconnection. The fallocate operation on dm-23-45 caused an null pointer dereference. The information of NULL pointer dereference as follows: [577992.878282] JBD2: Error -5 detected when updating journal superblock for dm-23-45. [577992.878290] Aborting journal on device dm-23-45. ... [577992.890778] JBD2: Error -5 detected when updating journal superblock for dm-24-46. [577992.890908] __journal_remove_journal_head: freeing b_committed_data [577992.890916] (fallocate,88392,52):ocfs2_extend_trans:474 ERROR: status = -30 [577992.890918] __journal_remove_journal_head: freeing b_committed_data [577992.890920] (fallocate,88392,52):ocfs2_rotate_tree_right:2500 ERROR: status = -30 [577992.890922] __journal_remove_journal_head: freeing b_committed_data [577992.890924] (fallocate,88392,52):ocfs2_do_insert_extent:4382 ERROR: status = -30 [577992.890928] (fallocate,88392,52):ocfs2_insert_extent:4842 ERROR: status = -30 [577992.890928] __journal_remove_journal_head: freeing b_committed_data [577992.890930] (fallocate,88392,52):ocfs2_add_clusters_in_btree:4947 ERROR: status = -30 [577992.890933] __journal_remove_journal_head: freeing b_committed_data [577992.890939] __journal_remove_journal_head: freeing b_committed_data [577992.890949] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 [577992.890950] Mem abort info: [577992.890951] ESR = 0x96000004 [577992.890952] Exception class = DABT (current EL), IL = 32 bits [577992.890952] SET = 0, FnV = 0 [577992.890953] EA = 0, S1PTW = 0 [577992.890954] Data abort info: [577992.890955] ISV = 0, ISS = 0x00000004 [577992.890956] CM = 0, WnR = 0 [577992.890958] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000f8da07a9 [577992.890960] [0000000000000020] pgd=0000000000000000 [577992.890964] Internal error: Oops: 96000004 [#1] SMP [577992.890965] Process fallocate (pid: 88392, stack limit = 0x00000000013db2fd) [577992.890968] CPU: 52 PID: 88392 Comm: fallocate Kdump: loaded Tainted: G W OE 4.19.36 #1 [577992.890969] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019 [577992.890971] pstate: 60400009 (nZCv daif +PAN -UAO) [577992.891054] pc : _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891082] lr : _ocfs2_free_suballoc_bits+0x618/0x968 [ocfs2] [577992.891084] sp : ffff0000c8e2b810 [577992.891085] x29: ffff0000c8e2b820 x28: 0000000000000000 [577992.891087] x27: 00000000000006f3 x26: ffffa07957b02e70 [577992.891089] x25: ffff807c59d50000 x24: 00000000000006f2 [577992.891091] x23: 0000000000000001 x22: ffff807bd39abc30 [577992.891093] x21: ffff0000811d9000 x20: ffffa07535d6a000 [577992.891097] x19: ffff000001681638 x18: ffffffffffffffff [577992.891098] x17: 0000000000000000 x16: ffff000080a03df0 [577992.891100] x15: ffff0000811d9708 x14: 203d207375746174 [577992.891101] x13: 73203a524f525245 x12: 20373439343a6565 [577992.891103] x11: 0000000000000038 x10: 0101010101010101 [577992.891106] x9 : ffffa07c68a85d70 x8 : 7f7f7f7f7f7f7f7f [577992.891109] x7 : 0000000000000000 x6 : 0000000000000080 [577992.891110] x5 : 0000000000000000 x4 : 0000000000000002 [577992.891112] x3 : ffff000001713390 x2 : 2ff90f88b1c22f00 [577992.891114] x1 : ffff807bd39abc30 x0 : 0000000000000000 [577992.891116] Call trace: [577992.891139] _ocfs2_free_suballoc_bits+0x63c/0x968 [ocfs2] [577992.891162] _ocfs2_free_clusters+0x100/0x290 [ocfs2] [577992.891185] ocfs2_free_clusters+0x50/0x68 [ocfs2] [577992.891206] ocfs2_add_clusters_in_btree+0x198/0x5e0 [ocfs2] [577992.891227] ocfs2_add_inode_data+0x94/0xc8 [ocfs2] [577992.891248] ocfs2_extend_allocation+0x1bc/0x7a8 [ocfs2] [577992.891269] ocfs2_allocate_extents+0x14c/0x338 [ocfs2] [577992.891290] __ocfs2_change_file_space+0x3f8/0x610 [ocfs2] [577992.891309] ocfs2_fallocate+0xe4/0x128 [ocfs2] [577992.891316] vfs_fallocate+0x11c/0x250 [577992.891317] ksys_fallocate+0x54/0x88 [577992.891319] __arm64_sys_fallocate+0x28/0x38 [577992.891323] el0_svc_common+0x78/0x130 [577992.891325] el0_svc_handler+0x38/0x78 [577992.891327] el0_svc+0x8/0xc My analysis process as follows: ocfs2_fallocate __ocfs2_change_file_space ocfs2_allocate_extents ocfs2_extend_allocation ocfs2_add_inode_data ocfs2_add_clusters_in_btree ocfs2_insert_extent ocfs2_do_insert_extent ocfs2_rotate_tree_right ocfs2_extend_rotate_transaction ocfs2_extend_trans jbd2_journal_restart jbd2__journal_restart /* handle->h_transaction is NULL, * is_handle_aborted(handle) is true */ handle->h_transaction = NULL; start_this_handle return -EROFS; ocfs2_free_clusters _ocfs2_free_clusters _ocfs2_free_suballoc_bits ocfs2_block_group_clear_bits ocfs2_journal_access_gd __ocfs2_journal_access jbd2_journal_get_undo_access /* I think jbd2_write_access_granted() will * return true, because do_get_write_access() * will return -EROFS. */ if (jbd2_write_access_granted(...)) return 0; do_get_write_access /* handle->h_transaction is NULL, it will * return -EROFS here, so do_get_write_access() * was not called. */ if (is_handle_aborted(handle)) return -EROFS; /* bh2jh(group_bh) is NULL, caused NULL pointer dereference */ undo_bg = (struct ocfs2_group_desc *) bh2jh(group_bh)->b_committed_data; If handle->h_transaction == NULL, then jbd2_write_access_granted() does not really guarantee that journal_head will stay around, not even speaking of its b_committed_data. The bh2jh(group_bh) can be removed after ocfs2_journal_access_gd() and before call "bh2jh(group_bh)->b_committed_data". So, we should move is_handle_aborted() check from do_get_write_access() into jbd2_journal_get_undo_access() and jbd2_journal_get_write_access() before the call to jbd2_write_access_granted(). Link: https://lore.kernel.org/r/f72a623f-b3f1-381a-d91d-d22a1c83a336@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28btrfs: handle logged extent failure properlyJosef Bacik
commit bd727173e4432fe6cb70ba108dc1f3602c5409d7 upstream. If we're allocating a logged extent we attempt to insert an extent record for the file extent directly. We increase space_info->bytes_reserved, because the extent entry addition will call btrfs_update_block_group(), which will convert the ->bytes_reserved to ->bytes_used. However if we fail at any point while inserting the extent entry we will bail and leave space on ->bytes_reserved, which will trigger a WARN_ON() on umount. Fix this by pinning the space if we fail to insert, which is what happens in every other failure case that involves adding the extent entry. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28btrfs: don't set path->leave_spinning for truncateJosef Bacik
commit 52e29e331070cd7d52a64cbf1b0958212a340e28 upstream. The only time we actually leave the path spinning is if we're truncating a small amount and don't actually free an extent, which is not a common occurrence. We have to set the path blocking in order to add the delayed ref anyway, so the first extent we find we set the path to blocking and stay blocking for the duration of the operation. With the upcoming file extent map stuff there will be another case that we have to have the path blocking, so just swap to blocking always. Note: this patch also fixes a warning after 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap") got merged that inserts extent locks around truncation so the path must not leave spinning locks after btrfs_search_slot. [70.794783] BUG: sleeping function called from invalid context at mm/slab.h:565 [70.794834] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1141, name: rsync [70.794863] 5 locks held by rsync/1141: [70.794876] #0: ffff888417b9c408 (sb_writers#17){.+.+}, at: mnt_want_write+0x20/0x50 [70.795030] #1: ffff888428de28e8 (&type->i_mutex_dir_key#13/1){+.+.}, at: lock_rename+0xf1/0x100 [70.795051] #2: ffff888417b9c608 (sb_internal#2){.+.+}, at: start_transaction+0x394/0x560 [70.795124] #3: ffff888403081768 (btrfs-fs-01){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160 [70.795203] #4: ffff888403086568 (btrfs-fs-00){++++}, at: btrfs_try_tree_write_lock+0x2f/0x160 [70.795222] CPU: 5 PID: 1141 Comm: rsync Not tainted 5.6.0-rc2-backup+ #2 [70.795362] Call Trace: [70.795374] dump_stack+0x71/0xa0 [70.795445] ___might_sleep.part.96.cold.106+0xa6/0xb6 [70.795459] kmem_cache_alloc+0x1d3/0x290 [70.795471] alloc_extent_state+0x22/0x1c0 [70.795544] __clear_extent_bit+0x3ba/0x580 [70.795557] ? _raw_spin_unlock_irq+0x24/0x30 [70.795569] btrfs_truncate_inode_items+0x339/0xe50 [70.795647] btrfs_evict_inode+0x269/0x540 [70.795659] ? dput.part.38+0x29/0x460 [70.795671] evict+0xcd/0x190 [70.795682] __dentry_kill+0xd6/0x180 [70.795754] dput.part.38+0x2ad/0x460 [70.795765] do_renameat2+0x3cb/0x540 [70.795777] __x64_sys_rename+0x1c/0x20 Reported-by: Dave Jones <davej@codemonkey.org.uk> Fixes: 28553fa992cb ("Btrfs: fix race between shrinking truncate and fiemap") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28Btrfs: fix race between shrinking truncate and fiemapFilipe Manana
commit 28553fa992cb28be6a65566681aac6cafabb4f2d upstream. When there is a fiemap executing in parallel with a shrinking truncate we can end up in a situation where we have extent maps for which we no longer have corresponding file extent items. This is generally harmless and at the moment the only consequences are missing file extent items representing holes after we expand the file size again after the truncate operation removed the prealloc extent items, and stale information for future fiemap calls (reporting extents that no longer exist or may have been reallocated to other files for example). Consider the following example: 1) Our inode has a size of 128KiB, one 128KiB extent at file offset 0 and a 1MiB prealloc extent at file offset 128KiB; 2) Task A starts doing a shrinking truncate of our inode to reduce it to a size of 64KiB. Before it searches the subvolume tree for file extent items to delete, it drops all the extent maps in the range from 64KiB to (u64)-1 by calling btrfs_drop_extent_cache(); 3) Task B starts doing a fiemap against our inode. When looking up for the inode's extent maps in the range from 128KiB to (u64)-1, it doesn't find any in the inode's extent map tree, since they were removed by task A. Because it didn't find any in the extent map tree, it scans the inode's subvolume tree for file extent items, and it finds the 1MiB prealloc extent at file offset 128KiB, then it creates an extent map based on that file extent item and adds it to inode's extent map tree (this ends up being done by btrfs_get_extent() <- btrfs_get_extent_fiemap() <- get_extent_skip_holes()); 4) Task A then drops the prealloc extent at file offset 128KiB and shrinks the 128KiB extent file offset 0 to a length of 64KiB. The truncation operation finishes and we end up with an extent map representing a 1MiB prealloc extent at file offset 128KiB, despite we don't have any more that extent; After this the two types of problems we have are: 1) Future calls to fiemap always report that a 1MiB prealloc extent exists at file offset 128KiB. This is stale information, no longer correct; 2) If the size of the file is increased, by a truncate operation that increases the file size or by a write into a file offset > 64KiB for example, we end up not inserting file extent items to represent holes for any range between 128KiB and 128KiB + 1MiB, since the hole expansion function, btrfs_cont_expand() will skip hole insertion for any range for which an extent map exists that represents a prealloc extent. This causes fsck to complain about missing file extent items when not using the NO_HOLES feature. The second issue could be often triggered by test case generic/561 from fstests, which runs fsstress and duperemove in parallel, and duperemove does frequent fiemap calls. Essentially the problems happens because fiemap does not acquire the inode's lock while truncate does, and fiemap locks the file range in the inode's iotree while truncate does not. So fix the issue by making btrfs_truncate_inode_items() lock the file range from the new file size to (u64)-1, so that it serializes with fiemap. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ecryptfs: fix a memory leak bug in ecryptfs_init_messaging()Wenwen Wang
commit b4a81b87a4cfe2bb26a4a943b748d96a43ef20e8 upstream. In ecryptfs_init_messaging(), if the allocation for 'ecryptfs_msg_ctx_arr' fails, the previously allocated 'ecryptfs_daemon_hash' is not deallocated, leading to a memory leak bug. To fix this issue, free 'ecryptfs_daemon_hash' before returning the error. Cc: stable@vger.kernel.org Fixes: 88b4a07e6610 ("[PATCH] eCryptfs: Public key transport mechanism") Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ecryptfs: fix a memory leak bug in parse_tag_1_packet()Wenwen Wang
commit fe2e082f5da5b4a0a92ae32978f81507ef37ec66 upstream. In parse_tag_1_packet(), if tag 1 packet contains a key larger than ECRYPTFS_MAX_ENCRYPTED_KEY_BYTES, no cleanup is executed, leading to a memory leak on the allocated 'auth_tok_list_item'. To fix this issue, go to the label 'out_free' to perform the cleanup work. Cc: stable@vger.kernel.org Fixes: dddfa461fc89 ("[PATCH] eCryptfs: Public key; packet management") Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Tyler Hicks <tyhicks@canonical.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-24fuse: don't overflow LLONG_MAX with end offsetMiklos Szeredi
[ Upstream commit 2f1398291bf35fe027914ae7a9610d8e601fbfde ] Handle the special case of fuse_readpages() wanting to read the last page of a hugest file possible and overflowing the end offset in the process. This is basically to unbreak xfstests:generic/525 and prevent filesystems from doing bad things with an overflowing offset. Reported-by: Xiao Yang <ice_yangxiao@163.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24cifs: log warning message (once) if out of disk spaceSteve French
[ Upstream commit d6fd41905ec577851734623fb905b1763801f5ef ] We ran into a confusing problem where an application wasn't checking return code on close and so user didn't realize that the application ran out of disk space. log a warning message (once) in these cases. For example: [ 8407.391909] Out of space writing to \\oleg-server\small-share Signed-off-by: Steve French <stfrench@microsoft.com> Reported-by: Oleg Kravtsov <oleg@tuxera.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24help_next should increase position indexVasily Averin
[ Upstream commit 9f198a2ac543eaaf47be275531ad5cbd50db3edf ] if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24NFS: Fix memory leaksWenwen Wang
[ Upstream commit 123c23c6a7b7ecd2a3d6060bea1d94019f71fd66 ] In _nfs42_proc_copy(), 'res->commit_res.verf' is allocated through kzalloc() if 'args->sync' is true. In the following code, if 'res->synchronous' is false, handle_async_copy() will be invoked. If an error occurs during the invocation, the following code will not be executed and the error will be returned . However, the allocated 'res->commit_res.verf' is not deallocated, leading to a memory leak. This is also true if the invocation of process_copy_commit() returns an error. To fix the above leaks, redirect the execution to the 'out' label if an error is encountered. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()Yunfeng Ye
[ Upstream commit aacee5446a2a1aa35d0a49dab289552578657fb4 ] The variable inode may be NULL in reiserfs_insert_item(), but there is no check before accessing the member of inode. Fix this by adding NULL pointer check before calling reiserfs_debug(). Link: http://lkml.kernel.org/r/79c5135d-ff25-1cc9-4e99-9f572b88cc00@huawei.com Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Cc: zhengbin <zhengbin13@huawei.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ocfs2: fix a NULL pointer dereference when call ocfs2_update_inode_fsync_trans()wangyan
[ Upstream commit 9f16ca48fc818a17de8be1f75d08e7f4addc4497 ] I found a NULL pointer dereference in ocfs2_update_inode_fsync_trans(), handle->h_transaction may be NULL in this situation: ocfs2_file_write_iter ->__generic_file_write_iter ->generic_perform_write ->ocfs2_write_begin ->ocfs2_write_begin_nolock ->ocfs2_write_cluster_by_desc ->ocfs2_write_cluster ->ocfs2_mark_extent_written ->ocfs2_change_extent_flag ->ocfs2_split_extent ->ocfs2_try_to_merge_extent ->ocfs2_extend_rotate_transaction ->ocfs2_extend_trans ->jbd2_journal_restart ->jbd2__journal_restart // handle->h_transaction is NULL here ->handle->h_transaction = NULL; ->start_this_handle /* journal aborted due to storage network disconnection, return error */ ->return -EROFS; /* line 3806 in ocfs2_try_to_merge_extent (), it will ignore ret error. */ ->ret = 0; ->... ->ocfs2_write_end ->ocfs2_write_end_nolock ->ocfs2_update_inode_fsync_trans // NULL pointer dereference ->oi->i_sync_tid = handle->h_transaction->t_tid; The information of NULL pointer dereference as follows: JBD2: Detected IO errors while flushing file data on dm-11-45 Aborting journal on device dm-11-45. JBD2: Error -5 detected when updating journal superblock for dm-11-45. (dd,22081,3):ocfs2_extend_trans:474 ERROR: status = -30 (dd,22081,3):ocfs2_try_to_merge_extent:3877 ERROR: status = -30 Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 Mem abort info: ESR = 0x96000004 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x00000004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000e74e1338 [0000000000000008] pgd=0000000000000000 Internal error: Oops: 96000004 [#1] SMP Process dd (pid: 22081, stack limit = 0x00000000584f35a9) CPU: 3 PID: 22081 Comm: dd Kdump: loaded Hardware name: Huawei TaiShan 2280 V2/BC82AMDD, BIOS 0.98 08/25/2019 pstate: 60400009 (nZCv daif +PAN -UAO) pc : ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2] lr : ocfs2_write_end_nolock+0x2a0/0x550 [ocfs2] sp : ffff0000459fba70 x29: ffff0000459fba70 x28: 0000000000000000 x27: ffff807ccf7f1000 x26: 0000000000000001 x25: ffff807bdff57970 x24: ffff807caf1d4000 x23: ffff807cc79e9000 x22: 0000000000001000 x21: 000000006c6cd000 x20: ffff0000091d9000 x19: ffff807ccb239db0 x18: ffffffffffffffff x17: 000000000000000e x16: 0000000000000007 x15: ffff807c5e15bd78 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: 0000000000000000 x10: 0000000000000001 x9 : 0000000000000228 x8 : 000000000000000c x7 : 0000000000000fff x6 : ffff807a308ed6b0 x5 : ffff7e01f10967c0 x4 : 0000000000000018 x3 : d0bc661572445600 x2 : 0000000000000000 x1 : 000000001b2e0200 x0 : 0000000000000000 Call trace: ocfs2_write_end_nolock+0x2b8/0x550 [ocfs2] ocfs2_write_end+0x4c/0x80 [ocfs2] generic_perform_write+0x108/0x1a8 __generic_file_write_iter+0x158/0x1c8 ocfs2_file_write_iter+0x668/0x950 [ocfs2] __vfs_write+0x11c/0x190 vfs_write+0xac/0x1c0 ksys_write+0x6c/0xd8 __arm64_sys_write+0x24/0x30 el0_svc_common+0x78/0x130 el0_svc_handler+0x38/0x78 el0_svc+0x8/0xc To prevent NULL pointer dereference in this situation, we use is_handle_aborted() before using handle->h_transaction->t_tid. Link: http://lkml.kernel.org/r/03e750ab-9ade-83aa-b000-b9e81e34e539@huawei.com Signed-off-by: Yan Wang <wangyan122@huawei.com> Reviewed-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ocfs2: make local header paths relative to C filesMasahiro Yamada
[ Upstream commit ca322fb6030956c2337fbf1c1beeb08c5dd5c943 ] Gang He reports the failure of building fs/ocfs2/ as an external module of the kernel installed on the system: $ cd fs/ocfs2 $ make -C /lib/modules/`uname -r`/build M=`pwd` modules If you want to make it work reliably, I'd recommend to remove ccflags-y from the Makefiles, and to make header paths relative to the C files. I think this is the correct usage of the #include "..." directive. Link: http://lkml.kernel.org/r/20191227022950.14804-1-ghe@suse.com Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Gang He <ghe@suse.com> Reported-by: Gang He <ghe@suse.com> Reviewed-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24btrfs: do not do delalloc reservation under page lockJosef Bacik
[ Upstream commit f4b1363cae43fef7c86c993b7ca7fe7d546b3c68 ] We ran into a deadlock in production with the fixup worker. The stack traces were as follows: Thread responsible for the writeout, waiting on the page lock [<0>] io_schedule+0x12/0x40 [<0>] __lock_page+0x109/0x1e0 [<0>] extent_write_cache_pages+0x206/0x360 [<0>] extent_writepages+0x40/0x60 [<0>] do_writepages+0x31/0xb0 [<0>] __writeback_single_inode+0x3d/0x350 [<0>] writeback_sb_inodes+0x19d/0x3c0 [<0>] __writeback_inodes_wb+0x5d/0xb0 [<0>] wb_writeback+0x231/0x2c0 [<0>] wb_workfn+0x308/0x3c0 [<0>] process_one_work+0x1e0/0x390 [<0>] worker_thread+0x2b/0x3c0 [<0>] kthread+0x113/0x130 [<0>] ret_from_fork+0x35/0x40 [<0>] 0xffffffffffffffff Thread of the fixup worker who is holding the page lock [<0>] start_delalloc_inodes+0x241/0x2d0 [<0>] btrfs_start_delalloc_roots+0x179/0x230 [<0>] btrfs_alloc_data_chunk_ondemand+0x11b/0x2e0 [<0>] btrfs_check_data_free_space+0x53/0xa0 [<0>] btrfs_delalloc_reserve_space+0x20/0x70 [<0>] btrfs_writepage_fixup_worker+0x1fc/0x2a0 [<0>] normal_work_helper+0x11c/0x360 [<0>] process_one_work+0x1e0/0x390 [<0>] worker_thread+0x2b/0x3c0 [<0>] kthread+0x113/0x130 [<0>] ret_from_fork+0x35/0x40 [<0>] 0xffffffffffffffff Thankfully the stars have to align just right to hit this. First you have to end up in the fixup worker, which is tricky by itself (my reproducer does DIO reads into a MMAP'ed region, so not a common operation). Then you have to have less than a page size of free data space and 0 unallocated space so you go down the "commit the transaction to free up pinned space" path. This was accomplished by a random balance that was running on the host. Then you get this deadlock. I'm still in the process of trying to force the deadlock to happen on demand, but I've hit other issues. I can still trigger the fixup worker path itself so this patch has been tested in that regard, so the normal case is fine. Fixes: 87826df0ec36 ("btrfs: delalloc for page dirtied out-of-band in fixup worker") Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ceph: check availability of mds cluster on mount after wait timeoutXiubo Li
[ Upstream commit 97820058fb2831a4b203981fa2566ceaaa396103 ] If all the MDS daemons are down for some reason, then the first mount attempt will fail with EIO after the mount request times out. A mount attempt will also fail with EIO if all of the MDS's are laggy. This patch changes the code to return -EHOSTUNREACH in these situations and adds a pr_info error message to help the admin determine the cause. URL: https://tracker.ceph.com/issues/4386 Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24cifs: fix NULL dereference in match_prepathRonnie Sahlberg
[ Upstream commit fe1292686333d1dadaf84091f585ee903b9ddb84 ] RHBZ: 1760879 Fix an oops in match_prepath() by making sure that the prepath string is not NULL before we pass it into strcmp(). This is similar to other checks we make for example in cifs_root_iget() Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24cifs: Fix mount options set in automountPaulo Alcantara (SUSE)
[ Upstream commit 5739375ee4230980166807d347cc21c305532bbc ] Starting from 4a367dc04435, we must set the mount options based on the DFS full path rather than the resolved target, that is, cifs_mount() will be responsible for resolving the DFS link (cached) as well as performing failover to any other targets in the referral. Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reported-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com> Fixes: 4a367dc04435 ("cifs: Add support for failover in cifs_mount()") Link: https://lore.kernel.org/linux-cifs/39643d7d-2abb-14d3-ced6-c394fab9a777@prodrive-technologies.com Tested-by: Martijn de Gouw <martijn.de.gouw@prodrive-technologies.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24cifs: fix unitialized variable poential problem with network I/O cache lock ↵Steve French
patch [ Upstream commit 463a7b457c02250a84faa1d23c52da9e3364aed2 ] static analysis with Coverity detected an issue with the following commit: Author: Paulo Alcantara (SUSE) <pc@cjr.nz> Date: Wed Dec 4 17:38:03 2019 -0300 cifs: Avoid doing network I/O while holding cache lock Addresses-Coverity: ("Uninitialized pointer read") Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24jbd2: make sure ESHUTDOWN to be recorded in the journal superblockzhangyi (F)
[ Upstream commit 0e98c084a21177ef136149c6a293b3d1eb33ff92 ] Commit fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer") want to allow jbd2 layer to distinguish shutdown journal abort from other error cases. So the ESHUTDOWN should be taken precedence over any other errno which has already been recoded after EXT4_FLAGS_SHUTDOWN is set, but it only update errno in the journal suoerblock now if the old errno is 0. Fixes: fb7c02445c49 ("ext4: pass -ESHUTDOWN code to jbd2 layer") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24jbd2: switch to use jbd2_journal_abort() when failed to submit the commit recordzhangyi (F)
[ Upstream commit d0a186e0d3e7ac05cc77da7c157dae5aa59f95d9 ] We invoke jbd2_journal_abort() to abort the journal and record errno in the jbd2 superblock when committing journal transaction besides the failure on submitting the commit record. But there is no need for the case and we can also invoke jbd2_journal_abort() instead of __jbd2_journal_abort_hard(). Fixes: 818d276ceb83a ("ext4: Add the journal checksum feature") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24btrfs: Fix split-brain handling when changing FSID to metadata uuidNikolay Borisov
[ Upstream commit 1362089d2ad7e20d16371b39d3c11990d4ec23e4 ] Current code doesn't correctly handle the situation which arises when a file system that has METADATA_UUID_INCOMPAT flag set and has its FSID changed to the one in metadata uuid. This causes the incompat flag to disappear. In case of a power failure we could end up in a situation where part of the disks in a multi-disk filesystem are correctly reverted to METADATA_UUID_INCOMPAT flag unset state, while others have METADATA_UUID_INCOMPAT set and CHANGING_FSID_V2_IN_PROGRESS. This patch corrects the behavior required to handle the case where a disk of the second type is scanned first, creating the necessary btrfs_fs_devices. Subsequently, when a disk which has already completed the transition is scanned it should overwrite the data in btrfs_fs_devices. Reported-by: Su Yue <Damenly_Su@gmx.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24btrfs: separate definition of assertion failure handlersDavid Sterba
[ Upstream commit 68c467cbb2f389b6c933e235bce0d1756fc8cc34 ] There's a report where objtool detects unreachable instructions, eg.: fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction This seems to be a false positive due to compiler version. The cause is in the ASSERT macro implementation that does the conditional check as IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef. To avoid that, use the ifdefs directly. There are still 2 reports that aren't fixed: fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24btrfs: device stats, log when stats are zeroedAnand Jain
[ Upstream commit a69976bc69308aa475d0ba3b8b3efd1d013c0460 ] We had a report indicating that some read errors aren't reported by the device stats in the userland. It is important to have the errors reported in the device stat as user land scripts might depend on it to take the reasonable corrective actions. But to debug these issue we need to be really sure that request to reset the device stat did not come from the userland itself. So log an info message when device error reset happens. For example: BTRFS info (device sdc): device stats zeroed by btrfs(9223) Reported-by: philip@philip-seeger.de Link: https://www.spinics.net/lists/linux-btrfs/msg96528.html Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24btrfs: safely advance counter when looking up bio csumsDavid Sterba
[ Upstream commit 4babad10198fa73fe73239d02c2e99e3333f5f5c ] Dan's smatch tool reports fs/btrfs/file-item.c:295 btrfs_lookup_bio_sums() warn: should this be 'count == -1' which points to the while (count--) loop. With count == 0 the check itself could decrement it to -1. There's a WARN_ON a few lines below that has never been seen in practice though. It turns out that the value of page_bytes_left matches the count (by sectorsize multiples). The loop never reaches the state where count would go to -1, because page_bytes_left == 0 is found first and this breaks out. For clarity, use only plain check on count (and only for positive value), decrement safely inside the loop. Any other discrepancy after the whole bio list processing should be reported by the exising WARN_ON_ONCE as well. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24btrfs: fix possible NULL-pointer dereference in integrity checksJohannes Thumshirn
[ Upstream commit 3dbd351df42109902fbcebf27104149226a4fcd9 ] A user reports a possible NULL-pointer dereference in btrfsic_process_superblock(). We are assigning state->fs_info to a local fs_info variable and afterwards checking for the presence of state. While we would BUG_ON() a NULL state anyways, we can also just remove the local fs_info copy, as fs_info is only used once as the first argument for btrfs_num_copies(). There we can just pass in state->fs_info as well. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=205003 Signed-off-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24f2fs: fix memleak of kobjectChao Yu
[ Upstream commit fe396ad8e7526f059f7b8c7290d33a1b84adacab ] If kobject_init_and_add() failed, caller needs to invoke kobject_put() to release kobject explicitly. Signed-off-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24f2fs: free sysfs kobjectJaegeuk Kim
[ Upstream commit 820d366736c949ffe698d3b3fe1266a91da1766d ] Detected kmemleak. Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24f2fs: set I_LINKABLE early to avoid wrong access by vfsJaegeuk Kim
[ Upstream commit 5b1dbb082f196278f82b6a15a13848efacb9ff11 ] This patch moves setting I_LINKABLE early in rename2(whiteout) to avoid the below warning. [ 3189.163385] WARNING: CPU: 3 PID: 59523 at fs/inode.c:358 inc_nlink+0x32/0x40 [ 3189.246979] Call Trace: [ 3189.248707] f2fs_init_inode_metadata+0x2d6/0x440 [f2fs] [ 3189.251399] f2fs_add_inline_entry+0x162/0x8c0 [f2fs] [ 3189.254010] f2fs_add_dentry+0x69/0xe0 [f2fs] [ 3189.256353] f2fs_do_add_link+0xc5/0x100 [f2fs] [ 3189.258774] f2fs_rename2+0xabf/0x1010 [f2fs] [ 3189.261079] vfs_rename+0x3f8/0xaa0 [ 3189.263056] ? tomoyo_path_rename+0x44/0x60 [ 3189.265283] ? do_renameat2+0x49b/0x550 [ 3189.267324] do_renameat2+0x49b/0x550 [ 3189.269316] __x64_sys_renameat2+0x20/0x30 [ 3189.271441] do_syscall_64+0x5a/0x230 [ 3189.273410] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 3189.275848] RIP: 0033:0x7f270b4d9a49 Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24udf: Fix free space reporting for metadata and virtual partitionsJan Kara
[ Upstream commit a4a8b99ec819ca60b49dc582a4287ef03411f117 ] Free space on filesystems with metadata or virtual partition maps currently gets misreported. This is because these partitions are just remapped onto underlying real partitions from which keep track of free blocks. Take this remapping into account when counting free blocks as well. Reviewed-by: Pali Rohár <pali.rohar@gmail.com> Reported-by: Pali Rohár <pali.rohar@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24nfsd: Clone should commit src file metadata tooTrond Myklebust
[ Upstream commit 57f64034966fb945fc958f95f0c51e47af590344 ] vfs_clone_file_range() can modify the metadata on the source file too, so we need to commit that to stable storage as well. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Acked-by: Dave Chinner <david@fromorbit.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24nfs: fix timstamp debug printsArnd Bergmann
[ Upstream commit 057f184b1245150b88e59997fc6f1af0e138d42e ] Starting in v5.5, the timestamps are correctly passed down as 64-bit seconds with NFSv4 on 32-bit machines, but some debug statements still truncate them to 'long'. Fixes: e86d5a02874c ("NFS: Convert struct nfs_fattr to use struct timespec64") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24reiserfs: Fix spurious unlock in reiserfs_fill_super() error handlingJan Kara
[ Upstream commit 4d5c1adaf893b8aa52525d2b81995e949bcb3239 ] When we fail to allocate string for journal device name we jump to 'error' label which tries to unlock reiserfs write lock which is not held. Jump to 'error_unlocked' instead. Fixes: f32485be8397 ("reiserfs: delay reiserfs lock until journal initialization") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24Btrfs: keep pages dirty when using btrfs_writepage_fixup_workerChris Mason
[ Upstream commit 25f3c5021985e885292980d04a1423fd83c967bb ] For COW, btrfs expects pages dirty pages to have been through a few setup steps. This includes reserving space for the new block allocations and marking the range in the state tree for delayed allocation. A few places outside btrfs will dirty pages directly, especially when unmapping mmap'd pages. In order for these to properly go through COW, we run them through a fixup worker to wait for stable pages, and do the delalloc prep. 87826df0ec36 added a window where the dirty pages were cleaned, but pending more action from the fixup worker. We clear_page_dirty_for_io() before we call into writepage, so the page is no longer dirty. The commit changed it so now we leave the page clean between unlocking it here and the fixup worker starting at some point in the future. During this window, page migration can jump in and relocate the page. Once our fixup work actually starts, it finds page->mapping is NULL and we end up freeing the page without ever writing it. This leads to crc errors and other exciting problems, since it screws up the whole statemachine for waiting for ordered extents. The fix here is to keep the page dirty while we're waiting for the fixup worker to get to work. This is accomplished by returning -EAGAIN from btrfs_writepage_cow_fixup if we queued the page up for fixup, which will cause the writepage function to redirty the page. Because we now expect the page to be dirty once it gets to the fixup worker we must adjust the error cases to call clear_page_dirty_for_io() on the page. That is the bulk of the patch, but it is not the fix, the fix is the -EAGAIN from btrfs_writepage_cow_fixup. We cannot separate these two changes out because the error conditions change with the new expectations. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ext4, jbd2: ensure panic when aborting with zero errnozhangyi (F)
[ Upstream commit 51f57b01e4a3c7d7bdceffd84de35144e8c538e7 ] JBD2_REC_ERR flag used to indicate the errno has been updated when jbd2 aborted, and then __ext4_abort() and ext4_handle_error() can invoke panic if ERRORS_PANIC is specified. But if the journal has been aborted with zero errno, jbd2_journal_abort() didn't set this flag so we can no longer panic. Fix this by always record the proper errno in the journal superblock. Fixes: 4327ba52afd03 ("ext4, jbd2: ensure entering into panic after recording an error in superblock") Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191204124614.45424-3-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24udf: Allow writing to 'Rewritable' partitionsJan Kara
[ Upstream commit 15fb05fd286ac57a0802d71624daeb5c1c2d5b07 ] UDF 2.60 standard states in section 2.2.14.2: A partition with Access Type 3 (rewritable) shall define a Freed Space Bitmap or a Freed Space Table, see 2.3.3. All other partitions shall not define a Freed Space Bitmap or a Freed Space Table. Rewritable partitions are used on media that require some form of preprocessing before re-writing data (for example legacy MO). Such partitions shall use Access Type 3. Overwritable partitions are used on media that do not require preprocessing before overwriting data (for example: CD-RW, DVD-RW, DVD+RW, DVD-RAM, BD-RE, HD DVD-Rewritable). Such partitions shall use Access Type 4. however older versions of the standard didn't have this wording and there are tools out there that create UDF filesystems with rewritable partitions but that don't contain a Freed Space Bitmap or a Freed Space Table on media that does not require pre-processing before overwriting a block. So instead of forcing media with rewritable partition read-only, base this decision on presence of a Freed Space Bitmap or a Freed Space Table. Reported-by: Pali Rohár <pali.rohar@gmail.com> Reviewed-by: Pali Rohár <pali.rohar@gmail.com> Fixes: b085fbe2ef7f ("udf: Fix crash during mount") Link: https://lore.kernel.org/linux-fsdevel/20200112144735.hj2emsoy4uwsouxz@pali Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ext4: fix deadlock allocating bio_post_read_ctx from mempoolEric Biggers
[ Upstream commit 68e45330e341dad2d3a0a3f8ef2ec46a2a0a3bbc ] Without any form of coordination, any case where multiple allocations from the same mempool are needed at a time to make forward progress can deadlock under memory pressure. This is the case for struct bio_post_read_ctx, as one can be allocated to decrypt a Merkle tree page during fsverity_verify_bio(), which itself is running from a post-read callback for a data bio which has its own struct bio_post_read_ctx. Fix this by freeing the first bio_post_read_ctx before calling fsverity_verify_bio(). This works because verity (if enabled) is always the last post-read step. This deadlock can be reproduced by trying to read from an encrypted verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching mempool_alloc() to pretend that pool->alloc() always fails. Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually hit this bug in practice would require reading from lots of encrypted verity files at the same time. But it's theoretically possible, as N available objects isn't enough to guarantee forward progress when > N/2 threads each need 2 objects at a time. Fixes: 22cfe4b48ccb ("ext4: add fs-verity read support") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24jbd2: clear JBD2_ABORT flag before journal_reset to update log tail info ↵Kai Li
when load journal [ Upstream commit a09decff5c32060639a685581c380f51b14e1fc2 ] If the journal is dirty when the filesystem is mounted, jbd2 will replay the journal but the journal superblock will not be updated by journal_reset() because JBD2_ABORT flag is still set (it was set in journal_init_common()). This is problematic because when a new transaction is then committed, it will be recorded in block 1 (journal->j_tail was set to 1 in journal_reset()). If unclean shutdown happens again before the journal superblock is updated, the new recorded transaction will not be replayed during the next mount (because of stale sb->s_start and sb->s_sequence values) which can lead to filesystem corruption. Fixes: 85e0c4e89c1b ("jbd2: if the journal is aborted then don't allow update of the log tail") Signed-off-by: Kai Li <li.kai4@h3c.com> Link: https://lore.kernel.org/r/20200111022542.5008-1-li.kai4@h3c.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAITRitesh Harjani
[ Upstream commit f629afe3369e9885fd6e9cc7a4f514b6a65cf9e9 ] Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix AIM7 regression in some scalable filesystems upto ~25% in some cases. Claimed in commit 942491c9e6d6 ("xfs: fix AIM7 regression") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24f2fs: call f2fs_balance_fs outside of locked pageJaegeuk Kim
[ Upstream commit bdf03299248916640a835a05d32841bb3d31912d ] Otherwise, we can hit deadlock by waiting for the locked page in move_data_block in GC. Thread A Thread B - do_page_mkwrite - f2fs_vm_page_mkwrite - lock_page - f2fs_balance_fs - mutex_lock(gc_mutex) - f2fs_gc - do_garbage_collect - ra_data_block - grab_cache_page - f2fs_balance_fs - mutex_lock(gc_mutex) Fixes: 39a8695824510 ("f2fs: refactor ->page_mkwrite() flow") Reviewed-by: Chao Yu <yuchao0@huawei.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24f2fs: preallocate DIO blocks when forcing buffered_ioJaegeuk Kim
[ Upstream commit 47501f87c61ad2aa234add63e1ae231521dbc3f5 ] The previous preallocation and DIO decision like below. allow_outplace_dio !allow_outplace_dio f2fs_force_buffered_io (*) No_Prealloc / Buffered_IO Prealloc / Buffered_IO !f2fs_force_buffered_io No_Prealloc / DIO Prealloc / DIO But, Javier reported Case (*) where zoned device bypassed preallocation but fell back to buffered writes in f2fs_direct_IO(), resulting in stale data being read. In order to fix the issue, actually we need to preallocate blocks whenever we fall back to buffered IO like this. No change is made in the other cases. allow_outplace_dio !allow_outplace_dio f2fs_force_buffered_io (*) Prealloc / Buffered_IO Prealloc / Buffered_IO !f2fs_force_buffered_io No_Prealloc / DIO Prealloc / DIO Reported-and-tested-by: Javier Gonzalez <javier@javigon.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Chao Yu <yuchao0@huawei.com> Reviewed-by: Javier González <javier@javigon.com> Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19NFSv4: Add accounting for the number of active delegations heldTrond Myklebust
[ Upstream commit d2269ea14ebd2a73f291d6b3a7a7d320ec00270c ] In order to better manage our delegation caching, add a counter to track the number of active delegations. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19io-wq: add support for inheriting ->fsJens Axboe
[ Upstream commit 9392a27d88b9707145d713654eb26f0c29789e50 ] Some work items need this for relative path lookup, make it available like the other inherited credentials/mm/etc. Cc: stable@vger.kernel.org # 5.3+ Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19ext4: choose hardlimit when softlimit is larger than hardlimit in ↵Chengguang Xu
ext4_statfs_project() [ Upstream commit 57c32ea42f8e802bda47010418e25043e0c9337f ] Setting softlimit larger than hardlimit seems meaningless for disk quota but currently it is allowed. In this case, there may be a bit of comfusion for users when they run df comamnd to directory which has project quota. For example, we set 20M softlimit and 10M hardlimit of block usage limit for project quota of test_dir(project id 123). [root@hades mnt_ext4]# repquota -P -a *** Report for project quotas on device /dev/loop0 Block grace time: 7days; Inode grace time: 7days Block limits File limits Project used soft hard grace used soft hard grace ---------------------------------------------------------------------- 0 -- 13 0 0 2 0 0 123 -- 10237 20480 10240 5 200 100 The result of df command as below: [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 20M 10M 10M 50% /home/cgxu/test/mnt_ext4 Even though it looks like there is another 10M free space to use, if we write new data to diretory test_dir(inherit project id), the write will fail with errno(-EDQUOT). After this patch, the df result looks like below. [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 10M 10M 3.0K 100% /home/cgxu/test/mnt_ext4 Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191016022501.760-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19NFSv4: Ensure the delegation cred is pinned when we call delegreturnTrond Myklebust
commit 5d63944f8206a80636ae8cb4b9107d3b49f43d37 upstream. Ensure we don't release the delegation cred during the call to nfs4_proc_delegreturn(). Fixes: ee05f456772d ("NFSv4: Fix races between open and delegreturn") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19NFSv4.1 make cachethis=no for writesOlga Kornievskaia
commit cd1b659d8ce7697ee9799b64f887528315b9097b upstream. Turning caching off for writes on the server should improve performance. Fixes: fba83f34119a ("NFS: Pass "privileged" value to nfs4_init_sequence()") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ceph: noacl mount option is effectively ignoredXiubo Li
commit 3b20bc2fe4c0cfd82d35838965dc7ff0b93415c6 upstream. For the old mount API, the module parameters parseing function will be called in ceph_mount() and also just after the default posix acl flag set, so we can control to enable/disable it via the mount option. But for the new mount API, it will call the module parameters parseing function before ceph_get_tree(), so the posix acl will always be enabled. Fixes: 82995cc6c5ae ("libceph, rbd, ceph: convert to use the new mount API") Signed-off-by: Xiubo Li <xiubli@redhat.com> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19cifs: fix mount option display for sec=krb5iPetr Pavlu
commit 3f6166aaf19902f2f3124b5426405e292e8974dd upstream. Fix display for sec=krb5i which was wrongly interleaved by cruid, resulting in string "sec=krb5,cruid=<...>i" instead of "sec=krb5i,cruid=<...>". Fixes: 96281b9e46eb ("smb3: for kerberos mounts display the credential uid used") Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19jbd2: do not clear the BH_Mapped flag when forgetting a metadata bufferzhangyi (F)
commit c96dceeabf765d0b1b1f29c3bf50a5c01315b820 upstream. Commit 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction") set the BH_Freed flag when forgetting a metadata buffer which belongs to the committing transaction, it indicate the committing process clear dirty bits when it is done with the buffer. But it also clear the BH_Mapped flag at the same time, which may trigger below NULL pointer oops when block_size < PAGE_SIZE. rmdir 1 kjournald2 mkdir 2 jbd2_journal_commit_transaction commit transaction N jbd2_journal_forget set_buffer_freed(bh1) jbd2_journal_commit_transaction commit transaction N+1 ... clear_buffer_mapped(bh1) ext4_getblk(bh2 ummapped) ... grow_dev_page init_page_buffers bh1->b_private=NULL bh2->b_private=NULL jbd2_journal_put_journal_head(jh1) __journal_remove_journal_head(hb1) jh1 is NULL and trigger oops *) Dir entry block bh1 and bh2 belongs to one page, and the bh2 has already been unmapped. For the metadata buffer we forgetting, we should always keep the mapped flag and clear the dirty flags is enough, so this patch pick out the these buffers and keep their BH_Mapped flag. Link: https://lore.kernel.org/r/20200213063821.30455-3-yi.zhang@huawei.com Fixes: 904cdbd41d74 ("jbd2: clear dirty flag when revoking a buffer from an older transaction") Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>