summaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2020-05-02ext4: check for non-zero journal inum in ext4_calculate_overheadRitesh Harjani
commit f1eec3b0d0a849996ebee733b053efa71803dad5 upstream. While calculating overhead for internal journal, also check that j_inum shouldn't be 0. Otherwise we get below error with xfstests generic/050 with external journal (XXX_LOGDEV config) enabled. It could be simply reproduced with loop device with an external journal and marking blockdev as RO before mounting. [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode # ------------[ cut here ]------------ generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2) WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0 CPU: 107 PID: 115347 Comm: mount Tainted: G L --------- -t - 4.18.0-167.el8.ppc64le #1 NIP: c0000000006f6d44 LR: c0000000006f6d40 CTR: 0000000030041dd4 <...> NIP [c0000000006f6d44] generic_make_request_checks+0x6b4/0x7d0 LR [c0000000006f6d40] generic_make_request_checks+0x6b0/0x7d0 <...> Call Trace: generic_make_request_checks+0x6b0/0x7d0 (unreliable) generic_make_request+0x3c/0x420 submit_bio+0xd8/0x200 submit_bh_wbc+0x1e8/0x250 __sync_dirty_buffer+0xd0/0x210 ext4_commit_super+0x310/0x420 [ext4] __ext4_error+0xa4/0x1e0 [ext4] __ext4_iget+0x388/0xe10 [ext4] ext4_get_journal_inode+0x40/0x150 [ext4] ext4_calculate_overhead+0x5a8/0x610 [ext4] ext4_fill_super+0x3188/0x3260 [ext4] mount_bdev+0x778/0x8f0 ext4_mount+0x28/0x50 [ext4] mount_fs+0x74/0x230 vfs_kern_mount.part.6+0x6c/0x250 do_mount+0x2fc/0x1280 sys_mount+0x158/0x180 system_call+0x5c/0x70 EXT4-fs (pmem1p2): no journal found EXT4-fs (pmem1p2): can't get journal size EXT4-fs (pmem1p2): mounted filesystem without journal. Opts: dax,norecovery Fixes: 3c816ded78bb ("ext4: use journal inode to determine journal overhead") Reported-by: Harish Sriram <harish@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200316093038.25485-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-05-02ext4: convert BUG_ON's to WARN_ON's in mballoc.cTheodore Ts'o
[ Upstream commit 907ea529fc4c3296701d2bfc8b831dd2a8121a34 ] If the in-core buddy bitmap gets corrupted (or out of sync with the block bitmap), issue a WARN_ON and try to recover. In most cases this involves skipping trying to allocate out of a particular block group. We can end up declaring the file system corrupted, which is fair, since the file system probably should be checked before we proceed any further. Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu Google-Bug-Id: 34811296 Google-Bug-Id: 34639169 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02ext4: increase wait time needed before reuse of deleted inode numbersTheodore Ts'o
[ Upstream commit a17a9d935dc4a50acefaf319d58030f1da7f115a ] Current wait times have proven to be too short to protect against inode reuses that lead to metadata inconsistencies. Now that we will retry the inode allocation if we can't find any recently deleted inodes, it's a lot safer to increase the recently deleted time from 5 seconds to a minute. Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu Google-Bug-Id: 36602237 Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-05-02ext4: use matching invalidatepage in ext4_writepageyangerkun
[ Upstream commit c2a559bc0e7ed5a715ad6b947025b33cb7c05ea7 ] Run generic/388 with journal data mode sometimes may trigger the warning in ext4_invalidatepage. Actually, we should use the matching invalidatepage in ext4_writepage. Signed-off-by: yangerkun <yangerkun@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-29ext4: fix extent_status fragmentation for plain filesDmitry Monakhov
[ Upstream commit 4068664e3cd2312610ceac05b74c4cf1853b8325 ] Extents are cached in read_extent_tree_block(); as a result, extents are not cached for inodes with depth == 0 when we try to find the extent using ext4_find_extent(). The result of the lookup is cached in ext4_map_blocks() but is only a subset of the extent on disk. As a result, the contents of extents status cache can get very badly fragmented for certain workloads, such as a random 4k read workload. File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes) ext: logical_offset: physical_offset: length: expected: flags: 0: 0.. 8191: 40960.. 49151: 8192: last,eof $ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test $ perf script | grep ext4_es_insert_extent | head -n 10 fio 131 [000] 13.975421: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W fio 131 [000] 13.975939: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W fio 131 [000] 13.976467: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W fio 131 [000] 13.976937: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W fio 131 [000] 13.977440: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W fio 131 [000] 13.977931: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W fio 131 [000] 13.978376: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W fio 131 [000] 13.978957: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W fio 131 [000] 13.979474: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W Fix this by caching the extents for inodes with depth == 0 in ext4_find_extent(). [ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this newly added function is not in extents_cache.c, and to avoid potential visual confusion with ext4_es_cache_extent(). -TYT ] Signed-off-by: Dmitry Monakhov <dmonakhov@gmail.com> Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23ext4: do not commit super on read-only bdevEric Sandeen
[ Upstream commit c96e2b8564adfb8ac14469ebc51ddc1bfecb3ae2 ] Under some circumstances we may encounter a filesystem error on a read-only block device, and if we try to save the error info to the superblock and commit it, we'll wind up with a noisy error and backtrace, i.e.: [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode # ------------[ cut here ]------------ generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2) WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0 ... To avoid this, commit the error info in the superblock only if the block device is writable. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/4b6e774d-cc00-3469-7abb-108eb151071a@sandeen.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-04-23ext4: use non-movable memory for superblock readaheadRoman Gushchin
commit d87f639258a6a5980183f11876c884931ad93da2 upstream. Since commit a8ac900b8163 ("ext4: use non-movable memory for the superblock") buffers for ext4 superblock were allocated using the sb_bread_unmovable() helper which allocated buffer heads out of non-movable memory blocks. It was necessarily to not block page migrations and do not cause cma allocation failures. However commit 85c8f176a611 ("ext4: preload block group descriptors") broke this by introducing pre-reading of the ext4 superblock. The problem is that __breadahead() is using __getblk() underneath, which allocates buffer heads out of movable memory. It resulted in page migration failures I've seen on a machine with an ext4 partition and a preallocated cma area. Fix this by introducing sb_breadahead_unmovable() and __breadahead_gfp() helpers which use non-movable memory for buffer head allocations and use them for the ext4 superblock readahead. Reviewed-by: Andreas Dilger <adilger@dilger.ca> Fixes: 85c8f176a611 ("ext4: preload block group descriptors") Signed-off-by: Roman Gushchin <guro@fb.com> Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21ext4: do not zeroout extents beyond i_disksizeJan Kara
commit 801674f34ecfed033b062a0f217506b93c8d5e8a upstream. We do not want to create initialized extents beyond end of file because for e2fsck it is impossible to distinguish them from a case of corrupted file size / extent tree and so it complains like: Inode 12, i_size is 147456, should be 163840. Fix? no Code in ext4_ext_convert_to_initialized() and ext4_split_convert_extents() try to make sure it does not create initialized extents beyond inode size however they check against inode->i_size which is wrong. They should instead check against EXT4_I(inode)->i_disksize which is the current inode size on disk. That's what e2fsck is going to see in case of crash before all dirty data is written. This bug manifests as generic/456 test failure (with recent enough fstests where fsx got fixed to properly pass FALLOC_KEEP_SIZE_FL flags to the kernel) when run with dioread_lock mount option. CC: stable@vger.kernel.org Fixes: 21ca087a3891 ("ext4: Do not zero out uninitialized extents beyond i_size") Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21ext4: fix incorrect inodes per group in error messageJosh Triplett
commit b9c538da4e52a7b79dfcf4cfa487c46125066dfb upstream. If ext4_fill_super detects an invalid number of inodes per group, the resulting error message printed the number of blocks per group, rather than the number of inodes per group. Fix it to print the correct value. Fixes: cd6bb35bf7f6d ("ext4: use more strict checks for inodes_per_block on mount") Link: https://lore.kernel.org/r/8be03355983a08e5d4eed480944613454d7e2550.1585434649.git.josh@joshtriplett.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-21ext4: fix incorrect group count in ext4_fill_super error messageJosh Triplett
commit df41460a21b06a76437af040d90ccee03888e8e5 upstream. ext4_fill_super doublechecks the number of groups before mounting; if that check fails, the resulting error message prints the group count from the ext4_sb_info sbi, which hasn't been set yet. Print the freshly computed group count instead (which at that point has just been computed in "blocks_count"). Signed-off-by: Josh Triplett <josh@joshtriplett.org> Fixes: 4ec1102813798 ("ext4: Add sanity checks for the superblock before mounting the filesystem") Link: https://lore.kernel.org/r/8b957cd1513fcc4550fe675c10bcce2175c33a49.1585431964.git.josh@joshtriplett.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-04-17ext4: fix a data race at inode->i_blocksQian Cai
commit 28936b62e71e41600bab319f262ea9f9b1027629 upstream. inode->i_blocks could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_do_update_inode [ext4] / inode_add_bytes write to 0xffff9a00d4b982d0 of 8 bytes by task 22100 on cpu 118: inode_add_bytes+0x65/0xf0 __inode_add_bytes at fs/stat.c:689 (inlined by) inode_add_bytes at fs/stat.c:702 ext4_mb_new_blocks+0x418/0xca0 [ext4] ext4_ext_map_blocks+0x1a6b/0x27b0 [ext4] ext4_map_blocks+0x1a9/0x950 [ext4] _ext4_get_block+0xfc/0x270 [ext4] ext4_get_block_unwritten+0x33/0x50 [ext4] __block_write_begin_int+0x22e/0xae0 __block_write_begin+0x39/0x50 ext4_write_begin+0x388/0xb50 [ext4] ext4_da_write_begin+0x35f/0x8f0 [ext4] generic_perform_write+0x15d/0x290 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff9a00d4b982d0 of 8 bytes by task 8 on cpu 65: ext4_do_update_inode+0x4a0/0xf60 [ext4] ext4_inode_blocks_set at fs/ext4/inode.c:4815 ext4_mark_iloc_dirty+0xaf/0x160 [ext4] ext4_mark_inode_dirty+0x129/0x3e0 [ext4] ext4_convert_unwritten_extents+0x253/0x2d0 [ext4] ext4_convert_unwritten_io_end_vec+0xc5/0x150 [ext4] ext4_end_io_rsv_work+0x22c/0x350 [ext4] process_one_work+0x54f/0xb90 worker_thread+0x80/0x5f0 kthread+0x1cd/0x1f0 ret_from_fork+0x27/0x50 4 locks held by kworker/u256:0/8: #0: ffff9a025abc4328 ((wq_completion)ext4-rsv-conversion){+.+.}, at: process_one_work+0x443/0xb90 #1: ffffab5a862dbe20 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}, at: process_one_work+0x443/0xb90 #2: ffff9a025a9d0f58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2] #3: ffff9a00d4b985d8 (&(&ei->i_raw_lock)->rlock){+.+.}, at: ext4_do_update_inode+0xaa/0xf60 [ext4] irq event stamp: 3009267 hardirqs last enabled at (3009267): [<ffffffff980da9b7>] __find_get_block+0x107/0x790 hardirqs last disabled at (3009266): [<ffffffff980da8f9>] __find_get_block+0x49/0x790 softirqs last enabled at (3009230): [<ffffffff98a0034c>] __do_softirq+0x34c/0x57c softirqs last disabled at (3009223): [<ffffffff97cc67a2>] irq_exit+0xa2/0xc0 Reported by Kernel Concurrency Sanitizer on: CPU: 65 PID: 8 Comm: kworker/u256:0 Tainted: G L 5.6.0-rc2-next-20200221+ #7 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work [ext4] The plain read is outside of inode->i_lock critical section which results in a data race. Fix it by adding READ_ONCE() there. Link: https://lore.kernel.org/r/20200222043258.2279-1-cai@lca.pw Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-05ext4: potential crash on allocation error in ext4_alloc_flex_bg_array()Dan Carpenter
commit 37b0b6b8b99c0e1c1f11abbe7cf49b6d03795b3f upstream. If sbi->s_flex_groups_allocated is zero and the first allocation fails then this code will crash. The problem is that "i--" will set "i" to -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1 is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX is more than zero, the condition is true so we call kvfree(new_groups[-1]). The loop will carry on freeing invalid memory until it crashes. Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access") Reviewed-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix race between writepages and enabling EXT4_EXTENTS_FLEric Biggers
commit cb85f4d23f794e24127f3e562cb3b54b0803f456 upstream. If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running on it, the following warning in ext4_add_complete_io() can be hit: WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120 Here's a minimal reproducer (not 100% reliable) (root isn't required): while true; do sync done & while true; do rm -f file touch file chattr -e file echo X >> file chattr +e file done The problem is that in ext4_writepages(), ext4_should_dioread_nolock() (which only returns true on extent-based files) is checked once to set the number of reserved journal credits, and also again later to select the flags for ext4_map_blocks() and copy the reserved journal handle to ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set, the first check can see dioread_nolock disabled while the later one can see it enabled, causing the reserved handle to unexpectedly be NULL. Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races related to doing so as well, fix this by synchronizing changing EXT4_EXTENTS_FL with ext4_writepages() via the existing s_writepages_rwsem (previously called s_journal_flag_rwsem). This was originally reported by syzbot without a reproducer at https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf, but now that dioread_nolock is the default I also started seeing this when running syzkaller locally. Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io") Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: rename s_journal_flag_rwsem to s_writepages_rwsemEric Biggers
commit bbd55937de8f2754adc5792b0f8e5ff7d9c0420e upstream. In preparation for making s_journal_flag_rwsem synchronize ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA flags (rather than just JOURNAL_DATA as it does currently), rename it to s_writepages_rwsem. Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix mount failure with quota configured as moduleJan Kara
commit 9db176bceb5c5df4990486709da386edadc6bd1d upstream. When CONFIG_QFMT_V2 is configured as a module, the test in ext4_feature_set_ok() fails and so mount of filesystems with quota or project features fails. Fix the test to use IS_ENABLED macro which works properly even for modules. Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix potential race between s_flex_groups online resizing and accessSuraj Jitindar Singh
commit 7c990728b99ed6fbe9c75fc202fce1172d9916da upstream. During an online resize an array of s_flex_groups structures gets replaced so it can get enlarged. If there is a concurrent access to the array and this memory has been reused then this can lead to an invalid memory access. The s_flex_group array has been converted into an array of pointers rather than an array of structures. This is to ensure that the information contained in the structures cannot get out of sync during a resize due to an accessor updating the value in the old structure after it has been copied but before the array pointer is updated. Since the structures them- selves are no longer copied but only the pointers to them this case is mitigated. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix potential race between s_group_info online resizing and accessSuraj Jitindar Singh
commit df3da4ea5a0fc5d115c90d5aa6caa4dd433750a7 upstream. During an online resize an array of pointers to s_group_info gets replaced so it can get enlarged. If there is a concurrent access to the array in ext4_get_group_info() and this memory has been reused then this can lead to an invalid memory access. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu Signed-off-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Balbir Singh <sblbir@amazon.com> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix potential race between online resizing and write operationsTheodore Ts'o
commit 1d0c3924a92e69bfa91163bda83c12a994b4d106 upstream. During an online resize an array of pointers to buffer heads gets replaced so it can get enlarged. If there is a racing block allocation or deallocation which uses the old array, and the old array has gotten reused this can lead to a GPF or some other random kernel memory getting modified. Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443 Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu Reported-by: Suraj Jitindar Singh <surajjs@amazon.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: add cond_resched() to __ext4_find_entry()Shijie Luo
commit 9424ef56e13a1f14c57ea161eed3ecfdc7b2770e upstream. We tested a soft lockup problem in linux 4.19 which could also be found in linux 5.x. When dir inode takes up a large number of blocks, and if the directory is growing when we are searching, it's possible the restart branch could be called many times, and the do while loop could hold cpu a long time. Here is the call trace in linux 4.19. [ 473.756186] Call trace: [ 473.756196] dump_backtrace+0x0/0x198 [ 473.756199] show_stack+0x24/0x30 [ 473.756205] dump_stack+0xa4/0xcc [ 473.756210] watchdog_timer_fn+0x300/0x3e8 [ 473.756215] __hrtimer_run_queues+0x114/0x358 [ 473.756217] hrtimer_interrupt+0x104/0x2d8 [ 473.756222] arch_timer_handler_virt+0x38/0x58 [ 473.756226] handle_percpu_devid_irq+0x90/0x248 [ 473.756231] generic_handle_irq+0x34/0x50 [ 473.756234] __handle_domain_irq+0x68/0xc0 [ 473.756236] gic_handle_irq+0x6c/0x150 [ 473.756238] el1_irq+0xb8/0x140 [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4] [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4] [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4] [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4] [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4] [ 473.756402] ext4_lookup+0x8c/0x258 [ext4] [ 473.756407] __lookup_hash+0x8c/0xe8 [ 473.756411] filename_create+0xa0/0x170 [ 473.756413] do_mkdirat+0x6c/0x140 [ 473.756415] __arm64_sys_mkdirat+0x28/0x38 [ 473.756419] el0_svc_common+0x78/0x130 [ 473.756421] el0_svc_handler+0x38/0x78 [ 473.756423] el0_svc+0x8/0xc [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149] Add cond_resched() to avoid soft lockup and to provide a better system responding. Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-28ext4: fix a data race in EXT4_I(inode)->i_disksizeQian Cai
commit 35df4299a6487f323b0aca120ea3f485dfee2ae3 upstream. EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by KCSAN, BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4] write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127: ext4_write_end+0x4e3/0x750 [ext4] ext4_update_i_disksize at fs/ext4/ext4.h:3032 (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046 (inlined by) ext4_write_end at fs/ext4/inode.c:1287 generic_perform_write+0x208/0x2a0 ext4_buffered_write_iter+0x11f/0x210 [ext4] ext4_file_write_iter+0xce/0x9e0 [ext4] new_sync_write+0x29c/0x3b0 __vfs_write+0x92/0xa0 vfs_write+0x103/0x260 ksys_write+0x9d/0x130 __x64_sys_write+0x4c/0x60 do_syscall_64+0x91/0xb47 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37: ext4_writepages+0x10ac/0x1d00 [ext4] mpage_map_and_submit_extent at fs/ext4/inode.c:2468 (inlined by) ext4_writepages at fs/ext4/inode.c:2772 do_writepages+0x5e/0x130 __writeback_single_inode+0xeb/0xb20 writeback_sb_inodes+0x429/0x900 __writeback_inodes_wb+0xc4/0x150 wb_writeback+0x4bd/0x870 wb_workfn+0x6b4/0x960 process_one_work+0x54c/0xbe0 worker_thread+0x80/0x650 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 Reported by Kernel Concurrency Sanitizer on: CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5 Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019 Workqueue: writeback wb_workfn (flush-7:0) Since only the read is operating as lockless (outside of the "i_data_sem"), load tearing could introduce a logic bug. Fix it by adding READ_ONCE() for the read and WRITE_ONCE() for the write. Signed-off-by: Qian Cai <cai@lca.pw> Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-24ext4: fix deadlock allocating bio_post_read_ctx from mempoolEric Biggers
[ Upstream commit 68e45330e341dad2d3a0a3f8ef2ec46a2a0a3bbc ] Without any form of coordination, any case where multiple allocations from the same mempool are needed at a time to make forward progress can deadlock under memory pressure. This is the case for struct bio_post_read_ctx, as one can be allocated to decrypt a Merkle tree page during fsverity_verify_bio(), which itself is running from a post-read callback for a data bio which has its own struct bio_post_read_ctx. Fix this by freeing the first bio_post_read_ctx before calling fsverity_verify_bio(). This works because verity (if enabled) is always the last post-read step. This deadlock can be reproduced by trying to read from an encrypted verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching mempool_alloc() to pretend that pool->alloc() always fails. Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually hit this bug in practice would require reading from lots of encrypted verity files at the same time. But it's theoretically possible, as N available objects isn't enough to guarantee forward progress when > N/2 threads each need 2 objects at a time. Fixes: 22cfe4b48ccb ("ext4: add fs-verity read support") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-24ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAITRitesh Harjani
[ Upstream commit f629afe3369e9885fd6e9cc7a4f514b6a65cf9e9 ] Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix AIM7 regression in some scalable filesystems upto ~25% in some cases. Claimed in commit 942491c9e6d6 ("xfs: fix AIM7 regression") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19ext4: choose hardlimit when softlimit is larger than hardlimit in ↵Chengguang Xu
ext4_statfs_project() [ Upstream commit 57c32ea42f8e802bda47010418e25043e0c9337f ] Setting softlimit larger than hardlimit seems meaningless for disk quota but currently it is allowed. In this case, there may be a bit of comfusion for users when they run df comamnd to directory which has project quota. For example, we set 20M softlimit and 10M hardlimit of block usage limit for project quota of test_dir(project id 123). [root@hades mnt_ext4]# repquota -P -a *** Report for project quotas on device /dev/loop0 Block grace time: 7days; Inode grace time: 7days Block limits File limits Project used soft hard grace used soft hard grace ---------------------------------------------------------------------- 0 -- 13 0 0 2 0 0 123 -- 10237 20480 10240 5 200 100 The result of df command as below: [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 20M 10M 10M 50% /home/cgxu/test/mnt_ext4 Even though it looks like there is another 10M free space to use, if we write new data to diretory test_dir(inherit project id), the write will fail with errno(-EDQUOT). After this patch, the df result looks like below. [root@hades mnt_ext4]# df -h test_dir Filesystem Size Used Avail Use% Mounted on /dev/loop0 10M 10M 3.0K 100% /home/cgxu/test/mnt_ext4 Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191016022501.760-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-02-19ext4: improve explanation of a mount failure caused by a misconfigured kernelTheodore Ts'o
commit d65d87a07476aa17df2dcb3ad18c22c154315bec upstream. If CONFIG_QFMT_V2 is not enabled, but CONFIG_QUOTA is enabled, when a user tries to mount a file system with the quota or project quota enabled, the kernel will emit a very confusing messsage: EXT4-fs warning (device vdc): ext4_enable_quotas:5914: Failed to enable quota tracking (type=0, err=-3). Please run e2fsck to fix. EXT4-fs (vdc): mount failed We will now report an explanatory message indicating which kernel configuration options have to be enabled, to avoid customer/sysadmin confusion. Link: https://lore.kernel.org/r/20200215012738.565735-1-tytso@mit.edu Google-Bug-Id: 149093531 Fixes: 7c319d328505b778 ("ext4: make quota as first class supported feature") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: add cond_resched() to ext4_protect_reserved_inodeShijie Luo
commit af133ade9a40794a37104ecbcc2827c0ea373a3c upstream. When journal size is set too big by "mkfs.ext4 -J size=", or when we mount a crafted image to make journal inode->i_size too big, the loop, "while (i < num)", holds cpu too long. This could cause soft lockup. [ 529.357541] Call trace: [ 529.357551] dump_backtrace+0x0/0x198 [ 529.357555] show_stack+0x24/0x30 [ 529.357562] dump_stack+0xa4/0xcc [ 529.357568] watchdog_timer_fn+0x300/0x3e8 [ 529.357574] __hrtimer_run_queues+0x114/0x358 [ 529.357576] hrtimer_interrupt+0x104/0x2d8 [ 529.357580] arch_timer_handler_virt+0x38/0x58 [ 529.357584] handle_percpu_devid_irq+0x90/0x248 [ 529.357588] generic_handle_irq+0x34/0x50 [ 529.357590] __handle_domain_irq+0x68/0xc0 [ 529.357593] gic_handle_irq+0x6c/0x150 [ 529.357595] el1_irq+0xb8/0x140 [ 529.357599] __ll_sc_atomic_add_return_acquire+0x14/0x20 [ 529.357668] ext4_map_blocks+0x64/0x5c0 [ext4] [ 529.357693] ext4_setup_system_zone+0x330/0x458 [ext4] [ 529.357717] ext4_fill_super+0x2170/0x2ba8 [ext4] [ 529.357722] mount_bdev+0x1a8/0x1e8 [ 529.357746] ext4_mount+0x44/0x58 [ext4] [ 529.357748] mount_fs+0x50/0x170 [ 529.357752] vfs_kern_mount.part.9+0x54/0x188 [ 529.357755] do_mount+0x5ac/0xd78 [ 529.357758] ksys_mount+0x9c/0x118 [ 529.357760] __arm64_sys_mount+0x28/0x38 [ 529.357764] el0_svc_common+0x78/0x130 [ 529.357766] el0_svc_handler+0x38/0x78 [ 529.357769] el0_svc+0x8/0xc [ 541.356516] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [mount:18674] Link: https://lore.kernel.org/r/20200211011752.29242-1-luoshijie1@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Shijie Luo <luoshijie1@huawei.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: fix checksum errors with indexed dirsJan Kara
commit 48a34311953d921235f4d7bbd2111690d2e469cf upstream. DIR_INDEX has been introduced as a compat ext4 feature. That means that even kernels / tools that don't understand the feature may modify the filesystem. This works because for kernels not understanding indexed dir format, internal htree nodes appear just as empty directory entries. Index dir aware kernels then check the htree structure is still consistent before using the data. This all worked reasonably well until metadata checksums were introduced. The problem is that these effectively made DIR_INDEX only ro-compatible because internal htree nodes store checksums in a different place than normal directory blocks. Thus any modification ignorant to DIR_INDEX (or just clearing EXT4_INDEX_FL from the inode) will effectively cause checksum mismatch and trigger kernel errors. So we have to be more careful when dealing with indexed directories on filesystems with checksumming enabled. 1) We just disallow loading any directory inodes with EXT4_INDEX_FL when DIR_INDEX is not enabled. This is harsh but it should be very rare (it means someone disabled DIR_INDEX on existing filesystem and didn't run e2fsck), e2fsck can fix the problem, and we don't want to answer the difficult question: "Should we rather corrupt the directory more or should we ignore that DIR_INDEX feature is not set?" 2) When we find out htree structure is corrupted (but the filesystem and the directory should in support htrees), we continue just ignoring htree information for reading but we refuse to add new entries to the directory to avoid corrupting it more. Link: https://lore.kernel.org/r/20200210144316.22081-1-jack@suse.cz Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: fix support for inode sizes > 1024 bytesTheodore Ts'o
commit 4f97a68192bd33b9963b400759cef0ca5963af00 upstream. A recent commit, 9803387c55f7 ("ext4: validate the debug_want_extra_isize mount option at parse time"), moved mount-time checks around. One of those changes moved the inode size check before the blocksize variable was set to the blocksize of the file system. After 9803387c55f7 was set to the minimum allowable blocksize, which in practice on most systems would be 1024 bytes. This cuased file systems with inode sizes larger than 1024 bytes to be rejected with a message: EXT4-fs (sdXX): unsupported inode size: 4096 Fixes: 9803387c55f7 ("ext4: validate the debug_want_extra_isize mount option at parse time") Link: https://lore.kernel.org/r/20200206225252.GA3673@mit.edu Reported-by: Herbert Poetzl <herbert@13thfloor.at> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-19ext4: don't assume that mmp_nodename/bdevname have NULAndreas Dilger
commit 14c9ca0583eee8df285d68a0e6ec71053efd2228 upstream. Don't assume that the mmp_nodename and mmp_bdevname strings are NUL terminated, since they are filled in by snprintf(), which is not guaranteed to do so. Link: https://lore.kernel.org/r/1580076215-1048-1-git-send-email-adilger@dilger.ca Signed-off-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11ext4: fix race conditions in ->d_compare() and ->d_hash()Eric Biggers
commit ec772f01307a2c06ebf6cdd221e6b518a71ddae7 upstream. Since ->d_compare() and ->d_hash() can be called in RCU-walk mode, ->d_parent and ->d_inode can be concurrently modified, and in particular, ->d_inode may be changed to NULL. For ext4_d_hash() this resulted in a reproducible NULL dereference if a lookup is done in a directory being deleted, e.g. with: int main() { if (fork()) { for (;;) { mkdir("subdir", 0700); rmdir("subdir"); } } else { for (;;) access("subdir/file", 0); } } ... or by running the 't_encrypted_d_revalidate' program from xfstests. Both repros work in any directory on a filesystem with the encoding feature, even if the directory doesn't actually have the casefold flag. I couldn't reproduce a crash in ext4_d_compare(), but it appears that a similar crash is possible there. Fix these bugs by reading ->d_parent and ->d_inode using READ_ONCE() and falling back to the case sensitive behavior if the inode is NULL. Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups") Cc: <stable@vger.kernel.org> # v5.2+ Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20200124041234.159740-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-02-11ext4: fix deadlock allocating crypto bounce page from mempoolEric Biggers
commit 547c556f4db7c09447ecf5f833ab6aaae0c5ab58 upstream. ext4_writepages() on an encrypted file has to encrypt the data, but it can't modify the pagecache pages in-place, so it encrypts the data into bounce pages and writes those instead. All bounce pages are allocated from a mempool using GFP_NOFS. This is not correct use of a mempool, and it can deadlock. This is because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never fail" mode for mempool_alloc() where a failed allocation will fall back to waiting for one of the preallocated elements in the pool. But since this mode is used for all a bio's pages and not just the first, it can deadlock waiting for pages already in the bio to be freed. This deadlock can be reproduced by patching mempool_alloc() to pretend that pool->alloc() always fails (so that it always falls back to the preallocations), and then creating an encrypted file of size > 128 KiB. Fix it by only using GFP_NOFS for the first page in the bio. For subsequent pages just use GFP_NOWAIT, and if any of those fail, just submit the bio and start a new one. This will need to be fixed in f2fs too, but that's less straightforward. Fixes: c9af28fdd449 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-01-04ext4: iomap that extends beyond EOF should be marked dirtyMatthew Bobrowski
[ Upstream commit 2e9b51d78229d5145725a481bb5464ebc0a3f9b2 ] This patch addresses what Dave Chinner had discovered and fixed within commit: 7684e2c4384d. This changes does not have any user visible impact for ext4 as none of the current users of ext4_iomap_begin() that extend files depend on IOMAP_F_DIRTY. When doing a direct IO that spans the current EOF, and there are written blocks beyond EOF that extend beyond the current write, the only metadata update that needs to be done is a file size extension. However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that there is IO completion metadata updates required, and hence we may fail to correctly sync file size extensions made in IO completion when O_DSYNC writes are being used and the hardware supports FUA. Hence when setting IOMAP_F_DIRTY, we need to also take into account whether the iomap spans the current EOF. If it does, then we need to mark it dirty so that IO completion will call generic_write_sync() to flush the inode size update to stable storage correctly. Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/8b43ee9ee94bee5328da56ba0909b7d2229ef150.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2020-01-04ext4: update direct I/O read lock pattern for IOCB_NOWAITMatthew Bobrowski
[ Upstream commit 548feebec7e93e58b647dba70b3303dcb569c914 ] This patch updates the lock pattern in ext4_direct_IO_read() to not block on inode lock in cases of IOCB_NOWAIT direct I/O reads. The locking condition implemented here is similar to that of 942491c9e6d6 ("xfs: fix AIM7 regression"). Fixes: 16c54688592c ("ext4: Allow parallel DIO reads") Signed-off-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/c5d5e759f91747359fbd2c6f9a36240cf75ad79f.1572949325.git.mbobrowski@mbobrowski.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-12-31ext4: validate the debug_want_extra_isize mount option at parse timeTheodore Ts'o
commit 9803387c55f7d2ce69aa64340c5fdc6b3027dbc8 upstream. Instead of setting s_want_extra_size and then making sure that it is a valid value afterwards, validate the field before we set it. This avoids races and other problems when remounting the file system. Link: https://lore.kernel.org/r/20191215063020.GA11512@mit.edu Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reported-and-tested-by: syzbot+4a39a025912b265cacef@syzkaller.appspotmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: unlock on error in ext4_expand_extra_isize()Dan Carpenter
commit 7f420d64a08c1dcd65b27be82a27cf2bdb2e7847 upstream. We need to unlock the xattr before returning on this error path. Cc: stable@kernel.org # 4.13 Fixes: c03b45b853f5 ("ext4, project: expand inode extra size if possible") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20191213185010.6k7yl2tck3wlsdkt@kili.mountain Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: check for directory entries too close to block endJan Kara
commit 109ba779d6cca2d519c5dd624a3276d03e21948e upstream. ext4_check_dir_entry() currently does not catch a case when a directory entry ends so close to the block end that the header of the next directory entry would not fit in the remaining space. This can lead to directory iteration code trying to access address beyond end of current buffer head leading to oops. CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-3-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-31ext4: fix ext4_empty_dir() for directories with holesJan Kara
commit 64d4ce892383b2ad6d782e080d25502f91bf2a38 upstream. Function ext4_empty_dir() doesn't correctly handle directories with holes and crashes on bh->b_data dereference when bh is NULL. Reorganize the loop to use 'offset' variable all the times instead of comparing pointers to current direntry with bh->b_data pointer. Also add more strict checking of '.' and '..' directory entries to avoid entering loop in possibly invalid state on corrupted filesystems. References: CVE-2019-19037 CC: stable@vger.kernel.org Fixes: 4e19d6b65fb4 ("ext4: allow directory holes") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191202170213.4761-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext4: fix leak of quota reservationsJan Kara
commit f4c2d372b89a1e504ebb7b7eb3e29b8306479366 upstream. Commit 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") moved freeing of delayed allocation reservations from dirty page invalidation time to time when we evict corresponding status extent from extent status tree. For inodes which don't have any blocks allocated this may actually happen only in ext4_clear_blocks() which is after we've dropped references to quota structures from the inode. Thus reservation of quota leaked. Fix the problem by clearing quota information from the inode only after evicting extent status tree in ext4_clear_inode(). Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages") Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext4: fix a bug in ext4_wait_for_tail_page_commityangerkun
commit 565333a1554d704789e74205989305c811fd9c7a upstream. No need to wait for any commit once the page is fully truncated. Besides, it may confuse e.g. concurrent ext4_writepage() with the page still be dirty (will be cleared by truncate_pagecache() in ext4_setattr()) but buffers has been freed; and then trigger a bug show as below: [ 26.057508] ------------[ cut here ]------------ [ 26.058531] kernel BUG at fs/ext4/inode.c:2134! ... [ 26.088130] Call trace: [ 26.088695] ext4_writepage+0x914/0xb28 [ 26.089541] writeout.isra.4+0x1b4/0x2b8 [ 26.090409] move_to_new_page+0x3b0/0x568 [ 26.091338] __unmap_and_move+0x648/0x988 [ 26.092241] unmap_and_move+0x48c/0xbb8 [ 26.093096] migrate_pages+0x220/0xb28 [ 26.093945] kernel_mbind+0x828/0xa18 [ 26.094791] __arm64_sys_mbind+0xc8/0x138 [ 26.095716] el0_svc_common+0x190/0x490 [ 26.096571] el0_svc_handler+0x60/0xd0 [ 26.097423] el0_svc+0x8/0xc Run the procedure (generate by syzkaller) parallel with ext3. void main() { int fd, fd1, ret; void *addr; size_t length = 4096; int flags; off_t offset = 0; char *str = "12345"; fd = open("a", O_RDWR | O_CREAT); assert(fd >= 0); /* Truncate to 4k */ ret = ftruncate(fd, length); assert(ret == 0); /* Journal data mode */ flags = 0xc00f; ret = ioctl(fd, _IOW('f', 2, long), &flags); assert(ret == 0); /* Truncate to 0 */ fd1 = open("a", O_TRUNC | O_NOATIME); assert(fd1 >= 0); addr = mmap(NULL, length, PROT_WRITE | PROT_READ, MAP_SHARED, fd, offset); assert(addr != (void *)-1); memcpy(addr, str, 5); mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE); } And the bug will be triggered once we seen the below order. reproduce1 reproduce2 ... | ... truncate to 4k | change to journal data mode | | memcpy(set page dirty) truncate to 0: | ext4_setattr: | ... | ext4_wait_for_tail_page_commit | | mbind(trigger bug) truncate_pagecache(clean dirty)| ... ... | mbind will call ext4_writepage() since the page still be dirty, and then report the bug since the buffers has been free. Fix it by return directly once offset equals to 0 which means the page has been fully truncated. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: yangerkun <yangerkun@huawei.com> Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext4: work around deleting a file with i_nlink == 0 safelyTheodore Ts'o
commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream. If the file system is corrupted such that a file's i_links_count is too small, then it's possible that when unlinking that file, i_nlink will already be zero. Previously we were working around this kind of corruption by forcing i_nlink to one; but we were doing this before trying to delete the directory entry --- and if the file system is corrupted enough that ext4_delete_entry() fails, then we exit with i_nlink elevated, and this causes the orphan inode list handling to be FUBAR'ed, such that when we unmount the file system, the orphan inode list can get corrupted. A better way to fix this is to simply skip trying to call drop_nlink() if i_nlink is already zero, thus moving the check to the place where it makes the most sense. https://bugzilla.kernel.org/show_bug.cgi?id=205433 Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-17ext4: Fix credit estimate for final inode freeingJan Kara
commit 65db869c754e7c271691dd5feabf884347e694f5 upstream. Estimate for the number of credits needed for final freeing of inode in ext4_evict_inode() was to small. We may modify 4 blocks (inode & sb for orphan deletion, bitmap & group descriptor for inode freeing) and not just 3. [ Fixed minor whitespace nit. -- TYT ] Fixes: e50e5129f384 ("ext4: xattr-in-inode support") CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-12-04ext4: add more paranoia checking in ext4_expand_extra_isize handlingTheodore Ts'o
commit 4ea99936a1630f51fc3a2d61a58ec4a1c4b7d55a upstream. It's possible to specify a non-zero s_want_extra_isize via debugging option, and this can cause bad things(tm) to happen when using a file system with an inode size of 128 bytes. Add better checking when the file system is mounted, as well as when we are actually doing the trying to do the inode expansion. Link: https://lore.kernel.org/r/20191110121510.GH23325@mit.edu Reported-by: syzbot+f8d6f8386ceacdbfff57@syzkaller.appspotmail.com Reported-by: syzbot+33d7ea72e47de3bdf4e1@syzkaller.appspotmail.com Reported-by: syzbot+44b6763edfc17144296f@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-09-29Merge branch 'entropy'Linus Torvalds
Merge active entropy generation updates. This is admittedly partly "for discussion". We need to have a way forward for the boot time deadlocks where user space ends up waiting for more entropy, but no entropy is forthcoming because the system is entirely idle just waiting for something to happen. While this was triggered by what is arguably a user space bug with GDM/gnome-session asking for secure randomness during early boot, when they didn't even need any such truly secure thing, the issue ends up being that our "getrandom()" interface is prone to that kind of confusion, because people don't think very hard about whether they want to block for sufficient amounts of entropy. The approach here-in is to decide to not just passively wait for entropy to happen, but to start actively collecting it if it is missing. This is not necessarily always possible, but if the architecture has a CPU cycle counter, there is a fair amount of noise in the exact timings of reasonably complex loads. We may end up tweaking the load and the entropy estimates, but this should be at least a reasonable starting point. As part of this, we also revert the revert of the ext4 IO pattern improvement that ended up triggering the reported lack of external entropy. * getrandom() active entropy waiting: Revert "Revert "ext4: make __ext4_get_inode_loc plug"" random: try to actively add entropy rather than passively wait for it
2019-09-29Revert "Revert "ext4: make __ext4_get_inode_loc plug""Linus Torvalds
This reverts commit 72dbcf72156641fde4d8ea401e977341bfd35a05. Instead of waiting forever for entropy that may just not happen, we now try to actively generate entropy when required, and are thus hopefully avoiding the problem that caused the nice ext4 IO pattern fix to be reverted. So revert the revert. Cc: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Added new ext4 debugging ioctls to allow userspace to get information about the state of the extent status cache. Dropped workaround for pre-1970 dates which were encoded incorrectly in pre-4.4 kernels. Since both the kernel correctly generates, and e2fsck detects and fixes this issue for the past four years, it'e time to drop the workaround. (Also, it's not like files with dates in the distant past were all that common in the first place.) A lot of miscellaneous bug fixes and cleanups, including some ext4 Documentation fixes. Also included are two minor bug fixes in fs/unicode" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (21 commits) unicode: make array 'token' static const, makes object smaller unicode: Move static keyword to the front of declarations ext4: add missing bigalloc documentation. ext4: fix kernel oops caused by spurious casefold flag ext4: fix integer overflow when calculating commit interval ext4: use percpu_counters for extent_status cache hits/misses ext4: fix potential use after free after remounting with noblock_validity jbd2: add missing tracepoint for reserved handle ext4: fix punch hole for inline_data file systems ext4: rework reserved cluster accounting when invalidating pages ext4: documentation fixes ext4: treat buffers with write errors as containing valid data ext4: fix warning inside ext4_convert_unwritten_extents_endio ext4: set error return correctly when ext4_htree_store_dirent fails ext4: drop legacy pre-1970 encoding workaround ext4: add new ioctl EXT4_IOC_GET_ES_CACHE ext4: add a new ioctl EXT4_IOC_GETSTATE ext4: add a new ioctl EXT4_IOC_CLEAR_ES_CACHE jbd2: flush_descriptor(): Do not decrease buffer head's ref count ext4: remove unnecessary error check ...
2019-09-19Merge tag 'y2038-vfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground Pull y2038 vfs updates from Arnd Bergmann: "Add inode timestamp clamping. This series from Deepa Dinamani adds a per-superblock minimum/maximum timestamp limit for a file system, and clamps timestamps as they are written, to avoid random behavior from integer overflow as well as having different time stamps on disk vs in memory. At mount time, a warning is now printed for any file system that can represent current timestamps but not future timestamps more than 30 years into the future, similar to the arbitrary 30 year limit that was added to settimeofday(). This was picked as a compromise to warn users to migrate to other file systems (e.g. ext4 instead of ext3) when they need the file system to survive beyond 2038 (or similar limits in other file systems), but not get in the way of normal usage" * tag 'y2038-vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground: ext4: Reduce ext4 timestamp warnings isofs: Initialize filesystem timestamp ranges pstore: fs superblock limits fs: omfs: Initialize filesystem timestamp ranges fs: hpfs: Initialize filesystem timestamp ranges fs: ceph: Initialize filesystem timestamp ranges fs: sysv: Initialize filesystem timestamp ranges fs: affs: Initialize filesystem timestamp ranges fs: fat: Initialize filesystem timestamp ranges fs: cifs: Initialize filesystem timestamp ranges fs: nfs: Initialize filesystem timestamp ranges ext4: Initialize timestamps limits 9p: Fill min and max timestamps in sb fs: Fill in max and min timestamps in superblock utimes: Clamp the timestamps before update mount: Add mount warning for impending timestamp expiry timestamp_truncate: Replace users of timespec64_trunc vfs: Add timestamp_truncate() api vfs: Add file timestamp range support
2019-09-18Merge tag 'fsverity-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt Pull fs-verity support from Eric Biggers: "fs-verity is a filesystem feature that provides Merkle tree based hashing (similar to dm-verity) for individual readonly files, mainly for the purpose of efficient authenticity verification. This pull request includes: (a) The fs/verity/ support layer and documentation. (b) fs-verity support for ext4 and f2fs. Compared to the original fs-verity patchset from last year, the UAPI to enable fs-verity on a file has been greatly simplified. Lots of other things were cleaned up too. fs-verity is planned to be used by two different projects on Android; most of the userspace code is in place already. Another userspace tool ("fsverity-utils"), and xfstests, are also available. e2fsprogs and f2fs-tools already have fs-verity support. Other people have shown interest in using fs-verity too. I've tested this on ext4 and f2fs with xfstests, both the existing tests and the new fs-verity tests. This has also been in linux-next since July 30 with no reported issues except a couple minor ones I found myself and folded in fixes for. Ted and I will be co-maintaining fs-verity" * tag 'fsverity-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: f2fs: add fs-verity support ext4: update on-disk format documentation for fs-verity ext4: add fs-verity read support ext4: add basic fs-verity support fs-verity: support builtin file signatures fs-verity: add SHA-512 support fs-verity: implement FS_IOC_MEASURE_VERITY ioctl fs-verity: implement FS_IOC_ENABLE_VERITY ioctl fs-verity: add data verification hooks for ->readpages() fs-verity: add the hook for file ->setattr() fs-verity: add the hook for file ->open() fs-verity: add inode and superblock fields fs-verity: add Kconfig and the helper functions for hashing fs: uapi: define verity bit for FS_IOC_GETFLAGS fs-verity: add UAPI header fs-verity: add MAINTAINERS file entry fs-verity: add a documentation file
2019-09-18Merge tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscryptLinus Torvalds
Pull fscrypt updates from Eric Biggers: "This is a large update to fs/crypto/ which includes: - Add ioctls that add/remove encryption keys to/from a filesystem-level keyring. These fix user-reported issues where e.g. an encrypted home directory can break NetworkManager, sshd, Docker, etc. because they don't get access to the needed keyring. These ioctls also provide a way to lock encrypted directories that doesn't use the vm.drop_caches sysctl, so is faster, more reliable, and doesn't always need root. - Add a new encryption policy version ("v2") which switches to a more standard, secure, and flexible key derivation function, and starts verifying that the correct key was supplied before using it. The key derivation improvement is needed for its own sake as well as for ongoing feature work for which the current way is too inflexible. Work is in progress to update both Android and the 'fscrypt' userspace tool to use both these features. (Working patches are available and just need to be reviewed+merged.) Chrome OS will likely use them too. This has also been tested on ext4, f2fs, and ubifs with xfstests -- both the existing encryption tests, and the new tests for this. This has also been in linux-next since Aug 16 with no reported issues. I'm also using an fscrypt v2-encrypted home directory on my personal desktop" * tag 'fscrypt-for-linus' of git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt: (27 commits) ext4 crypto: fix to check feature status before get policy fscrypt: document the new ioctls and policy version ubifs: wire up new fscrypt ioctls f2fs: wire up new fscrypt ioctls ext4: wire up new fscrypt ioctls fscrypt: require that key be added when setting a v2 encryption policy fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY_ALL_USERS ioctl fscrypt: allow unprivileged users to add/remove keys for v2 policies fscrypt: v2 encryption policy support fscrypt: add an HKDF-SHA512 implementation fscrypt: add FS_IOC_GET_ENCRYPTION_KEY_STATUS ioctl fscrypt: add FS_IOC_REMOVE_ENCRYPTION_KEY ioctl fscrypt: add FS_IOC_ADD_ENCRYPTION_KEY ioctl fscrypt: rename keyinfo.c to keysetup.c fscrypt: move v1 policy key setup to keysetup_v1.c fscrypt: refactor key setup code in preparation for v2 policies fscrypt: rename fscrypt_master_key to fscrypt_direct_key fscrypt: add ->ci_inode to fscrypt_info fscrypt: use FSCRYPT_* definitions, not FS_* fscrypt: use FSCRYPT_ prefix for uapi constants ...
2019-09-15Revert "ext4: make __ext4_get_inode_loc plug"Linus Torvalds
This reverts commit b03755ad6f33b7b8cd7312a3596a2dbf496de6e7. This is sad, and done for all the wrong reasons. Because that commit is good, and does exactly what it says: avoids a lot of small disk requests for the inode table read-ahead. However, it turns out that it causes an entirely unrelated problem: the getrandom() system call was introduced back in 2014 by commit c6e9d6f38894 ("random: introduce getrandom(2) system call"), and people use it as a convenient source of good random numbers. But part of the current semantics for getrandom() is that it waits for the entropy pool to fill at least partially (unlike /dev/urandom). And at least ArchLinux apparently has a systemd that uses getrandom() at boot time, and the improvements in IO patterns means that existing installations suddenly start hanging, waiting for entropy that will never happen. It seems to be an unlucky combination of not _quite_ enough entropy, together with a particular systemd version and configuration. Lennart says that the systemd-random-seed process (which is what does this early access) is supposed to not block any other boot activity, but sadly that doesn't actually seem to be the case (possibly due bogus dependencies on cryptsetup for encrypted swapspace). The correct fix is to fix getrandom() to not block when it's not appropriate, but that fix is going to take a lot more discussion. Do we just make it act like /dev/urandom by default, and add a new flag for "wait for entropy"? Do we add a boot-time option? Or do we just limit the amount of time it will wait for entropy? So in the meantime, we do the revert to give us time to discuss the eventual fix for the fundamental problem, at which point we can re-apply the ext4 inode table access optimization. Reported-by: Ahmed S. Darwish <darwish.07@gmail.com> Cc: Ted Ts'o <tytso@mit.edu> Cc: Willy Tarreau <w@1wt.eu> Cc: Alexander E. Patrakov <patrakov@gmail.com> Cc: Lennart Poettering <mzxreary@0pointer.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-04ext4: Reduce ext4 timestamp warningsDeepa Dinamani
When ext4 file systems were created intentionally with 128 byte inodes, the rate-limited warning of eventual possible timestamp overflow are still emitted rather frequently. Remove the warning for now. Discussion for whether any warning is needed, and where it should be emitted, can be found at https://lore.kernel.org/lkml/1567523922.5576.57.camel@lca.pw/. I can post a separate follow-up patch after the conclusion. Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2019-09-03ext4: fix kernel oops caused by spurious casefold flagTheodore Ts'o
If an directory has the a casefold flag set without the casefold feature set, s_encoding will not be initialized, and this will cause the kernel to dereference a NULL pointer. In addition to adding checks to avoid these kernel oops, attempts to load inodes with the casefold flag when the casefold feature is not enable will cause the file system to be declared corrupted. Signed-off-by: Theodore Ts'o <tytso@mit.edu>