summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-01-17ext4: drop ext4_kvmalloc()Theodore Ts'o
As Jan pointed out[1], as of commit 81378da64de ("jbd2: mark the transaction context with the scope GFP_NOFS context") we use memalloc_nofs_{save,restore}() while a jbd2 handle is active. So ext4_kvmalloc() so we can call allocate using GFP_NOFS is no longer necessary. [1] https://lore.kernel.org/r/20200109100007.GC27035@quack2.suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200116155031.266620-1-tytso@mit.edu Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: Add EXT4_IOC_FSGETXATTR/EXT4_IOC_FSSETXATTR to compat_ioctlMartijn Coenen
These are backed by 'struct fsxattr' which has the same size on all architectures. Signed-off-by: Martijn Coenen <maco@android.com> Link: https://lore.kernel.org/r/20191227134639.35869-1-maco@android.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: remove unused macro MPAGE_DA_EXTENT_TAILRitesh Harjani
Remove unused macro MPAGE_DA_EXTENT_TAIL which is no more used after below commit 4e7ea81d ("ext4: restructure writeback path") Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200101095137.25656-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: add missing braces in ext4_ext_drop_refs()Eric Biggers
For clarity, add braces to the loop in ext4_ext_drop_refs(). Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-9-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: fix some nonstandard indentation in extents.cEric Biggers
Clean up some code that was using 2-character indents. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-8-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove obsolete comment from ext4_can_extents_be_merged()Eric Biggers
Support for unwritten extents was added to ext4 a long time ago, so remove a misleading comment that says they're a future feature. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-7-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: fix documentation for ext4_ext_try_to_merge()Eric Biggers
Don't mention the nonexistent return value, and mention both types of merges that are attempted. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-6-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: make some functions static in extents.cEric Biggers
Make the following functions static since they're only used in extents.c: __ext4_ext_dirty() ext4_can_extents_be_merged() ext4_collapse_range() ext4_insert_range() Also remove the prototype for ext4_ext_writepage_trans_blocks(), as this function is not defined anywhere. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-5-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove redundant S_ISREG() checks from ext4_fallocate()Eric Biggers
ext4_fallocate() is only used in the file_operations for regular files. Also, the VFS only allows fallocate() on regular files and block devices, but block devices always use blkdev_fallocate(). For both of these reasons, S_ISREG() is always true in ext4_fallocate(). Therefore the S_ISREG() checks in ext4_zero_range(), ext4_collapse_range(), ext4_insert_range(), and ext4_punch_hole() are redundant. Remove them. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-4-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: clean up len and offset checks in ext4_fallocate()Eric Biggers
- Fix some comments. - Consistently access i_size directly rather than using i_size_read(), since in all relevant cases we're under inode_lock(). - Simplify the alignment checks by using the IS_ALIGNED() macro. - In ext4_insert_range(), do the check against s_maxbytes in a way that is safe against signed overflow. (This doesn't currently matter for ext4 due to ext4's limited max file size, but this is something other filesystems have gotten wrong. We might as well do it safely.) Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-3-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove ext4_{ind,ext}_calc_metadata_amount()Eric Biggers
Remove the ext4_ind_calc_metadata_amount() and ext4_ext_calc_metadata_amount() functions, which have been unused since commit 71d4f7d03214 ("ext4: remove metadata reservation checks"). Also remove the i_da_metadata_calc_last_lblock and i_da_metadata_calc_len fields from struct ext4_inode_info, as these were only used by these removed functions. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231180444.46586-2-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove unneeded check for error allocating bio_post_read_ctxEric Biggers
Since allocating an object from a mempool never fails when __GFP_DIRECT_RECLAIM (which is included in GFP_NOFS) is set, the check for failure to allocate a bio_post_read_ctx is unnecessary. Remove it. Also remove the redundant assignment to ->bi_private. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181256.47770-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: fix deadlock allocating bio_post_read_ctx from mempoolEric Biggers
Without any form of coordination, any case where multiple allocations from the same mempool are needed at a time to make forward progress can deadlock under memory pressure. This is the case for struct bio_post_read_ctx, as one can be allocated to decrypt a Merkle tree page during fsverity_verify_bio(), which itself is running from a post-read callback for a data bio which has its own struct bio_post_read_ctx. Fix this by freeing the first bio_post_read_ctx before calling fsverity_verify_bio(). This works because verity (if enabled) is always the last post-read step. This deadlock can be reproduced by trying to read from an encrypted verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching mempool_alloc() to pretend that pool->alloc() always fails. Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually hit this bug in practice would require reading from lots of encrypted verity files at the same time. But it's theoretically possible, as N available objects isn't enough to guarantee forward progress when > N/2 threads each need 2 objects at a time. Fixes: 22cfe4b48ccb ("ext4: add fs-verity read support") Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: fix deadlock allocating crypto bounce page from mempoolEric Biggers
ext4_writepages() on an encrypted file has to encrypt the data, but it can't modify the pagecache pages in-place, so it encrypts the data into bounce pages and writes those instead. All bounce pages are allocated from a mempool using GFP_NOFS. This is not correct use of a mempool, and it can deadlock. This is because GFP_NOFS includes __GFP_DIRECT_RECLAIM, which enables the "never fail" mode for mempool_alloc() where a failed allocation will fall back to waiting for one of the preallocated elements in the pool. But since this mode is used for all a bio's pages and not just the first, it can deadlock waiting for pages already in the bio to be freed. This deadlock can be reproduced by patching mempool_alloc() to pretend that pool->alloc() always fails (so that it always falls back to the preallocations), and then creating an encrypted file of size > 128 KiB. Fix it by only using GFP_NOFS for the first page in the bio. For subsequent pages just use GFP_NOWAIT, and if any of those fail, just submit the bio and start a new one. This will need to be fixed in f2fs too, but that's less straightforward. Fixes: c9af28fdd449 ("ext4 crypto: don't let data integrity writebacks fail with ENOMEM") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191231181149.47619-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: Delete ext4_kvzvalloc()Naoto Kobayashi
Since we're not using ext4_kvzalloc(), delete this function. Signed-off-by: Naoto Kobayashi <naoto.kobayashi4c@gmail.com> Link: https://lore.kernel.org/r/20191227080523.31808-2-naoto.kobayashi4c@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: re-enable extent zeroout optimization on encrypted filesEric Biggers
For encrypted files, commit 36086d43f657 ("ext4 crypto: fix bugs in ext4_encrypted_zeroout()") disabled the optimization where when a write occurs to the middle of an unwritten extent, the head and/or tail of the extent (when they aren't too large) are zeroed out, turned into an initialized extent, and merged with the part being written to. This optimization helps prevent fragmentation of the extent tree. However, disabling this optimization also made fscrypt_zeroout_range() nearly impossible to test, as now it's only reachable via the very rare case in ext4_split_extent_at() where allocating a new extent tree block fails due to ENOSPC. 'gce-xfstests -c ext4/encrypt -g auto' doesn't even hit this at all. It's preferable to avoid really rare cases that are hard to test. That commit also cited data corruption in xfstest generic/127 as a reason to disable the extent zeroout optimization, but that's no longer reproducible anymore. It also cited fscrypt_zeroout_range() having poor performance, but I've written a patch to fix that. Therefore, re-enable the extent zeroout optimization on encrypted files. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226161114.53606-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: only use fscrypt_zeroout_range() on regular filesEric Biggers
fscrypt_zeroout_range() is only for encrypted regular files, not for encrypted directories or symlinks. Fortunately, currently it seems it's never called on non-regular files. But to be safe ext4 should explicitly check S_ISREG() before calling it. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226161022.53490-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: allow ZERO_RANGE on encrypted filesEric Biggers
When ext4 encryption support was first added, ZERO_RANGE was disallowed, supposedly because test failures (e.g. ext4/001) were seen when enabling it, and at the time there wasn't enough time/interest to debug it. However, there's actually no reason why ZERO_RANGE can't work on encrypted files. And it fact it *does* work now. Whole blocks in the zeroed range are converted to unwritten extents, as usual; encryption makes no difference for that part. Partial blocks are zeroed in the pagecache and then ->writepages() encrypts those blocks as usual. ext4_block_zero_page_range() handles reading and decrypting the block if needed before actually doing the pagecache write. Also, f2fs has always supported ZERO_RANGE on encrypted files. As far as I can tell, the reason that ext4/001 was failing in v4.1 was actually because of one of the bugs fixed by commit 36086d43f657 ("ext4 crypto: fix bugs in ext4_encrypted_zeroout()"). The bug made ext4_encrypted_zeroout() always return a positive value, which caused unwritten extents in encrypted files to sometimes not be marked as initialized after being written to. This bug was not actually in ZERO_RANGE; it just happened to trigger during the extents manipulation done in ext4/001 (and probably other tests too). So, let's enable ZERO_RANGE on encrypted files on ext4. Tested with: gce-xfstests -c ext4/encrypt -g auto gce-xfstests -c ext4/encrypt_1k -g auto Got the same set of test failures both with and without this patch. But with this patch 6 fewer tests are skipped: ext4/001, generic/008, generic/009, generic/033, generic/096, and generic/511. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226154216.4808-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: handle decryption error in __ext4_block_zero_page_range()Eric Biggers
fscrypt_decrypt_pagecache_blocks() can fail, because it uses skcipher_request_alloc(), which uses kmalloc(), which can fail; and also because it calls crypto_skcipher_decrypt(), which can fail depending on the driver that actually implements the crypto. Therefore it's not appropriate to WARN on decryption error in __ext4_block_zero_page_range(). Remove the WARN and just handle the error instead. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226154105.4704-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17docs: ext4.rst: add encryption and verity to features listEric Biggers
Mention encryption and verity in the ext4 features list. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226154007.4569-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: remove unnecessary selections from EXT3_FSEric Biggers
Since EXT3_FS already selects EXT4_FS, there's no reason for it to redundantly select all the selections of EXT4_FS -- notwithstanding the comments that claim otherwise. Remove these redundant selections to avoid confusion. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191226153920.4466-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: use true,false for bool variablezhengbin
Fixes coccicheck warning: fs/ext4/extents.c:5271:6-12: WARNING: Assignment of 0/1 to bool variable fs/ext4/extents.c:5287:4-10: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/1577241959-138695-1-git-send-email-zhengbin13@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: uninline ext4_inode_journal_mode()Eric Biggers
Determining an inode's journaling mode has gotten more complicated over time. Move ext4_inode_journal_mode() from an inline function into ext4_jbd2.c to reduce the compiled code size. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191209233602.117778-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove unnecessary ifdefs in htree_dirblock_to_tree()Eric Biggers
The ifdefs for CONFIG_FS_ENCRYPTION in htree_dirblock_to_tree() are unnecessary, as the called functions are already stubbed out when !CONFIG_FS_ENCRYPTION. Remove them. Signed-off-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20191209213225.18477-1-ebiggers@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz>
2020-01-17ext4: remove unnecessary assignment in ext4_htree_store_dirent()Chengguang Xu
We have allocated memory using kzalloc() so don't have to set 0 again in last byte. Signed-off-by: Chengguang Xu <cgxu519@mykernel.net> Link: https://lore.kernel.org/r/20191206054317.3107-1-cgxu519@mykernel.net Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-01-17ext4: avoid fetching btime in ext4_getattr() unless requestedTheodore Ts'o
Linus observed that an allmodconfig build which does a lot of stat(2) calls that ext4_getattr() was a noticeable (1%) amount of CPU time, due to the cache line for i_extra_isize getting pulled in. Since the normal stat system call doesn't return btime, it's a complete waste. So only calculate btime when it is explicitly requested. [ Fixed to check against request_mask instead of query_flags. ] Link: https://lore.kernel.org/r/CAHk-=wivmk_j6KbTX+Er64mLrG8abXZo0M10PNdAnHc8fWXfsQ@mail.gmail.com Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: Optimize ext4 DIO overwritesJan Kara
Currently we start transaction for mapping every extent for writing using direct IO. This is unnecessary when we know we are overwriting already allocated blocks and the overhead of starting a transaction can be significant especially for multithreaded workloads doing small writes. Use iomap operations that avoid starting a transaction for direct IO overwrites. This improves throughput of 4k random writes - fio jobfile: [global] rw=randrw norandommap=1 invalidate=0 bs=4k numjobs=16 time_based=1 ramp_time=30 runtime=120 group_reporting=1 ioengine=psync direct=1 size=16G filename=file1.0.0:file1.0.1:file1.0.2:file1.0.3:file1.0.4:file1.0.5:file1.0.6:file1.0.7:file1.0.8:file1.0.9:file1.0.10:file1.0.11:file1.0.12:file1.0.13:file1.0.14:file1.0.15:file1.0.16:file1.0.17:file1.0.18:file1.0.19:file1.0.20:file1.0.21:file1.0.22:file1.0.23:file1.0.24:file1.0.25:file1.0.26:file1.0.27:file1.0.28:file1.0.29:file1.0.30:file1.0.31 file_service_type=random nrfiles=32 from 3018MB/s to 4059MB/s in my test VM running test against simulated pmem device (note that before iomap conversion, this workload was able to achieve 3708MB/s because old direct IO path avoided transaction start for overwrites as well). For dax, the win is even larger improving throughput from 3042MB/s to 4311MB/s. Reported-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191218174433.19380-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: export information about first/last errors via /sys/fs/ext4/<dev>Theodore Ts'o
Make {first,last}_error_{ino,block,line,func,errcode} available via sysfs. Also add a missing newline for {first,last}_error_time. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: simulate various I/O and checksum errors when reading metadataTheodore Ts'o
This allows us to test various error handling code paths Link: https://lore.kernel.org/r/20191209012317.59398-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26ext4: save the error code which triggered an ext4_error() in the superblockTheodore Ts'o
This allows the cause of an ext4_error() report to be categorized based on whether it was triggered due to an I/O error, or an memory allocation error, or other possible causes. Most errors are caused by a detected file system inconsistency, so the default code stored in the superblock will be EXT4_ERR_EFSCORRUPTED. Link: https://lore.kernel.org/r/20191204032335.7683-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-26Merge branch 'rk/inode_lock' into devTheodore Ts'o
These are ilock patches which helps improve the current inode lock scalabiliy problem in ext4 DIO mixed read/write workload case. The problem was first reported by Joseph [1]. This should help improve mixed read/write workload cases for databases which use directIO. These patches are based upon upstream discussion with Jan Kara & Joseph [2]. The problem really is that in case of DIO overwrites, we start with a exclusive lock and then downgrade it later to shared lock. This causes a scalability problem in case of mixed DIO read/write workload case. i.e. if we have any ongoing DIO reads and then comes a DIO writes, (since writes starts with excl. inode lock) then it has to wait until the shared lock is released (which only happens when DIO read is completed). Same is true for vice versa as well. The same can be easily observed with perf-tools trace analysis [3]. For more details, including performance numbers, please see [4]. [1] https://lore.kernel.org/linux-ext4/1566871552-60946-4-git-send-email-joseph.qi@linux.alibaba.com/ [2] https://lore.kernel.org/linux-ext4/20190910215720.GA7561@quack2.suse.cz/ [3] https://raw.githubusercontent.com/riteshharjani/LinuxStudy/master/ext4/perf.report [4] https://lore.kernel.org/r/20191212055557.11151-1-riteshh@linux.ibm.com
2019-12-22ext4: Move to shared i_rwsem even without dioread_nolock mount optRitesh Harjani
We were using shared locking only in case of dioread_nolock mount option in case of DIO overwrites. This mount condition is not needed anymore with current code, since:- 1. No race between buffered writes & DIO overwrites. Since buffIO writes takes exclusive lock & DIO overwrites will take shared locking. Also DIO path will make sure to flush and wait for any dirty page cache data. 2. No race between buffered reads & DIO overwrites, since there is no block allocation that is possible with DIO overwrites. So no stale data exposure should happen. Same is the case between DIO reads & DIO overwrites. 3. Also other paths like truncate is protected, since we wait there for any DIO in flight to be over. Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-4-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: Start with shared i_rwsem in case of DIO instead of exclusiveRitesh Harjani
Earlier there was no shared lock in DIO read path. But this patch (16c54688592ce: ext4: Allow parallel DIO reads) simplified some of the locking mechanism while still allowing for parallel DIO reads by adding shared lock in inode DIO read path. But this created problem with mixed read/write workload. It is due to the fact that in DIO path, we first start with exclusive lock and only when we determine that it is a ovewrite IO, we downgrade the lock. This causes the problem, since we still have shared locking in DIO reads. So, this patch tries to fix this issue by starting with shared lock and then switching to exclusive lock only when required based on ext4_dio_write_checks(). Other than that, it also simplifies below cases:- 1. Simplified ext4_unaligned_aio API to ext4_unaligned_io. Previous API was abused in the sense that it was not really checking for AIO anywhere also it used to check for extending writes. So this API was renamed and simplified to ext4_unaligned_io() which actully only checks if the IO is really unaligned. Now, in case of unaligned direct IO, iomap_dio_rw needs to do zeroing of partial block and that will require serialization against other direct IOs in the same block. So we take a exclusive inode lock for any unaligned DIO. In case of AIO we also need to wait for any outstanding IOs to complete so that conversion from unwritten to written is completed before anyone try to map the overlapping block. Hence we take exclusive inode lock and also wait for inode_dio_wait() for unaligned DIO case. Please note since we are anyway taking an exclusive lock in unaligned IO, inode_dio_wait() becomes a no-op in case of non-AIO DIO. 2. Added ext4_extending_io(). This checks if the IO is extending the file. 3. Added ext4_dio_write_checks(). In this we start with shared inode lock and only switch to exclusive lock if required. So in most cases with aligned, non-extending, dioread_nolock & overwrites, it tries to write with a shared lock. If not, then we restart the operation in ext4_dio_write_checks(), after acquiring exclusive lock. Reviewed-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-3-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: fix ext4_dax_read/write inode locking sequence for IOCB_NOWAITRitesh Harjani
Apparently our current rwsem code doesn't like doing the trylock, then lock for real scheme. So change our dax read/write methods to just do the trylock for the RWF_NOWAIT case. This seems to fix AIM7 regression in some scalable filesystems upto ~25% in some cases. Claimed in commit 942491c9e6d6 ("xfs: fix AIM7 regression") Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Matthew Bobrowski <mbobrowski@mbobrowski.org> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20191212055557.11151-2-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22ext4: treat buffers contining write errors as valid in ext4_sb_bread()Theodore Ts'o
In commit 7963e5ac9012 ("ext4: treat buffers with write errors as containing valid data") we missed changing ext4_sb_bread() to use ext4_buffer_uptodate(). So fix this oversight. Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-22Linux 5.5-rc3v5.5-rc3Linus Torvalds
2019-12-22Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs fixes from Al Viro: "Eric's s_inodes softlockup fixes + Jan's fix for recent regression from pipe rework" * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: fs: call fsnotify_sb_delete after evict_inodes fs: avoid softlockups in s_inodes iterators pipe: Fix bogus dereference in iov_iter_alignment()
2019-12-22Merge tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "Fix a few bugs that could lead to corrupt files, fsck complaints, and filesystem crashes: - Minor documentation fixes - Fix a file corruption due to read racing with an insert range operation. - Fix log reservation overflows when allocating large rt extents - Fix a buffer log item flags check - Don't allow administrators to mount with sunit= options that will cause later xfs_repair complaints about the root directory being suspicious because the fs geometry appeared inconsistent - Fix a non-static helper that should have been static" * tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Make the symbol 'xfs_rtalloc_log_count' static xfs: don't commit sunit/swidth updates to disk if that would cause repair failures xfs: split the sunit parameter update into two parts xfs: refactor agfl length computation function libxfs: resync with the userspace libxfs xfs: use bitops interface for buf log item AIL flag check xfs: fix log reservation overflows when allocating large rt extents xfs: stabilize insert range start boundary to avoid COW writeback race xfs: fix Sphinx documentation warning
2019-12-22Merge tag 'ext4_for_linus_stable' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bug fixes from Ted Ts'o: "Ext4 bug fixes, including a regression fix" * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: clarify impact of 'commit' mount option ext4: fix unused-but-set-variable warning in ext4_add_entry() jbd2: fix kernel-doc notation warning ext4: use RCU API in debug_print_tree ext4: validate the debug_want_extra_isize mount option at parse time ext4: reserve revoke credits in __ext4_new_inode ext4: unlock on error in ext4_expand_extra_isize() ext4: optimize __ext4_check_dir_entry() ext4: check for directory entries too close to block end ext4: fix ext4_empty_dir() for directories with holes
2019-12-22Merge tag 'block-5.5-20191221' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Let's try this one again, this time without the compat_ioctl changes. We've got those fixed up, but that can go out next week. This contains: - block queue flush lockdep annotation (Bart) - Type fix for bsg_queue_rq() (Bart) - Three dasd fixes (Stefan, Jan) - nbd deadlock fix (Mike) - Error handling bio user map fix (Yang) - iocost fix (Tejun) - sbitmap waitqueue addition fix that affects the kyber IO scheduler (David)" * tag 'block-5.5-20191221' of git://git.kernel.dk/linux-block: sbitmap: only queue kyber's wait callback if not already active block: fix memleak when __blk_rq_map_user_iov() is failed s390/dasd: fix typo in copyright statement s390/dasd: fix memleak in path handling error case s390/dasd/cio: Interpret ccw_device_get_mdc return value correctly block: Fix a lockdep complaint triggered by request queue flushing block: Fix the type of 'sts' in bsg_queue_rq() block: end bio with BLK_STS_AGAIN in case of non-mq devs and REQ_NOWAIT nbd: fix shutdown and recv work deadlock v2 iocost: over-budget forced IOs should schedule async delay
2019-12-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: "PPC: - Fix a bug where we try to do an ultracall on a system without an ultravisor KVM: - Fix uninitialised sysreg accessor - Fix handling of demand-paged device mappings - Stop spamming the console on IMPDEF sysregs - Relax mappings of writable memslots - Assorted cleanups MIPS: - Now orphan, James Hogan is stepping down x86: - MAINTAINERS change, so long Radim and thanks for all the fish - supported CPUID fixes for AMD machines without SPEC_CTRL" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: MAINTAINERS: remove Radim from KVM maintainers MAINTAINERS: Orphan KVM for MIPS kvm: x86: Host feature SSBD doesn't imply guest feature AMD_SSBD kvm: x86: Host feature SSBD doesn't imply guest feature SPEC_CTRL_SSBD KVM: PPC: Book3S HV: Don't do ultravisor calls on systems without ultravisor KVM: arm/arm64: Properly handle faulting of device mappings KVM: arm64: Ensure 'params' is initialised when looking up sys register KVM: arm/arm64: Remove excessive permission check in kvm_arch_prepare_memory_region KVM: arm64: Don't log IMP DEF sysreg traps KVM: arm64: Sanely ratelimit sysreg messages KVM: arm/arm64: vgic: Use wrapper function to lock/unlock all vcpus in kvm_vgic_create() KVM: arm/arm64: vgic: Fix potential double free dist->spis in __kvm_vgic_destroy() KVM: arm/arm64: Get rid of unused arg in cpu_init_hyp_mode()
2019-12-22Merge tag 'riscv/for-v5.5-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Paul Walmsley: "Several fixes, and one cleanup, for RISC-V. Fixes: - Fix an error in a Kconfig file that resulted in an undefined Kconfig option "CONFIG_CONFIG_MMU" - Fix undefined Kconfig option "CONFIG_CONFIG_MMU" - Fix scratch register clearing in M-mode (affects nommu users) - Fix a mismerge on my part that broke the build for CONFIG_SPARSEMEM_VMEMMAP users Cleanup: - Move SiFive L2 cache-related code to drivers/soc, per request" * tag 'riscv/for-v5.5-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: move sifive_l2_cache.c to drivers/soc riscv: define vmemmap before pfn_to_page calls riscv: fix scratch register clearing in M-mode. riscv: Fix use of undefined config option CONFIG_CONFIG_MMU
2019-12-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso, including adding a missing ipv6 match description. 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi Bhat. 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet. 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold. 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien. 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul Chaignon. 7) Multicast MAC limit test is off by one in qede, from Manish Chopra. 8) Fix established socket lookup race when socket goes from TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening RCU grace period. From Eric Dumazet. 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet. 10) Fix active backup transition after link failure in bonding, from Mahesh Bandewar. 11) Avoid zero sized hash table in gtp driver, from Taehee Yoo. 12) Fix wrong interface passed to ->mac_link_up(), from Russell King. 13) Fix DSA egress flooding settings in b53, from Florian Fainelli. 14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost. 15) Fix double free in dpaa2-ptp code, from Ioana Ciornei. 16) Reject invalid MTU values in stmmac, from Jose Abreu. 17) Fix refcount leak in error path of u32 classifier, from Davide Caratti. 18) Fix regression causing iwlwifi firmware crashes on boot, from Anders Kaseorg. 19) Fix inverted return value logic in llc2 code, from Chan Shu Tak. 20) Disable hardware GRO when XDP is attached to qede, frm Manish Chopra. 21) Since we encode state in the low pointer bits, dst metrics must be at least 4 byte aligned, which is not necessarily true on m68k. Add annotations to fix this, from Geert Uytterhoeven. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits) sfc: Include XDP packet headroom in buffer step size. sfc: fix channel allocation with brute force net: dst: Force 4-byte alignment of dst_metrics selftests: pmtu: fix init mtu value in description hv_netvsc: Fix unwanted rx_table reset net: phy: ensure that phy IDs are correctly typed mod_devicetable: fix PHY module format qede: Disable hardware gro when xdp prog is installed net: ena: fix issues in setting interrupt moderation params in ethtool net: ena: fix default tx interrupt moderation interval net/smc: unregister ib devices in reboot_event net: stmmac: platform: Fix MDIO init for platforms without PHY llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c) net: hisilicon: Fix a BUG trigered by wrong bytes_compl net: dsa: ksz: use common define for tag len s390/qeth: don't return -ENOTSUPP to userspace s390/qeth: fix promiscuous mode after reset s390/qeth: handle error due to unsupported transport mode cxgb4: fix refcount init for TC-MQPRIO offload tc-testing: initial tdc selftests for cls_u32 ...
2019-12-22pipe: fix empty pipe check in pipe_write()Jan Stancek
LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb, with read side observing empty pipe and sleeping and write side running out of space and then sleeping as well. In this scenario there are 5 writers and 1 reader. Problem is that after pipe_write() reacquires pipe lock, it re-checks for empty pipe with potentially stale 'head' and doesn't wake up read side anymore. pipe->tail can advance beyond 'head', because there are multiple writers. Use pipe->head for empty pipe check after reacquiring lock to observe current state. Testing: With patch, LTP pipeio_1 ran successfully in loop for 1 hour. Without patch it hanged within a minute. Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic") Reported-by: Rachel Sibley <rasibley@redhat.com> Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-12-22Merge tag 'kvm-ppc-fixes-5.5-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into kvm-master PPC KVM fix for 5.5 - Fix a bug where we try to do an ultracall on a system without an ultravisor.
2019-12-22MAINTAINERS: remove Radim from KVM maintainersPaolo Bonzini
Radim's kernel.org email is bouncing, which I take as a signal that he is not really able to deal with KVM at this time. Make MAINTAINERS match the effective value of KVM's bus factor. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-22MAINTAINERS: Orphan KVM for MIPSJames Hogan
I haven't been active for 18 months, and don't have the hardware set up to test KVM for MIPS, so mark it as orphaned and remove myself as maintainer. Hopefully somebody from MIPS can pick this up. Signed-off-by: James Hogan <jhogan@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Paul Burton <paulburton@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: kvm@vger.kernel.org Cc: linux-mips@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-12-21ext4: clarify impact of 'commit' mount optionJan Kara
The description of 'commit' mount option dates back to ext3 times. Update the description to match current meaning for ext4. Reported-by: Paul Richards <paul.richards@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20191218111210.14161-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-21ext4: fix unused-but-set-variable warning in ext4_add_entry()Yunfeng Ye
Warning is found when compile with "-Wunused-but-set-variable": fs/ext4/namei.c: In function ‘ext4_add_entry’: fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used [-Wunused-but-set-variable] struct ext4_sb_info *sbi; ^~~ Fix this by moving the variable @sbi under CONFIG_UNICODE. Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-12-21Merge tag 'trace-v5.5-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing fixes from Steven Rostedt: - Fix memory leak on error path of process_system_preds() - Lock inversion fix with updating tgid recording option - Fix histogram compare function on big endian machines - Fix histogram trigger function on big endian machines - Make trace_printk() irq sync on init for kprobe selftest correctness * tag 'trace-v5.5-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: tracing: Fix endianness bug in histogram trigger samples/trace_printk: Wait for IRQ work to finish tracing: Fix lock inversion in trace_event_enable_tgid_record() tracing: Have the histogram compare functions convert to u64 first tracing: Avoid memory leak in process_system_preds()