summaryrefslogtreecommitdiff
path: root/fs/ext4/inode.c
AgeCommit message (Collapse)Author
2020-12-22ext4: drop ext4_handle_dirty_super()Jan Kara
The wrapper is now useless since it does what ext4_handle_dirty_metadata() does. Just remove it. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-9-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: protect superblock modifications with a buffer lockJan Kara
Protect all superblock modifications (including checksum computation) with a superblock buffer lock. That way we are sure computed checksum matches current superblock contents (a mismatch could cause checksum failures in nojournal mode or if an unjournalled superblock update races with a journalled one). Also we avoid modifying superblock contents while it is being written out (which can cause DIF/DIX failures if we are running in nojournal mode). Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-22ext4: remove unnecessary wbc parameter from ext4_bio_write_pageLei Chen
ext4_bio_write_page does not need wbc parameter, since its parameter io contains the io_wbc field. The io::io_wbc is initialized by ext4_io_submit_init which is called in ext4_writepages and ext4_writepage functions prior to ext4_bio_write_page. Therefor, when ext4_bio_write_page is called, wbc info has already been included in io parameter. Signed-off-by: Lei Chen <lennychen@tencent.com> Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: make ext4_abort() use __ext4_error()Jan Kara
The only difference between __ext4_abort() and __ext4_error() is that the former one ignores errors=continue mount option. Unify the code to reduce duplication. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-17ext4: fix deadlock with fs freezing and EA inodesJan Kara
Xattr code using inodes with large xattr data can end up dropping last inode reference (and thus deleting the inode) from places like ext4_xattr_set_entry(). That function is called with transaction started and so ext4_evict_inode() can deadlock against fs freezing like: CPU1 CPU2 removexattr() freeze_super() vfs_removexattr() ext4_xattr_set() handle = ext4_journal_start() ... ext4_xattr_set_entry() iput(old_ea_inode) ext4_evict_inode(old_ea_inode) sb->s_writers.frozen = SB_FREEZE_FS; sb_wait_write(sb, SB_FREEZE_FS); ext4_freeze() jbd2_journal_lock_updates() -> blocks waiting for all handles to stop sb_start_intwrite() -> blocks as sb is already in SB_FREEZE_FS state Generally it is advisable to delete inodes from a separate transaction as it can consume quite some credits however in this case it would be quite clumsy and furthermore the credits for inode deletion are quite limited and already accounted for. So just tweak ext4_evict_inode() to avoid freeze protection if we have transaction already started and thus it is not really needed anyway. Cc: stable@vger.kernel.org Fixes: dec214d00e0d ("ext4: xattr inode deduplication") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201127110649.24730-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-12-03ext4: use ASSERT() to replace J_ASSERT()Chunguang Xu
There are currently multiple forms of assertion, such as J_ASSERT(). J_ASEERT() is provided for the jbd module, which is a public module. Maybe we should use custom ASSERT() like other file systems, such as xfs, which would be better. Signed-off-by: Chunguang Xu <brookxu@tencent.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/1604764698-4269-1-git-send-email-brookxu@tencent.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: make s_mount_flags modifications atomicHarshad Shirwadkar
Fast commit file system states are recorded in sbi->s_mount_flags. Fast commit expects these bit manipulations to be atomic. This patch adds helpers to make those modifications atomic. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-21-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: fix inode dirty check in case of fast commitsHarshad Shirwadkar
In case of fast commits, determine if the inode is dirty by checking if the inode is on fast commit list. This also helps us get rid of ext4_inode_info.i_fc_committed_subtid field. Reported-by: Andrea Righi <andrea.righi@canonical.com> Tested-by: Andrea Righi <andrea.righi@canonical.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-18-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: fixup ext4_fc_track_* functions' signatureHarshad Shirwadkar
Firstly, pass handle to all ext4_fc_track_* functions and use transaction id found in handle->h_transaction->h_tid for tracking fast commit updates. Secondly, don't pass inode to ext4_fc_track_link/create/unlink functions. inode can be found inside these functions as d_inode(dentry). However, rename path is an exeception. That's because in that case, we need inode that's not same as d_inode(dentry). To handle that, add a couple of low-level wrapper functions that take inode and dentry as arguments. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-5-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-11-06ext4: mark fc ineligible if inode gets evictied due to mem pressureHarshad Shirwadkar
If inode gets evicted due to memory pressure, we have to remove it from the fast commit list. However, that inode may have uncommitted changes that fast commits will lose. So, just fall back to full commits in this case. Also, rename the fast commit ineligiblity reason from "EXT4_FC_REASON_MEM" to "EXT4_FC_REASON_MEM_NOMEM" for better expression. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201106035911.1942128-3-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: fix mmap write protection for data=journal modeJan Kara
Commit afb585a97f81 "ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()") added calls ext4_jbd2_inode_add_write() to track inode ranges whose mappings need to get write-protected during transaction commits. However the added calls use wrong start of a range (0 instead of page offset) and so write protection is not necessarily effective. Use correct range start to fix the problem. Fixes: afb585a97f81 ("ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()") Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201027132751.29858-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-28ext4: properly check for dirty state in ext4_inode_datasync_dirty()Andrea Righi
ext4_inode_datasync_dirty() needs to return 'true' if the inode is dirty, 'false' otherwise, but the logic seems to be incorrectly changed by commit aa75f4d3daae ("ext4: main fast-commit commit path"). This introduces a problem with swap files that are always failing to be activated, showing this error in dmesg: [ 34.406479] swapon: file is not committed Simple test case to reproduce the problem: # fallocate -l 8G swapfile # chmod 0600 swapfile # mkswap swapfile # swapon swapfile Fix the logic to return the proper state of the inode. Link: https://lore.kernel.org/lkml/20201024131333.GA32124@xps-13-7390 Fixes: 8016e29f4362 ("ext4: fast commit recovery path") Signed-off-by: Andrea Righi <andrea.righi@canonical.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201027044915.2553163-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: fix invalid inode checksumLuo Meng
During the stability test, there are some errors: ext4_lookup:1590: inode #6967: comm fsstress: iget: checksum invalid. If the inode->i_iblocks too big and doesn't set huge file flag, checksum will not be recalculated when update the inode information to it's buffer. If other inode marks the buffer dirty, then the inconsistent inode will be flushed to disk. Fix this problem by checking i_blocks in advance. Cc: stable@kernel.org Signed-off-by: Luo Meng <luomeng12@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Link: https://lore.kernel.org/r/20201020013631.3796673-1-luomeng12@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: fast commit recovery pathHarshad Shirwadkar
This patch adds fast commit recovery path support for Ext4 file system. We add several helper functions that are similar in spirit to e2fsprogs journal recovery path handlers. Example of such functions include - a simple block allocator, idempotent block bitmap update function etc. Using these routines and the fast commit log in the fast commit area, the recovery path (ext4_fc_replay()) performs fast commit log recovery. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-21ext4: main fast-commit commit pathHarshad Shirwadkar
This patch adds main fast commit commit path handlers. The overall patch can be divided into two inter-related parts: (A) Metadata updates tracking This part consists of helper functions to track changes that need to be committed during a commit operation. These updates are maintained by Ext4 in different in-memory queues. Following are the APIs and their short description that are implemented in this patch: - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat operations - ext4_fc_track_range() - Track changed logical block offsets inodes - ext4_fc_track_inode() - Track inodes - ext4_fc_mark_ineligible() - Mark file system fast commit ineligible() - ext4_fc_start_update() / ext4_fc_stop_update() / ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These functions are useful for co-ordinating inode updates with commits. (B) Main commit Path This part consists of functions to convert updates tracked in in-memory data structures into on-disk commits. Function ext4_fc_commit() is the main entry point to commit path. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20201015203802.3597742-6-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: fix bs < ps issue reported with dioread_nolock mount optRitesh Harjani
left shifting m_lblk by blkbits was causing value overflow and hence it was not able to convert unwritten to written extent. So, make sure we typecast it to loff_t before do left shift operation. Also in func ext4_convert_unwritten_io_end_vec(), make sure to initialize ret variable to avoid accidentally returning an uninitialized ret. This patch fixes the issue reported in ext4 for bs < ps with dioread_nolock mount option. Fixes: c8cc88163f40df39e50c ("ext4: Add support for blocksize < pagesize in dioread_nolock") Cc: stable@kernel.org Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/af902b5db99e8b73980c795d84ad7bb417487e76.1602168865.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()Mauricio Faria de Oliveira
This implements journal callbacks j_submit|finish_inode_data_buffers() with different behavior for data=journal: to write-protect pages under commit, preventing changes to buffers writeably mapped to userspace. If a buffer's content changes between commit's checksum calculation and write-out to disk, it can cause journal recovery/mount failures upon a kernel crash or power loss. [ 27.334874] EXT4-fs: Warning: mounting with data=journal disables delayed allocation, dioread_nolock, and O_DIRECT support! [ 27.339492] JBD2: Invalid checksum recovering data block 8705 in log [ 27.342716] JBD2: recovery failed [ 27.343316] EXT4-fs (loop0): error loading journal mount: /ext4: can't read superblock on /dev/loop0. In j_submit_inode_data_buffers() we write-protect the inode's pages with write_cache_pages() and redirty w/ writepage callback if needed. In j_finish_inode_data_buffers() there is nothing do to. And in order to use the callbacks, inodes are added to the inode list in transaction in __ext4_journalled_writepage() and ext4_page_mkwrite(). In ext4_page_mkwrite() we must make sure that the buffers are attached to the transaction as jbddirty with write_end_fn(), as already done in __ext4_journalled_writepage(). Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Reported-by: Dann Frazier <dann.frazier@canonical.com> Reported-by: kernel test robot <lkp@intel.com> # wbc.nr_to_write Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20201006004841.600488-5-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: data=journal: fixes for ext4_page_mkwrite()Mauricio Faria de Oliveira
These are two fixes for data journalling required by the next patch, discovered while testing it. First, the optimization to return early if all buffers are mapped is not appropriate for the next patch: The inode _must_ be added to the transaction's list in data=journal mode (so to write-protect pages on commit) thus we cannot return early there. Second, once that optimization to reduce transactions was disabled for data=journal mode, more transactions happened, and occasionally hit this warning message: 'JBD2: Spotted dirty metadata buffer'. Reason is, block_page_mkwrite() will set_buffer_dirty() before do_journal_get_write_access() that is there to prevent it. This issue was masked by the optimization. So, on data=journal use __block_write_begin() instead. This also requires page locking and len recalculation. (see block_page_mkwrite() for implementation details.) Finally, as Jan noted there is little sharing between data=journal and other modes in ext4_page_mkwrite(). However, a prototype of ext4_journalled_page_mkwrite() showed there still would be lots of duplicated lines (tens of) that didn't seem worth it. Thus this patch ends up with an ugly goto to skip all non-data journalling code (to avoid long indentations, but that can be changed..) in the beginning, and just a conditional in the transaction section. Well, we skip a common part to data journalling which is the page truncated check, but we do it again after ext4_journal_start() when we re-acquire the page lock (so not to acquire the page lock twice needlessly for data journalling.) Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Link: https://lore.kernel.org/r/20201006004841.600488-4-mfo@canonical.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: introduce ext4_sb_breadahead_unmovable() to replace ↵zhangyi (F)
sb_breadahead_unmovable() If we readahead inode tables in __ext4_get_inode_loc(), it may bypass buffer_write_io_error() check, so introduce ext4_sb_breadahead_unmovable() to handle this special case. This patch also replace sb_breadahead_unmovable() in ext4_fill_super() for the sake of unification. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200924073337.861472-6-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use ext4_buffer_uptodate() in __ext4_get_inode_loc()zhangyi (F)
We have already introduced ext4_buffer_uptodate() to re-set the uptodate bit on buffer which has been failed to write out to disk. Just remove the redundant codes and switch to use ext4_buffer_uptodate() in __ext4_get_inode_loc(). Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200924073337.861472-5-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: use common helpers in all places reading metadata bufferszhangyi (F)
Revome all open codes that read metadata buffers, switch to use ext4_read_bh_*() common helpers. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Suggested-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: clear buffer verified flag if read meta block from diskzhangyi (F)
The metadata buffer is no longer trusted after we read it from disk again because it is not uptodate for some reasons (e.g. failed to write back). Otherwise we may get below memory corruption problem in ext4_ext_split()->memset() if we read stale data from the newly allocated extent block on disk which has been failed to async write out but miss verify again since the verified bit has already been set on the buffer. [ 29.774674] BUG: unable to handle kernel paging request at ffff88841949d000 ... [ 29.783317] Oops: 0002 [#2] SMP [ 29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800 [ 29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G D W [ 29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8 [ 29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364 [ 29.787588] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000 [ 29.789288] Workqueue: writeback wb_workfn [ 29.790319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 29.790321] (flush-8:0) [ 29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0 [ 29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 29.792839] RIP: 0010:__memset+0x24/0x30 [ 29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033 [ 29.795161] Kernel panic - not syncing: Fatal exception in interrupt ... [ 29.808149] Call Trace: [ 29.808475] ext4_ext_insert_extent+0x102e/0x1be0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0 [ 29.809652] ext4_map_blocks+0x290/0x8a0 [ 29.810161] ext4_writepages+0xc85/0x17c0 ... Fix this by clearing buffer's verified bit if we read meta block from disk again. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: optimize file overwritesRitesh Harjani
In case if the file already has underlying blocks/extents allocated then we don't need to start a journal txn and can directly return the underlying mapping. Currently ext4_iomap_begin() is used by both DAX & DIO path. We can check if the write request is an overwrite & then directly return the mapping information. This could give a significant perf boost for multi-threaded writes specially random overwrites. On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement could be seen in random writes (overwrite). Also bcoz this optimizes away the spinlock contention during jbd2 slab cache allocation (jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed. Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/88e795d8a4d5cd22165c7ebe857ba91d68d8813e.1600401668.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: delete duplicated words + other fixesRandy Dunlap
Delete repeated words in fs/ext4/. {the, this, of, we, after} Also change spelling of "xttr" in inline.c to "xattr" in 2 places. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200805024850.12129-1-rdunlap@infradead.org Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-10-18ext4: implement swap_activate aops using iomapRitesh Harjani
After moving ext4's bmap to iomap interface, swapon functionality on files created using fallocate (which creates unwritten extents) are failing. This is since iomap_bmap interface returns 0 for unwritten extents and thus generic_swapfile_activate considers this as holes and hence bail out with below kernel msg :- [340.915835] swapon: swapfile has holes To fix this we need to implement ->swap_activate aops in ext4 which will use ext4_iomap_report_ops. Since we only need to return the list of extents so ext4_iomap_report_ops should be enough. Cc: stable@kernel.org Reported-by: Yuxuan Shui <yshuiv7@gmail.com> Fixes: ac58e4fb03f ("ext4: move ext4 bmap to use iomap infrastructure") Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200904091653.1014334-1-riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-28Merge tag 'writeback_for_v5.9-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull writeback fixes from Jan Kara: "Fixes for writeback code occasionally skipping writeback of some inodes or livelocking sync(2)" * tag 'writeback_for_v5.9-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: writeback: Drop I_DIRTY_TIME_EXPIRE writeback: Fix sync livelock due to b_dirty_time processing writeback: Avoid skipping inode writeback writeback: Protect inode->i_io_list with inode->i_lock
2020-08-21Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Improvements to ext4's block allocator performance for very large file systems, especially when the file system or files which are highly fragmented. There is a new mount option, prefetch_block_bitmaps which will pull in the block bitmaps and set up the in-memory buddy bitmaps when the file system is initially mounted. Beyond that, a lot of bug fixes and cleanups. In particular, a number of changes to make ext4 more robust in the face of write errors or file system corruptions" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits) ext4: limit the length of per-inode prealloc list ext4: reorganize if statement of ext4_mb_release_context() ext4: add mb_debug logging when there are lost chunks ext4: Fix comment typo "the the". jbd2: clean up checksum verification in do_one_pass() ext4: change to use fallthrough macro ext4: remove unused parameter of ext4_generic_delete_entry function mballoc: replace seq_printf with seq_puts ext4: optimize the implementation of ext4_mb_good_group() ext4: delete invalid comments near ext4_mb_check_limits() ext4: fix typos in ext4_mb_regular_allocator() comment ext4: fix checking of directory entry validity for inline directories fs: prevent BUG_ON in submit_bh_wbc() ext4: correctly restore system zone info when remount fails ext4: handle add_system_zone() failure in ext4_setup_system_zone() ext4: fold ext4_data_block_valid_rcu() into the caller ext4: check journal inode extents more carefully ext4: don't allow overlapping system zones ext4: handle error of ext4_setup_system_zone() on remount ext4: delete the invalid BUGON in ext4_mb_load_buddy_gfp() ...
2020-08-19ext4: limit the length of per-inode prealloc listbrookxu
In the scenario of writing sparse files, the per-inode prealloc list may be very long, resulting in high overhead for ext4_mb_use_preallocated(). To circumvent this problem, we limit the maximum length of per-inode prealloc list to 512 and allow users to modify it. After patching, we observed that the sys ratio of cpu has dropped, and the system throughput has increased significantly. We created a process to write the sparse file, and the running time of the process on the fixed kernel was significantly reduced, as follows: Running time on unfixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m2.051s user 0m0.008s sys 0m2.026s Running time on fixed kernel: [root@TENCENT64 ~]# time taskset 0x01 ./sparse /data1/sparce.dat real 0m0.471s user 0m0.004s sys 0m0.395s Signed-off-by: Chunguang Xu <brookxu@tencent.com> Link: https://lore.kernel.org/r/d7a98178-056b-6db5-6bce-4ead23f4a257@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07ext4: check journal inode extents more carefullyJan Kara
Currently, system zones just track ranges of block, that are "important" fs metadata (bitmaps, group descriptors, journal blocks, etc.). This however complicates how extent tree (or indirect blocks) can be checked for inodes that actually track such metadata - currently the journal inode but arguably we should be treating quota files or resize inode similarly. We cannot run __ext4_ext_check() on such metadata inodes when loading their extents as that would immediately trigger the validity checks and so we just hack around that and special-case the journal inode. This however leads to a situation that a journal inode which has extent tree of depth at least one can have invalid extent tree that gets unnoticed until ext4_cache_extents() crashes. To overcome this limitation, track inode number each system zone belongs to (0 is used for zones not belonging to any inode). We can then verify inode number matches the expected one when verifying extent tree and thus avoid the false errors. With this there's no need to to special-case journal inode during extent tree checking anymore so remove it. Fixes: 0a944e8a6c66 ("ext4: don't perform block validity checks on the journal inode") Reported-by: Wolfgang Frisch <wolfgang.frisch@suse.com> Reviewed-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200728130437.7804-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-07jbd2: remove unused parameter in jbd2_journal_try_to_free_buffers()zhangyi (F)
Parameter gfp_mask in jbd2_journal_try_to_free_buffers() is no longer used after commit <536fc240e7147> ("jbd2: clean up jbd2_journal_try_to_free_buffers()"), so just remove it. Signed-off-by: zhangyi (F) <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20200620025427.1756360-6-yi.zhang@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-08-06ext4: lost matching-pair of trace in ext4_truncatezhengliang
It should call trace exit in all return path for ext4_truncate. Signed-off-by: zhengliang <zhengliang6@huawei.com> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200701083027.45996-1-zhengliang6@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-07-08ext4: add inline encryption supportEric Biggers
Wire up ext4 to support inline encryption via the helper functions which fs/crypto/ now provides. This includes: - Adding a mount option 'inlinecrypt' which enables inline encryption on encrypted files where it can be used. - Setting the bio_crypt_ctx on bios that will be submitted to an inline-encrypted file. Note: submit_bh_wbc() in fs/buffer.c also needed to be patched for this part, since ext4 sometimes uses ll_rw_block() on file data. - Not adding logically discontiguous data to bios that will be submitted to an inline-encrypted file. - Not doing filesystem-layer crypto on inline-encrypted files. Co-developed-by: Satya Tangirala <satyat@google.com> Signed-off-by: Satya Tangirala <satyat@google.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20200702015607.1215430-5-satyat@google.com Signed-off-by: Eric Biggers <ebiggers@google.com>
2020-06-15Merge tag 'ext4-for-linus-5.8-rc1-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull more ext4 updates from Ted Ts'o: "This is the second round of ext4 commits for 5.8 merge window [1]. It includes the per-inode DAX support, which was dependant on the DAX infrastructure which came in via the XFS tree, and a number of regression and bug fixes; most notably the "BUG: using smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported by syzkaller" [1] The pull request actually came in 15 minutes after I had tagged the rc1 release. Tssk, tssk, late.. - Linus * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers ext4: support xattr gnu.* namespace for the Hurd ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr ext4: avoid utf8_strncasecmp() with unstable name ext4: stop overwrite the errcode in ext4_setup_super ext4: fix partial cluster initialization when splitting extent ext4: avoid race conditions when remounting with options that change dax Documentation/dax: Update DAX enablement for ext4 fs/ext4: Introduce DAX inode flag fs/ext4: Remove jflag variable fs/ext4: Make DAX mount option a tri-state fs/ext4: Only change S_DAX on inode load fs/ext4: Update ext4_should_use_dax() fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS fs/ext4: Disallow verity if inode is DAX fs/ext4: Narrow scope of DAX check in setflags
2020-06-15writeback: Drop I_DIRTY_TIME_EXPIREJan Kara
The only use of I_DIRTY_TIME_EXPIRE is to detect in __writeback_single_inode() that inode got there because flush worker decided it's time to writeback the dirty inode time stamps (either because we are syncing or because of age). However we can detect this directly in __writeback_single_inode() and there's no need for the strange propagation with I_DIRTY_TIME_EXPIRE flag. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz>
2020-06-11Enable ext4 support for per-file/directory dax operationsTheodore Ts'o
This adds the same per-file/per-directory DAX support for ext4 as was done for xfs, now that we finally have consensus over what the interface should be.
2020-06-05Merge tag 'afs-next-20200604' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS updates from David Howells: "There's some core VFS changes which affect a couple of filesystems: - Make the inode hash table RCU safe and providing some RCU-safe accessor functions. The search can then be done without taking the inode_hash_lock. Care must be taken because the object may be being deleted and no wait is made. - Allow iunique() to avoid taking the inode_hash_lock. - Allow AFS's callback processing to avoid taking the inode_hash_lock when using the inode table to find an inode to notify. - Improve Ext4's time updating. Konstantin Khlebnikov said "For now, I've plugged this issue with try-lock in ext4 lazy time update. This solution is much better." Then there's a set of changes to make a number of improvements to the AFS driver: - Improve callback (ie. third party change notification) processing by: (a) Relying more on the fact we're doing this under RCU and by using fewer locks. This makes use of the RCU-based inode searching outlined above. (b) Moving to keeping volumes in a tree indexed by volume ID rather than a flat list. (c) Making the server and volume records logically part of the cell. This means that a server record now points directly at the cell and the tree of volumes is there. This removes an N:M mapping table, simplifying things. - Improve keeping NAT or firewall channels open for the server callbacks to reach the client by actively polling the fileserver on a timed basis, instead of only doing it when we have an operation to process. - Improving detection of delayed or lost callbacks by including the parent directory in the list of file IDs to be queried when doing a bulk status fetch from lookup. We can then check to see if our copy of the directory has changed under us without us getting notified. - Determine aliasing of cells (such as a cell that is pointed to be a DNS alias). This allows us to avoid having ambiguity due to apparently different cells using the same volume and file servers. - Improve the fileserver rotation to do more probing when it detects that all of the addresses to a server are listed as non-responsive. It's possible that an address that previously stopped responding has become responsive again. Beyond that, lay some foundations for making some calls asynchronous: - Turn the fileserver cursor struct into a general operation struct and hang the parameters off of that rather than keeping them in local variables and hang results off of that rather than the call struct. - Implement some general operation handling code and simplify the callers of operations that affect a volume or a volume component (such as a file). Most of the operation is now done by core code. - Operations are supplied with a table of operations to issue different variants of RPCs and to manage the completion, where all the required data is held in the operation object, thereby allowing these to be called from a workqueue. - Put the standard "if (begin), while(select), call op, end" sequence into a canned function that just emulates the current behaviour for now. There are also some fixes interspersed: - Don't let the EACCES from ICMP6 mapping reach the user as such, since it's confusing as to whether it's a filesystem error. Convert it to EHOSTUNREACH. - Don't use the epoch value acquired through probing a server. If we have two servers with the same UUID but in different cells, it's hard to draw conclusions from them having different epoch values. - Don't interpret the argument to the CB.ProbeUuid RPC as a fileserver UUID and look up a fileserver from it. - Deal with servers in different cells having the same UUIDs. In the event that a CB.InitCallBackState3 RPC is received, we have to break the callback promises for every server record matching that UUID. - Don't let afs_statfs return values that go below 0. - Don't use running fileserver probe state to make server selection and address selection decisions on. Only make decisions on final state as the running state is cleared at the start of probing" Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part) * tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits) afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly afs: Show more a bit more server state in /proc/net/afs/servers afs: Don't use probe running state to make decisions outside probe code afs: Fix afs_statfs() to not let the values go below zero afs: Fix the by-UUID server tree to allow servers with the same UUID afs: Reorganise volume and server trees to be rooted on the cell afs: Add a tracepoint to track the lifetime of the afs_volume struct afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op afs: Detect cell aliases 2 - Cells with no root volumes afs: Detect cell aliases 1 - Cells with root volumes afs: Implement client support for the YFSVL.GetCellName RPC op afs: Retain more of the VLDB record for alias detection afs: Fix handling of CB.ProbeUuid cache manager op afs: Don't get epoch from a server because it may be ambiguous afs: Build an abstraction around an "operation" concept afs: Rename struct afs_fs_cursor to afs_operation afs: Remove the error argument from afs_protocol_error() afs: Set error flag rather than return error from file status decode afs: Make callback processing more efficient. afs: Show more information in /proc/net/afs/servers ...
2020-06-05Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "A lot of bug fixes and cleanups for ext4, including: - Fix performance problems found in dioread_nolock now that it is the default, caused by transaction leaks. - Clean up fiemap handling in ext4 - Clean up and refactor multiple block allocator (mballoc) code - Fix a problem with mballoc with a smaller file systems running out of blocks because they couldn't properly use blocks that had been reserved by inode preallocation. - Fixed a race in ext4_sync_parent() versus rename() - Simplify the error handling in the extent manipulation code - Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers. - Avoid passing an error pointer to brelse in ext4_xattr_set() - Fix race which could result to freeing an inode on the dirty last in data=journal mode. - Fix refcount handling if ext4_iget() fails - Fix a crash in generic/019 caused by a corrupted extent node" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits) ext4: avoid unnecessary transaction starts during writeback ext4: don't block for O_DIRECT if IOCB_NOWAIT is set ext4: remove the access_ok() check in ext4_ioctl_get_es_cache fs: remove the access_ok() check in ioctl_fiemap fs: handle FIEMAP_FLAG_SYNC in fiemap_prep fs: move fiemap range validation into the file systems instances iomap: fix the iomap_fiemap prototype fs: move the fiemap definitions out of fs.h fs: mark __generic_block_fiemap static ext4: remove the call to fiemap_check_flags in ext4_fiemap ext4: split _ext4_fiemap ext4: fix fiemap size checks for bitmap files ext4: fix EXT4_MAX_LOGICAL_BLOCK macro add comment for ext4_dir_entry_2 file_type member jbd2: avoid leaking transaction credits when unreserving handle ext4: drop ext4_journal_free_reserved() ext4: mballoc: use lock for checking free blocks while retrying ext4: mballoc: refactor ext4_mb_good_group() ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling ext4: mballoc: refactor ext4_mb_discard_preallocations() ...
2020-06-03ext4: avoid unnecessary transaction starts during writebackJan Kara
ext4_writepages() currently works in a loop like: start a transaction scan inode for pages to write map and submit these pages stop the transaction This loop results in starting transaction once more than is needed because in the last iteration we start a transaction only to scan the inode and find there are no pages to write. This can be significant increase in number of transaction starts for single-extent files or files that have all blocks already mapped. Furthermore we already know from previous iteration whether there are more pages to write or not. So propagate the information from mpage_prepare_extent_to_map() and avoid unnecessary looping in case there are no more pages to write. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200525081215.29451-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: make ext_debug() implementation to use pr_debug()Ritesh Harjani
ext_debug() msgs could be helpful, provided those could be enabled without recompiling kernel and also if we could selectively enable only required prints for case by case debugging. So make ext_debug() implementation use pr_debug(). Also change ext_debug() to be defined with CONFIG_EXT4_DEBUG. So EXT_DEBUG macro now mostly remain for below 3 functions. ext4_ext_show_path/leaf/move() (whose print msgs use ext_debug() which again could be dynamically enabled using pr_debug()) This also changes the ext_debug() to take inode as a parameter to add inode no. in all of it's msgs. Prints additional info like process name / pid, superblock id etc. This also removes any explicit function names passed in ext_debug(). Since ext_debug() on it's own prints file, func and line no. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/d31dc189b0aeda9384fe7665e36da7cd8c61571f.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: improve ext_debug() msg in case of block allocation failureRitesh Harjani
ext4_map_blocks() has ext_debug msg early at the start of function. We also get ext_debug msg if we could allocate a block from ext4_ext_map_blocks(). But there is no ext_debug() msg in case of block allocation failure. So add one along with error code. Also add more info in ext_debug() msg like how many blocks were allocated v/s how many were requested in ext4_ext_map_blocks(). Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/1610ec2aa932396be00f9d552fe29da473ead176.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: use BIT() macro for BH_** state bitsRitesh Harjani
Simply use BIT() macro for all BH_** state bits instead of open coding it. There should be no functionality change in this patch. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/57667689f51a3f9dba2fcef7d3425187fa3ba69f.1589086800.git.riteshh@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: handle ext4_mark_inode_dirty errorsHarshad Shirwadkar
ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return value may lead ext4 to ignore real failures that would result in corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail as soon as possible and return errors to the caller whenever appropriate. One of the possible scnearios when this bug could affected is that while creating a new inode, its directory entry gets added successfully but while writing the inode itself mark_inode_dirty returns error which is ignored. This would result in inconsistency that the directory entry points to a non-existent inode. Ran gce-xfstests smoke tests and verified that there were no regressions. Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com> Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: Avoid freeing inodes on dirty listJan Kara
When we are evicting inode with journalled data, we may race with transaction commit in the following way: CPU0 CPU1 jbd2_journal_commit_transaction() evict(inode) inode_io_list_del() inode_wait_for_writeback() process BJ_Forget list __jbd2_journal_insert_checkpoint() __jbd2_journal_refile_buffer() __jbd2_journal_unfile_buffer() if (test_clear_buffer_jbddirty(bh)) mark_buffer_dirty(bh) __mark_inode_dirty(inode) ext4_evict_inode(inode) frees the inode This results in use-after-free issues in the writeback code (or the assertion added in the previous commit triggering). Fix the problem by removing inode from writeback lists once all the page cache is evicted and so inode cannot be added to writeback lists again. Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200421085445.5731-4-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: remove unnecessary comparisons to boolJason Yan
Fix the following coccicheck warning: fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool Signed-off-by: Jason Yan <yanaijie@huawei.com> Reviewed-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/r/20200420042918.19459-1-yanaijie@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-03ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flagEric Whitney
The eofblocks code was removed in the 5.7 release by "ext4: remove EOFBLOCKS_FL and associated code" (4337ecd1fe99). The ext4_map_blocks() flag used to trigger it can now be removed as well. Signed-off-by: Eric Whitney <enwlinux@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20200415203140.30349-2-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-06-02ext4: pass the inode to ext4_mpage_readpagesMatthew Wilcox (Oracle)
This function now only uses the mapping argument to look up the inode, and both callers already have the inode, so just pass the inode instead of the mapping. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-22-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02ext4: convert from readpages to readaheadMatthew Wilcox (Oracle)
Use the new readahead operation in ext4 Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Eric Biggers <ebiggers@google.com> Cc: Chao Yu <yuchao0@huawei.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Dave Chinner <dchinner@redhat.com> Cc: Gao Xiang <gaoxiang25@huawei.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Link: http://lkml.kernel.org/r/20200414150233.24495-21-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-05-31vfs, afs, ext4: Make the inode hash table RCU searchableDavid Howells
Make the inode hash table RCU searchable so that searches that want to access or modify an inode without taking a ref on that inode can do so without taking the inode hash table lock. The main thing this requires is some RCU annotation on the list manipulation operations. Inodes are already freed by RCU in most cases. Users of this interface must take care as the inode may be still under construction or may be being torn down around them. There are at least three instances where this can be of use: (1) Testing whether the inode number iunique() is going to return is currently unique (the iunique_lock is still held). (2) Ext4 date stamp updating. (3) AFS callback breaking. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> cc: linux-ext4@vger.kernel.org cc: linux-afs@lists.infradead.org
2020-05-28fs/ext4: Introduce DAX inode flagIra Weiny
Add a flag ([EXT4|FS]_DAX_FL) to preserve FS_XFLAG_DAX in the ext4 inode. Set the flag to be user visible and changeable. Set the flag to be inherited. Allow applications to change the flag at any time except if it conflicts with the set of mutually exclusive flags (Currently VERITY, ENCRYPT, JOURNAL_DATA). Furthermore, restrict setting any of the exclusive flags if DAX is set. While conceptually possible, we do not allow setting EXT4_DAX_FL while at the same time clearing exclusion flags (or vice versa) for 2 reasons: 1) The DAX flag does not take effect immediately which introduces quite a bit of complexity 2) There is no clear use case for being this flexible Finally, on regular files, flag the inode to not be cached to facilitate changing S_DAX on the next creation of the inode. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-9-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2020-05-28fs/ext4: Make DAX mount option a tri-stateIra Weiny
We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same which is equivalent to 'always'. This new functionality is limited to ext4 only. Specifically we introduce a 2nd DAX mount flag EXT4_MOUNT2_DAX_NEVER and set it and EXT4_MOUNT_DAX_ALWAYS appropriately for the mode. We also force EXT4_MOUNT2_DAX_NEVER if !CONFIG_FS_DAX. Finally, EXT4_MOUNT2_DAX_INODE is used solely to detect if the user specified that option for printing. Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Link: https://lore.kernel.org/r/20200528150003.828793-7-ira.weiny@intel.com Signed-off-by: Theodore Ts'o <tytso@mit.edu>