summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-03-29Merge tag 'jfs-5.18' of https://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs updates from Dave Kleikamp: "A couple bug fixes" * tag 'jfs-5.18' of https://github.com/kleikamp/linux-shaggy: jfs: prevent NULL deref in diFree jfs: fix divide error in dbNextAG
2022-03-29fs: fd tables have to be multiples of BITS_PER_LONGLinus Torvalds
This has always been the rule: fdtables have several bitmaps in them, and as a result they have to be sized properly for bitmaps. We walk those bitmaps in chunks of 'unsigned long' in serveral cases, but even when we don't, we use the regular kernel bitops that are defined to work on arrays of 'unsigned long', not on some byte array. Now, the distinction between arrays of bytes and 'unsigned long' normally only really ends up being noticeable on big-endian systems, but Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps() could be called with an argument that wasn't even a multiple of BITS_PER_BYTE. And then it fails to do the proper copy even on little-endian machines. The bug wasn't in copy_fd_bitmap(), but in sane_fdtable_size(), which didn't actually sanitize the fdtable size sufficiently, and never made sure it had the proper BITS_PER_LONG alignment. That's partly because the alignment historically came not from having to explicitly align things, but simply from previous fdtable sizes, and from count_open_files(), which counts the file descriptors by walking them one 'unsigned long' word at a time and thus naturally ends up doing sizing in the proper 'chunks of unsigned long'. But with the introduction of close_range(), we now have an external source of "this is how many files we want to have", and so sane_fdtable_size() needs to do a better job. This also adds that explicit alignment to alloc_fdtable(), although there it is mainly just for documentation at a source code level. The arithmetic we do there to pick a reasonable fdtable size already aligns the result sufficiently. In fact,clang notices that the added ALIGN() in that function doesn't actually do anything, and does not generate any extra code for it. It turns out that gcc ends up confusing itself by combining a previous constant-sized shift operation with the variable-sized shift operations in roundup_pow_of_two(). And probably due to that doesn't notice that the ALIGN() is a no-op. But that's a (tiny) gcc misfeature that doesn't matter. Having the explicit alignment makes sense, and would actually matter on a 128-bit architecture if we ever go there. This also adds big comments above both functions about how fdtable sizes have to have that BITS_PER_LONG alignment. Fixes: 60997c3d45d9 ("close_range: add CLOSE_RANGE_UNSHARE") Reported-by: Fedor Pchelkin <aissur0002@gmail.com> Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/ Tested-and-acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-29io_uring: defer msg-ring file validity check until command issuefor-5.18/io_uring-2022-04-01Jens Axboe
In preparation for not using the file at prep time, defer checking if this file refers to a valid io_uring instance until issue time. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-29io_uring: fail links if msg-ring doesn't succeeedJens Axboe
We must always call req_set_fail() if the request is failed, otherwise we won't sever links for dependent chains correctly. Fixes: 4f57f06ce218 ("io_uring: add support for IORING_OP_MSG_RING command") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-28Merge tag 'ptrace-cleanups-for-v5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull ptrace cleanups from Eric Biederman: "This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, adds a missing permission check to ptrace.c The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the semantics clearer). For people who don't know tracehook.h is a vestiage of an attempt to implement uprobes like functionality that was never fully merged, and was later superseeded by uprobes when uprobes was merged. For many years now we have been removing what tracehook functionaly a little bit at a time. To the point where anything left in tracehook.h was some weird strange thing that was difficult to understand" * tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: ptrace: Remove duplicated include in ptrace.c ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE ptrace: Return the signal to continue with from ptrace_stop ptrace: Move setting/clearing ptrace_message into ptrace_stop tracehook: Remove tracehook.h resume_user_mode: Move to resume_user_mode.h resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume signal: Move set_notify_signal and clear_notify_signal into sched/signal.h task_work: Decouple TIF_NOTIFY_SIGNAL and task_work task_work: Call tracehook_notify_signal from get_signal on all architectures task_work: Introduce task_work_pending task_work: Remove unnecessary include from posix_timers.h ptrace: Remove tracehook_signal_handler ptrace: Remove arch_syscall_{enter,exit}_tracehook ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h ptrace/arm: Rename tracehook_report_syscall report_syscall ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-28smb3: cleanup and clarify status of tree connectionsSteve French
Currently the way the tid (tree connection) status is tracked is confusing. The same enum is used for structs cifs_tcon and cifs_ses and TCP_Server_info, but each of these three has different states that they transition among. The current code also unnecessarily uses camelCase. Convert from use of statusEnum to a new tid_status_enum for tree connections. The valid states for a tid are: TID_NEW = 0, TID_GOOD, TID_EXITING, TID_NEED_RECON, TID_NEED_TCON, TID_IN_TCON, TID_NEED_FILES_INVALIDATE, /* unused, considering removing in future */ TID_IN_FILES_INVALIDATE It also removes CifsNeedTcon, CifsInTcon, CifsNeedFilesInvalidate and CifsInFilesInvalidate from the statusEnum used for session and TCP_Server_Info since they are not relevant for those. A follow on patch will fix the places where we use the tcon->need_reconnect flag to be more consistent with the tid->status. Also fixes a bug that was: Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-03-28Merge tag 'driver-core-5.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits) Documentation: update stable review cycle documentation drivers/base/dd.c : Remove the initial value of the global variable Documentation: update stable tree link Documentation: add link to stable release candidate tree devres: fix typos in comments Documentation: add note block surrounding security patch note samples/kobject: Use sysfs_emit instead of sprintf base: soc: Make soc_device_match() simpler and easier to read driver core: dd: fix return value of __setup handler driver core: Refactor sysfs and drv/bus remove hooks driver core: Refactor multiple copies of device cleanup scripts: get_abi.pl: Fix typo in help message kernfs: fix typos in comments kernfs: remove unneeded #if 0 guard ALSA: hda/realtek: Make use of the helper component_compare_dev_name video: omapfb: dss: Make use of the helper component_compare_dev power: supply: ab8500: Make use of the helper component_compare_dev ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of iommu/mediatek: Make use of the helper component_compare/release_of drm: of: Make use of the helper component_release_of ...
2022-03-28xfs: don't report reserved bnobt space as availableDarrick J. Wong
On a modern filesystem, we don't allow userspace to allocate blocks for data storage from the per-AG space reservations, the user-controlled reservation pool that prevents ENOSPC in the middle of internal operations, or the internal per-AG set-aside that prevents unwanted filesystem shutdowns due to ENOSPC during a bmap btree split. Since we now consider freespace btree blocks as unavailable for allocation for data storage, we shouldn't report those blocks via statfs either. This makes the numbers that we return via the statfs f_bavail and f_bfree fields a more conservative estimate of actual free space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-28xfs: fix overfilling of reserve poolDarrick J. Wong
Due to cycling of m_sb_lock, it's possible for multiple callers of xfs_reserve_blocks to race at changing the pool size, subtracting blocks from fdblocks, and actually putting it in the pool. The result of all this is that we can overfill the reserve pool to hilarious levels. xfs_mod_fdblocks, when called with a positive value, already knows how to take freed blocks and either fill the reserve until it's full, or put them in fdblocks. Use that instead of setting m_resblks_avail directly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-28xfs: always succeed at setting the reserve pool sizeDarrick J. Wong
Nowadays, xfs_mod_fdblocks will always choose to fill the reserve pool with freed blocks before adding to fdblocks. Therefore, we can change the behavior of xfs_reserve_blocks slightly -- setting the target size of the pool should always succeed, since a deficiency will eventually be made up as blocks get freed. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-28xfs: remove infinite loop when reserving free block poolDarrick J. Wong
Infinite loops in kernel code are scary. Calls to xfs_reserve_blocks should be rare (people should just use the defaults!) so we really don't need to try so hard. Simplify the logic here by removing the infinite loop. Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-28xfs: don't include bnobt blocks when reserving free block poolDarrick J. Wong
xfs_reserve_blocks controls the size of the user-visible free space reserve pool. Given the difference between the current and requested pool sizes, it will try to reserve free space from fdblocks. However, the amount requested from fdblocks is also constrained by the amount of space that we think xfs_mod_fdblocks will give us. If we forget to subtract m_allocbt_blks before calling xfs_mod_fdblocks, it will will return ENOSPC and we'll hang the kernel at mount due to the infinite loop. In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand out the "free space" used by the free space btrees, because some portion of the free space btrees hold in reserve space for future btree expansion. Unfortunately, xfs_reserve_blocks' estimation of the number of blocks that it could request from xfs_mod_fdblocks was not updated to include m_allocbt_blks, so if space is extremely low, the caller hangs. Fix this by creating a function to estimate the number of blocks that can be reserved from fdblocks, which needs to exclude the set-aside and m_allocbt_blks. Found by running xfs/306 (which formats a single-AG 20MB filesystem) with an fstests configuration that specifies a 1k blocksize and a specially crafted log size that will consume 7/8 of the space (17920 blocks, specifically) in that AG. Cc: Brian Foster <bfoster@redhat.com> Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2022-03-28NFSv4/pNFS: Fix another issue with a list iterator pointing to the headTrond Myklebust
In nfs4_callback_devicenotify(), if we don't find a matching entry for the deviceid, we're left with a pointer to 'struct nfs_server' that actually points to the list of super blocks associated with our struct nfs_client. Furthermore, even if we have a valid pointer, nothing pins the super block, and so the struct nfs_server could end up getting freed while we're using it. Since all we want is a pointer to the struct pnfs_layoutdriver_type, let's skip all the iteration over super blocks, and just use APIs to find the layout driver directly. Reported-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Fixes: 1be5683b03a7 ("pnfs: CB_NOTIFY_DEVICEID") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-27Merge tag 'x86_core_for_5.18_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra: "Add support for Intel CET-IBT, available since Tigerlake (11th gen), which is a coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP. Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is limited to 2 instructions (and typically fewer) on branch targets not starting with ENDBR. CET-IBT also limits speculation of the next sequential instruction after the indirect CALL/JMP [1]. CET-IBT is fundamentally incompatible with retpolines, but provides, as described above, speculation limits itself" [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html * tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) kvm/emulate: Fix SETcc emulation for ENDBR x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0 x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0 kbuild: Fixup the IBT kbuild changes x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy x86: Remove toolchain check for X32 ABI capability x86/alternative: Use .ibt_endbr_seal to seal indirect calls objtool: Find unused ENDBR instructions objtool: Validate IBT assumptions objtool: Add IBT/ENDBR decoding objtool: Read the NOENDBR annotation x86: Annotate idtentry_df() x86,objtool: Move the ASM_REACHABLE annotation to objtool.h x86: Annotate call_on_stack() objtool: Rework ASM_REACHABLE x86: Mark __invalid_creds() __noreturn exit: Mark do_group_exit() __noreturn x86: Mark stop_this_cpu() __noreturn objtool: Ignore extra-symbol code objtool: Rename --duplicate to --lto ...
2022-03-26smb3: move defines for query info and query fsinfo to smbfs_commonSteve French
Includes moving to common code (from cifs and ksmbd protocol related headers) - query and query directory info levels and structs - set info structs - SMB2 lock struct and flags - SMB2 echo req Also shorten a few flag names (e.g. SMB2_LOCKFLAG_EXCLUSIVE_LOCK to SMB2_LOCKFLAG_EXCLUSIVE) Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-03-26smb3: move defines for ioctl protocol header and SMB2 sizes to smbfs_commonSteve French
The definitions for the ioctl SMB3 request and response as well as length of various fields defined in the protocol documentation were duplicated in fs/ksmbd and fs/cifs. Move these to the common code in fs/smbfs_common/smb2pdu.h Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-03-26Merge tag 'write-page-prefaulting' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull iomap fixlet from Andreas Gruenbacher: "Fix buffered write page prefaulting. I forgot to send it the previous merge window. I've only improved the patch description since" * tag 'write-page-prefaulting' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: fs/iomap: Fix buffered write page prefaulting
2022-03-26Merge tag 'for-5.18/alloc-cleanups-2022-03-25' of ↵Linus Torvalds
git://git.kernel.dk/linux-block Pull bio allocation fix from Jens Axboe: "We got some reports of users seeing: Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x1192888 which is a regression caused by the bio allocation cleanups" * tag 'for-5.18/alloc-cleanups-2022-03-25' of git://git.kernel.dk/linux-block: fs: do not pass __GFP_HIGHMEM to bio_alloc in do_mpage_readpage
2022-03-26Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull NVMe write streams removal from Jens Axboe: "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it. With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints. The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions, I only know of one prototype that I help do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all" * tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block: fs: remove fs.f_write_hint fs: remove kiocb.ki_hint block: remove the per-bio/request write hint nvme: remove support or stream based temperature hint
2022-03-25NFS: Don't loop forever in nfs_do_recoalesce()Trond Myklebust
If __nfs_pageio_add_request() fails to add the request, it will return with either desc->pg_error < 0, or mirror->pg_recoalesce will be set, so we are guaranteed either to exit the function altogether, or to loop. However if there is nothing left in mirror->pg_list to coalesce, we must exit, so make sure that we clear mirror->pg_recoalesce every time we loop. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: 70536bf4eb07 ("NFS: Clean up reset of the mirror accounting variables") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-25Merge tag 'fs_for_v5.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs updates from Jan Kara: "The biggest change in this pull is the addition of a deprecation message about reiserfs with the outlook that we'd eventually be able to remove it from the kernel. Because it is practically unmaintained and untested and odd enough that people don't want to bother with it anymore... Otherwise there are small udf and ext2 fixes" * tag 'fs_for_v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: remove redundant assignment of variable etype reiserfs: Deprecate reiserfs ext2: correct max file size computing reiserfs: get rid of AOP_FLAG_CONT_EXPAND flag
2022-03-25Merge tag 'fsnotify_for_v5.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "A few fsnotify improvements and cleanups" * tag 'fsnotify_for_v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: remove redundant parameter judgment fsnotify: optimize FS_MODIFY events with no ignored masks fsnotify: fix merge with parent's ignored mask
2022-03-25io_uring: fix memory leak of uid in files registrationPavel Begunkov
When there are no files for __io_sqe_files_scm() to process in the range, it'll free everything and return. However, it forgets to put uid. Fixes: 08a451739a9b5 ("io_uring: allow sparse fixed file sets") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/accee442376f33ce8aaebb099d04967533efde92.1648226048.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25Merge tag 'kbuild-gnu11-v5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild update for C11 language base from Masahiro Yamada: "Kbuild -std=gnu11 updates for v5.18 Linus pointed out the benefits of C99 some years ago, especially variable declarations in loops [1]. At that time, we were not ready for the migration due to old compilers. Recently, Jakob Koschel reported a bug in list_for_each_entry(), which leaks the invalid pointer out of the loop [2]. In the discussion, we agreed that the time had come. Now that GCC 5.1 is the minimum compiler version, there is nothing to prevent us from going to -std=gnu99, or even straight to -std=gnu11. Discussions for a better list iterator implementation are ongoing, but this patch set must land first" [1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/ [2] https://lore.kernel.org/lkml/86C4CE7D-6D93-456B-AA82-F8ADEACA40B7@gmail.com/ * tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS Kbuild: move to -std=gnu11 Kbuild: use -Wdeclaration-after-statement Kbuild: add -Wno-shift-negative-value where -Wextra is used
2022-03-25[smb3] move more common protocol header definitions to smbfs_commonSteve French
We have duplicated definitions for various SMB3 PDUs in fs/ksmbd and fs/cifs. Some had already been moved to fs/smbfs_common/smb2pdu.h Move definitions for - error response - query info and various related protocol flags - various lease handling flags and the create lease context to smbfs_common/smb2pdu.h to reduce code duplication Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-03-25fs/iomap: Fix buffered write page prefaultingAndreas Gruenbacher
When part of the user buffer passed to generic_perform_write() or iomap_file_buffered_write() cannot be faulted in for reading, the entire write currently fails. The correct behavior would be to write all the data that can be written, up to the point of failure. Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable") gave us the information needed, so fix the page prefaulting in generic_perform_write() and iomap_write_iter() to only bail out when no pages could be faulted in. We already factor in that pages that are faulted in may no longer be resident by the time they are accessed. Paging out pages has the same effect as not faulting in those pages in the first place, so the code can already deal with that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2022-03-25io_uring: fix put_kbuf without proper lockingPavel Begunkov
io_put_kbuf_comp() should only be called while holding ->completion_lock, however there is no such assumption in io_clean_op() and thus it can corrupt ->io_buffer_comp. Take the lock there, and workaround the only user of io_clean_op() calling it with locks. Not the prettiest solution, but it's easier to refactor it for-next. Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: fix invalid flags for io_put_kbuf()Pavel Begunkov
io_req_complete_failed() doesn't require callers to hold ->uring_lock, use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected place is the fail path of io_apoll_task_func(). Also add a lockdep annotation to catch such bugs in the future. Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: improve req fields commentsPavel Begunkov
Move a misplaced comment about req->creds and add a line with assumptions about req->link. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-25io_uring: enable EPOLLEXCLUSIVE for accept pollDylan Yudaken
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful when multiple accept SQEs are submitted. For O_NONBLOCK sockets multiple queued SQEs would previously have all completed at once, but most with -EAGAIN as the result. Now only one wakes up and completes. For sockets without O_NONBLOCK there is no user facing change, but internally the extra requests would previously be queued onto a worker thread as they would wake up with no connection waiting, and be punted. Now they do not wake up unnecessarily. Co-developed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220325093755.4123343-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24Merge tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "The highlights are: - several changes to how snap context and snap realms are tracked (Xiubo Li). In particular, this should resolve a long-standing issue of high kworker CPU usage and various stalls caused by needless iteration over all inodes in the snap realm. - async create fixes to address hangs in some edge cases (Jeff Layton) - support for getvxattr MDS op for querying server-side xattrs, such as file/directory layouts and ephemeral pins (Milind Changire) - average latency is now maintained for all metrics (Venky Shankar) - some tweaks around handling inline data to make it fit better with netfs helper library (David Howells) Also a couple of memory leaks got plugged along with a few assorted fixups. Last but not least, Xiubo has stepped up to serve as a CephFS co-maintainer" * tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client: (27 commits) ceph: fix memory leak in ceph_readdir when note_last_dentry returns error ceph: uninitialized variable in debug output ceph: use tracked average r/w/m latencies to display metrics in debugfs ceph: include average/stdev r/w/m latency in mds metrics ceph: track average r/w/m latency ceph: use ktime_to_timespec64() rather than jiffies_to_timespec64() ceph: assign the ci only when the inode isn't NULL ceph: fix inode reference leakage in ceph_get_snapdir() ceph: misc fix for code style and logs ceph: allocate capsnap memory outside of ceph_queue_cap_snap() ceph: do not release the global snaprealm until unmounting ceph: remove incorrect and unused CEPH_INO_DOTDOT macro MAINTAINERS: add Xiubo Li as cephfs co-maintainer ceph: eliminate the recursion when rebuilding the snap context ceph: do not update snapshot context when there is no new snapshot ceph: zero the dir_entries memory when allocating it ceph: move to a dedicated slabcache for ceph_cap_snap ceph: add getvxattr op libceph: drop else branches in prepare_read_data{,_cont} ceph: fix comments mentioning i_mutex ...
2022-03-24Merge tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs updates from Darrick Wong: "The biggest change this cycle is bringing XFS' inode attribute setting code back towards alignment with what the VFS does. IOWs, setgid bit handling should be a closer match with ext4 and btrfs behavior. The rest of the branch is bug fixes around the filesystem -- patching gaps in quota enforcement, removing bogus selinux audit messages, and fixing log corruption and problems with log recovery. There will be a second pull request later on in the merge window with more bug fixes. Dave Chinner will be taking over as XFS maintainer for one release cycle, starting from the day 5.18-rc1 drops until 5.19-rc1 is tagged so that I can focus on starting a massive design review for the (feature complete after five years) online repair feature. Summary: - Fix some incorrect mapping state being passed to iomap during COW - Don't create bogus selinux audit messages when deciding to degrade gracefully due to lack of privilege - Fix setattr implementation to use VFS helpers so that we drop setgid consistently with the other filesystems - Fix link/unlink/rename to check quota limits - Constify xfs_name_dotdot to prevent abuse of in-kernel symbols - Fix log livelock between the AIL and inodegc threads during recovery - Fix a log stall when the AIL races with pushers - Fix stalls in CIL flushes due to pinned inode cluster buffers during recovery - Fix log corruption due to incorrect usage of xfs_is_shutdown vs xlog_is_shutdown because during an induced fs shutdown, AIL writeback must continue until the log is shut down, even if the filesystem has already shut down" * tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight xfs: AIL should be log centric xfs: log items should have a xlog pointer, not a mount xfs: async CIL flushes need pending pushes to be made stable xfs: xfs_ail_push_all_sync() stalls when racing with updates xfs: check buffer pin state after locking in delwri_submit xfs: log worker needs to start before intent/unlink recovery xfs: constify xfs_name_dotdot xfs: constify the name argument to various directory functions xfs: reserve quota for target dir expansion when renaming files xfs: reserve quota for dir expansion when linking/unlinking files xfs: refactor user/group quota chown in xfs_setattr_nonsize xfs: use setattr_copy to set vfs inode attributes xfs: don't generate selinux audit messages for capability testing xfs: add missing cmap->br_state = XFS_EXT_NORM update
2022-03-24Merge tag 'dax-for-5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX updates from Dan Williams: "Andrew has been shepherding major dax features that touch the core -mm through his tree, but I still collect the dax updates that are core-mm independent. - Fix a crash due to a missing rcu_barrier() in dax_fs_exit() - Fix two miscellaneous doc issues" * tag 'dax-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix missing kdoc for dax_device dax: make sure inodes are flushed before destroy cache fsdax: fix function description
2022-03-24io_uring: improve task work cache utilizationJens Axboe
While profiling task_work intensive workloads, I noticed that most of the time in tctx_task_work() is spending stalled on loading 'req'. This is one of the unfortunate side effects of using linked lists, particularly when they end up being passe around. Prefetch the next request, if there is one. There's a sufficient amount of work in between that this makes it available for the next loop. While fiddling with the cache layout, move the link outside of the hot completion cacheline. It's rarely used in hot workloads, so better to bring in kbuf which is used for networked loads with provided buffers. This reduces tctx_task_work() overhead from ~3% to 1-1.5% in my testing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24gfs2: Make sure not to return short direct writesAndreas Gruenbacher
When direct writes fail with -ENOTBLK because we're writing into a hole (gfs2_iomap_begin()) or because of a page invalidation failure (iomap_dio_rw()), we're falling back to buffered writes. In that case, when we lose the inode glock in gfs2_file_buffered_write(), we want to re-acquire it instead of returning a short write. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-03-24gfs2: Remove dead code in gfs2_file_read_iterAndreas Gruenbacher
Function iomap_dio_rw() only returns -ENOTBLK for write requests and gfs2_file_direct_read() no longer returns -ENOTBLK since commit 1d45bb7f9d2a5 ("gfs2: Use iomap for stuffed direct I/O reads"), so there is no need to check for -ENOTBLK in gfs2_file_read_iter() anymore. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-03-24gfs2: Fix gfs2_file_buffered_write endless loop workaroundAndreas Gruenbacher
Since commit 554c577cee95b, gfs2_file_buffered_write() can accidentally return a truncated iov_iter, which might confuse callers. Fix that. Fixes: 554c577cee95b ("gfs2: Prevent endless loops in gfs2_file_buffered_write") Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2022-03-24io_uring: fix async accept on O_NONBLOCK socketsDylan Yudaken
Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this causes the accept to immediately post a CQE with EAGAIN, which means you cannot perform an accept SQE on a NONBLOCK socket asynchronously. By removing the flag if there is no pending accept then poll is armed as usual and when a connection comes in the CQE is posted. Signed-off-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-24Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: "Various misc subsystems, before getting into the post-linux-next material. 41 patches. Subsystems affected by this patch series: procfs, misc, core-kernel, lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump, taskstats, panic, kcov, resource, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang" kernel/resource: fix kfree() of bootmem memory again kcov: properly handle subsequent mmap calls kcov: split ioctl handling into locked and unlocked parts panic: move panic_print before kmsg dumpers panic: add option to dump all CPUs backtraces in panic_print docs: sysctl/kernel: add missing bit to panic_print taskstats: remove unneeded dead assignment kasan: no need to unset panic_on_warn in end_report() ubsan: no need to unset panic_on_warn in ubsan_epilogue() panic: unset panic_on_warn inside panic() docs: kdump: add scp example to write out the dump file docs: kdump: update description about sysfs file system support arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible cgroup: use irqsave in cgroup_rstat_flush_locked(). fat: use pointer to simple type in put_user() minix: fix bug when opening a file with O_DIRECT ...
2022-03-24Merge tag 'flexible-array-transformations-5.18-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull flexible-array transformations from Gustavo Silva: "Treewide patch that replaces zero-length arrays with flexible-array members. This has been baking in linux-next for a whole development cycle" * tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: treewide: Replace zero-length arrays with flexible-array members
2022-03-24Merge tag 'fs.rt.v5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull mount attributes PREEMPT_RT update from Christian Brauner: "This contains Sebastian's fix to make changing mount attributes/getting write access compatible with CONFIG_PREEMPT_RT. The change only applies when users explicitly opt-in to real-time via CONFIG_PREEMPT_RT otherwise things are exactly as before. We've waited quite a long time with this to make sure folks could take a good look" * tag 'fs.rt.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT.
2022-03-24Merge tag 'fs.v5.18' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull mount_setattr updates from Christian Brauner: "This contains a few more patches to massage the mount_setattr() codepaths and one minor fix to reuse a helper we added some time back. The final two patches do similar cleanups in different ways. One patch is mine and the other is Al's who was nice enough to give me a branch for it. Since his came in later and my branch had been sitting in -next for quite some time we just put his on top instead of swap them" * tag 'fs.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: mount_setattr(): clean the control flow and calling conventions fs: clean up mount_setattr control flow fs: don't open-code mnt_hold_writers() fs: simplify check in mount_setattr_commit() fs: add mnt_allow_writers() and simplify mount_setattr_prepare()
2022-03-24NFSv4.1: don't retry BIND_CONN_TO_SESSION on session errorOlga Kornievskaia
There is no reason to retry the operation if a session error had occurred in such case result structure isn't filled out. Fixes: dff58530c4ca ("NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION") Signed-off-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-24NFS: replace usage of found with dedicated list iterator variableJakob Koschel
To move the list iterator variable into the list_for_each_entry_*() macro in the future it should be avoided to use the list iterator variable after the loop body. To *never* use the list iterator variable after the loop it was concluded to use a separate iterator variable instead of a found boolean [1]. This removes the need to use a found variable and simply checking if the variable was set, can determine if the break/goto was hit. Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/ Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-24io_uring: remove IORING_CQE_F_MSGJens Axboe
This was introduced with the message ring opcode, but isn't strictly required for the request itself. The sender can encode what is needed in user_data, which is passed to the receiver. It's unclear if having a separate flag that essentially says "This CQE did not originate from an SQE on this ring" provides any real utility to applications. While we can always re-introduce a flag to provide this information, we cannot take it away at a later point in time. Remove the flag while we still can, before it's in a released kernel. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-23fat: use pointer to simple type in put_user()Helge Deller
The put_user(val,ptr) macro wants a pointer to a simple type, but in fat_ioctl_filldir() the d_name field references an "array of chars". Be more accurate and explicitly give the pointer to the first character of the d_name[] array. I noticed that issue while trying to optimize the parisc put_user() macro and used an intermediate variable to store the pointer. In that case I got this error: In file included from include/linux/uaccess.h:11, from include/linux/compat.h:17, from fs/fat/dir.c:18: fs/fat/dir.c: In function `fat_ioctl_filldir': fs/fat/dir.c:725:33: error: invalid initializer 725 | if (put_user(0, d2->d_name) || \ | ^~ include/asm/uaccess.h:152:33: note: in definition of macro `__put_user' 152 | __typeof__(ptr) __ptr = ptr; \ | ^~~ fs/fat/dir.c:759:1: note: in expansion of macro `FAT_IOCTL_FILLDIR_FUNC' 759 | FAT_IOCTL_FILLDIR_FUNC(fat_ioctl_filldir, __fat_dirent) Andreas Schwab <schwab@linux-m68k.org> suggested to use __typeof__(&*(ptr)) __ptr = ptr; instead. This works, but nevertheless it's probably reasonable to fix the original caller too. Link: https://lkml.kernel.org/r/Ygo+A9MREmC1H3kr@p100 Signed-off-by: Helge Deller <deller@gmx.de> Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: David Laight <David.Laight@aculab.com> Cc: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23minix: fix bug when opening a file with O_DIRECTQinghua Jin
Testcase: 1. create a minix file system and mount it 2. open a file on the file system with O_RDWR|O_CREAT|O_TRUNC|O_DIRECT 3. open fails with -EINVAL but leaves an empty file behind. All other open() failures don't leave the failed open files behind. It is hard to check the direct_IO op before creating the inode. Just as ext4 and btrfs do, this patch will resolve the issue by allowing to create the file with O_DIRECT but returning error when writing the file. Link: https://lkml.kernel.org/r/20220107133626.413379-1-qhjin.dev@gmail.com Signed-off-by: Qinghua Jin <qhjin.dev@gmail.com> Reported-by: Colin Ian King <colin.king@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23fs/pipe.c: local vars have to match types of proper pipe_inode_info fieldsAndrei Vagin
head, tail, ring_size are declared as unsigned int, so all local variables that operate with these fields have to be unsigned to avoid signed integer overflow. Right now, it isn't an issue because the maximum pipe size is limited by 1U<<31. Link: https://lkml.kernel.org/r/20220106171946.36128-1-avagin@gmail.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Suggested-by: Dmitry Safonov <0x7f454c46@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23fs/pipe: use kvcalloc to allocate a pipe_buffer arrayAndrei Vagin
Right now, kcalloc is used to allocate a pipe_buffer array. The size of the pipe_buffer struct is 40 bytes. kcalloc allows allocating reliably chunks with sizes less or equal to PAGE_ALLOC_COSTLY_ORDER (3). It means that the maximum pipe size is 3.2MB in this case. In CRIU, we use pipes to dump processes memory. CRIU freezes a target process, injects a parasite code into it and then this code splices memory into pipes. If a maximum pipe size is small, we need to do many iterations or create many pipes. kvcalloc attempt to allocate physically contiguous memory, but upon failure, fall back to non-contiguous (vmalloc) allocation and so it isn't limited by PAGE_ALLOC_COSTLY_ORDER. The maximum pipe size for non-root users is limited by the /proc/sys/fs/pipe-max-size sysctl that is 1MB by default, so only the root user will be able to trigger vmalloc allocations. Link: https://lkml.kernel.org/r/20220104171058.22580-1-avagin@gmail.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23proc/vmcore: fix vmcore_alloc_buf() kernel-doc commentYang Li
Fix a spelling problem to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/proc/vmcore.c:492: warning: Function parameter or member 'size' not described in 'vmcore_alloc_buf' fs/proc/vmcore.c:492: warning: Excess function parameter 'sizez' description in 'vmcore_alloc_buf' Link: https://lkml.kernel.org/r/20220129011449.105278-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>