summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2024-04-26Merge tag 'vfs-6.9-rc6.fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes for this merge window and the attempt to handle the ntfs removal regression that was reported a little while ago: - After the removal of the legacy ntfs driver we received reports about regressions for some people that do mount "ntfs" explicitly and expect the driver to be available. Since ntfs3 is a drop-in for legacy ntfs we alias legacy ntfs to ntfs3 just like ext3 is aliased to ext4. We also enforce legacy ntfs is always mounted read-only and give it custom file operations to ensure that ioctl()'s can't be abused to perform write operations. - Fix an unbalanced module_get() in bdev_open(). - Two smaller fixes for the netfs work done earlier in this cycle. - Fix the errno returned from the new FS_IOC_GETUUID and FS_IOC_GETFSSYSFSPATH ioctls. Both commands just pull information out of the superblock so there's no need to call into the actual ioctl handlers. So instead of returning ENOIOCTLCMD to indicate to fallback we just return ENOTTY directly avoiding that indirection" * tag 'vfs-6.9-rc6.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs: Fix the pre-flush when appending to a file in writethrough mode netfs: Fix writethrough-mode error handling ntfs3: add legacy ntfs file operations ntfs3: enforce read-only when used as legacy ntfs driver ntfs3: serve as alias for the legacy ntfs driver block: fix module reference leakage from bdev_open_by_dev error path fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail
2024-04-26netfs: Fix the pre-flush when appending to a file in writethrough modeDavid Howells
In netfs_perform_write(), when the file is marked NETFS_ICTX_WRITETHROUGH or O_*SYNC or RWF_*SYNC was specified, write-through caching is performed on a buffered file. When setting up for write-through, we flush any conflicting writes in the region and wait for the write to complete, failing if there's a write error to return. The issue arises if we're writing at or above the EOF position because we skip the flush and - more importantly - the wait. This becomes a problem if there's a partial folio at the end of the file that is being written out and we want to make a write to it too. Both the already-running write and the write we start both want to clear the writeback mark, but whoever is second causes a warning looking something like: ------------[ cut here ]------------ R=00000012: folio 11 is not under writeback WARNING: CPU: 34 PID: 654 at fs/netfs/write_collect.c:105 ... CPU: 34 PID: 654 Comm: kworker/u386:27 Tainted: G S ... ... Workqueue: events_unbound netfs_write_collection_worker ... RIP: 0010:netfs_writeback_lookup_folio Fix this by making the flush-and-wait unconditional. It will do nothing if there are no folios in the pagecache and will return quickly if there are no folios in the region specified. Further, move the WBC attachment above the flush call as the flush is going to attach a WBC and detach it again if it is not present - and since we need one anyway we might as well share it. Fixes: 41d8e7673a77 ("netfs: Implement a write-through caching option") Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202404161031.468b84f-oliver.sang@intel.com Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/2150448.1714130115@warthog.procyon.org.uk Reviewed-by: Jeffrey Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: linux-mm@kvack.org cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-25Merge tag '9p-for-6.9-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull 9p fix from Eric Van Hensbergen: "This contains a single mitigation to help deal with an apparent race condition between client and server having to deal with inode number collisions" * tag '9p-for-6.9-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: mitigate inode collisions
2024-04-25Merge tag 'nfsd-6.9-5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Revert some backchannel fixes that went into v6.9-rc * tag 'nfsd-6.9-5' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: Revert "NFSD: Convert the callback workqueue to use delayed_work" Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"
2024-04-24Merge tag 'for-6.9-rc5-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix information leak by the buffer returned from LOGICAL_INO ioctl - fix flipped condition in scrub when tracking sectors in zoned mode - fix calculation when dropping extent range - reinstate fallback to write uncompressed data in case of fragmented space that could not store the entire compressed chunk - minor fix to message formatting style to make it conforming to the commonly used style * tag 'for-6.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range() btrfs: fix information leak in btrfs_ioctl_logical_to_ino() btrfs: fallback if compressed IO fails for ENOSPC btrfs: scrub: run relocation repair when/only needed btrfs: remove colon from messages with state
2024-04-23Revert "NFSD: Convert the callback workqueue to use delayed_work"Chuck Lever
This commit was a pre-requisite for commit c1ccfcf1a9bf ("NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"), which has already been reverted. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-23Revert "NFSD: Reschedule CB operations when backchannel rpc_clnt is shut down"Chuck Lever
The reverted commit attempted to enable NFSD to retransmit pending callback operations if an NFS client disconnects, but unintentionally introduces a hazardous behavior regression if the client becomes permanently unreachable while callback operations are still pending. A disconnect can occur due to network partition or if the NFS server needs to force the NFS client to retransmit (for example, if a GSS window under-run occurs). Reverting the commit will make NFSD behave the same as it did in v6.8 and before. Pending callback operations are permanently lost if the client connection is terminated before the client receives them. For some callback operations, this loss is not harmful. However, for CB_RECALL, the loss means a delegation might be revoked unnecessarily. For CB_OFFLOAD, pending COPY operations will never complete unless the NFS client subsequently sends an OFFLOAD_STATUS operation, which the Linux NFS client does not currently implement. These issues still need to be addressed somehow. Reported-by: Dai Ngo <dai.ngo@oracle.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218735 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-04-23Merge tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull smb client fixes from Steve French: - fscache fix - fix for case where we could use uninitialized lease - add tracepoint for debugging refcounting of tcon - fix mount option regression (e.g. forceuid vs. noforceuid when uid= specified) caused by conversion to the new mount API * tag '6.9-rc5-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: reinstate original behavior again for forceuid/forcegid smb: client: fix rename(2) regression against samba cifs: Add tracing for the cifs_tcon struct refcounting cifs: Fix reacquisition of volume cookie on still-live connection
2024-04-23netfs: Fix writethrough-mode error handlingDavid Howells
Fix the error return in netfs_perform_write() acting in writethrough-mode to return any cached error in the case that netfs_end_writethrough() returns 0. This can affect the use of O_SYNC/O_DSYNC/RWF_SYNC/RWF_DSYNC in 9p and afs. Fixes: 41d8e7673a77 ("netfs: Implement a write-through caching option") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/6736.1713343639@warthog.procyon.org.uk Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Dominique Martinet <asmadeus@codewreck.org> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-23ntfs3: add legacy ntfs file operationsChristian Brauner
To ensure that ioctl()s can't be used to circumvent write restrictions. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-23ntfs3: enforce read-only when used as legacy ntfs driverChristian Brauner
Ensure that ntfs3 is mounted read-only when it is used to provide the legacy ntfs driver. Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-04-22Merge tag '6.9-rc5-ksmbd-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull smb server fixes from Steve French: "Five ksmbd server fixes, most also for stable: - rename fix - two fixes for potential out of bounds - fix for connections from MacOS (padding in close response) - fix for when to enable persistent handles" * tag '6.9-rc5-ksmbd-fixes' of git://git.samba.org/ksmbd: ksmbd: add continuous availability share parameter ksmbd: common: use struct_group_attr instead of struct_group for network_open_info ksmbd: clear RENAME_NOREPLACE before calling vfs_rename ksmbd: validate request buffer size in smb2_allocate_rsp_buf() ksmbd: fix slab-out-of-bounds in smb2_allocate_rsp_buf
2024-04-22Merge tag 'bcachefs-2024-04-22' of https://evilpiepirate.org/git/bcachefsLinus Torvalds
Pull bcachefs fixes from Kent Overstreet: "Nothing too crazy in this one, and it looks like (fingers crossed) the recovery and repair issues are settling down - although there's going to be a long tail there, as we've still yet to really ramp up on error injection or syzbot. - fix a few more deadlocks in recovery - fix u32/u64 issues in mi_btree_bitmap - btree key cache shrinker now actually frees, with more instrumentation coming so we can verify that it's working correctly more easily in the future" * tag 'bcachefs-2024-04-22' of https://evilpiepirate.org/git/bcachefs: bcachefs: If we run merges at a lower watermark, they must be nonblocking bcachefs: Fix inode early destruction path bcachefs: Fix deadlock in journal write path bcachefs: Tweak btree key cache shrinker so it actually frees bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulong bcachefs: Fix missing call to bch2_fs_allocator_background_exit() bcachefs: Check for journal entries overruning end of sb clean section bcachefs: Fix bio alloc in check_extent_checksum() bcachefs: fix leak in bch2_gc_write_reflink_key bcachefs: KEY_TYPE_error is allowed for reflink bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shift bcachefs: make sure to release last journal pin in replay bcachefs: node scan: ignore multiple nodes with same seq if interior bcachefs: Fix format specifier in validate_bset_keys() bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINE
2024-04-22cifs: reinstate original behavior again for forceuid/forcegidTakayuki Nagata
forceuid/forcegid should be enabled by default when uid=/gid= options are specified, but commit 24e0a1eff9e2 ("cifs: switch to new mount api") changed the behavior. Due to the change, a mounted share does not show intentional uid/gid for files and directories even though uid=/gid= options are specified since forceuid/forcegid are not enabled. This patch reinstates original behavior that overrides uid/gid with specified uid/gid by the options. Fixes: 24e0a1eff9e2 ("cifs: switch to new mount api") Signed-off-by: Takayuki Nagata <tnagata@redhat.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Acked-by: Ronnie Sahlberg <lsahlber@redhat.com> Acked-by: Tom Talpey <tom@talpey.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-22 fs/9p: mitigate inode collisionsEric Van Hensbergen
Detect and mitigate inode collsions that now occur since we fixed 9p generating duplicate inode structures. Underlying cause of these appears to be a race condition between reuse of inode numbers in underlying file system and cleanup of inode numbers in the client. Enabling caching makes this much more likely to happen as it increases cleanup latency due to writebacks. Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-22bcachefs: If we run merges at a lower watermark, they must be nonblockingbcachefs-2024-04-22Kent Overstreet
Fix another deadlock related to the merge path; previously, we switched to always running merges at a lower watermark (because they are noncritical); but when we run at a lower watermark we also need to run nonblocking or we've introduced a new deadlock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reported-and-tested-by: s@m-h.ug
2024-04-21Merge tag 'driver-core-6.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull kernfs bugfix and documentation update from Greg KH: "Here are two changes for 6.9-rc5 that deal with "driver core" stuff, that do the following: - sysfs reference leak fix - embargoed-hardware-issues.rst update for Power Both of these have been in linux-next for over a week with no reported issues" * tag 'driver-core-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: Documentation: embargoed-hardware-issues.rst: Add myself for Power fs: sysfs: Fix reference leak in sysfs_break_active_protection()
2024-04-20bcachefs: Fix inode early destruction pathKent Overstreet
discard_new_inode() is the wrong interface to use when we need to free an inode that was never inserted into the inode hash table; we can bypass the whole iput() -> evict() path and replace it with __destroy_inode(); kmem_cache_free() - this fixes a WARN_ON() about I_NEW. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20bcachefs: Fix deadlock in journal write pathKent Overstreet
bch2_journal_write() was incorrectly waiting on earlier journal writes synchronously; this usually worked because most of the time we'd be running in the context of a thread that did a journal_buf_put(), but sometimes we'd be running out of the same workqueue that completes those prior journal writes. Additionally, this makes sure to punt to a workqueue before submitting preflushes - we really don't want to be calling submit_bio() in the main transaction commit path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20bcachefs: Tweak btree key cache shrinker so it actually freesKent Overstreet
Freeing key cache items is a multi stage process; we need to wait for an SRCU grace period to elapse, and we handle this ourselves - partially to avoid callback overhead, but primarily so that when allocating we can first allocate from the freed items waiting for an SRCU grace period. Previously, the shrinker was counting the items on the 'waiting for SRCU grace period' lists as items being scanned, but this meant that too many items waiting for an SRCU grace period could prevent it from doing any work at all. After this, we're seeing that items skipped due to the accessed bit are the main cause of the shrinker not making any progress, and we actually want the key cache shrinker to run quite aggressively because reclaimed items will still generally be found (more compactly) in the btree node cache - so we also tweak the shrinker to not count those against nr_to_scan. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20bcachefs: bkey_cached.btree_trans_barrier_seq needs to be a ulongKent Overstreet
this stores the SRCU sequence number, which we use to check if an SRCU barrier has elapsed; this is a partial fix for the key cache shrinker not actually freeing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20bcachefs: Fix missing call to bch2_fs_allocator_background_exit()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-20bcachefs: Check for journal entries overruning end of sb clean sectionKent Overstreet
Fix a missing bounds check in superblock validation. Note that we don't yet have repair code for this case - repair code for individual items is generally low priority, since the whole superblock is checksummed, validated prior to write, and we have backups. Reported-by: lei lu <llfamsec@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-19ksmbd: add continuous availability share parameterNamjae Jeon
If capabilities of the share is not SMB2_SHARE_CAP_CONTINUOUS_AVAILABILITY, ksmbd should not grant a persistent handle to the client. This patch add continuous availability share parameter to control it. Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19ksmbd: common: use struct_group_attr instead of struct_group for ↵Namjae Jeon
network_open_info 4byte padding cause the connection issue with the applications of MacOS. smb2_close response size increases by 4 bytes by padding, And the smb client of MacOS check it and stop the connection. This patch use struct_group_attr instead of struct_group for network_open_info to use __packed to avoid padding. Fixes: 0015eb6e1238 ("smb: client, common: fix fortify warnings") Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19ksmbd: clear RENAME_NOREPLACE before calling vfs_renameMarios Makassikis
File overwrite case is explicitly handled, so it is not necessary to pass RENAME_NOREPLACE to vfs_rename. Clearing the flag fixes rename operations when the share is a ntfs-3g mount. The latter uses an older version of fuse with no support for flags in the ->rename op. Cc: stable@vger.kernel.org Signed-off-by: Marios Makassikis <mmakassikis@freebox.fr> Acked-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19ksmbd: validate request buffer size in smb2_allocate_rsp_buf()Namjae Jeon
The response buffer should be allocated in smb2_allocate_rsp_buf before validating request. But the fields in payload as well as smb2 header is used in smb2_allocate_rsp_buf(). This patch add simple buffer size validation to avoid potencial out-of-bounds in request buffer. Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19ksmbd: fix slab-out-of-bounds in smb2_allocate_rsp_bufNamjae Jeon
If ->ProtocolId is SMB2_TRANSFORM_PROTO_NUM, smb2 request size validation could be skipped. if request size is smaller than sizeof(struct smb2_query_info_req), slab-out-of-bounds read can happen in smb2_allocate_rsp_buf(). This patch allocate response buffer after decrypting transform request. smb3_decrypt_req() will validate transform request size and avoid slab-out-of-bound in smb2_allocate_rsp_buf(). Reported-by: Norbert Szetei <norbert@doyensec.com> Cc: stable@vger.kernel.org Signed-off-by: Namjae Jeon <linkinjeon@kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19smb: client: fix rename(2) regression against sambaPaulo Alcantara
After commit 2c7d399e551c ("smb: client: reuse file lease key in compound operations") the client started reusing lease keys for rename, unlink and set path size operations to prevent it from breaking its own leases and thus causing unnecessary lease breaks to same connection. The implementation relies on positive dentries and cifsInodeInfo::lease_granted to decide whether reusing lease keys for the compound requests. cifsInodeInfo::lease_granted was introduced by commit 0ab95c2510b6 ("Defer close only when lease is enabled.") to indicate whether lease caching is granted for a specific file, but that can only happen until file is open, so cifsInodeInfo::lease_granted was left uninitialised in ->alloc_inode and then client started sending random lease keys for files that hadn't any leases. This fixes the following test case against samba: mount.cifs //srv/share /mnt/1 -o ...,nosharesock mount.cifs //srv/share /mnt/2 -o ...,nosharesock touch /mnt/1/foo; tail -f /mnt/1/foo & pid=$! mv /mnt/2/foo /mnt/2/bar # fails with -EIO kill $pid Fixes: 0ab95c2510b6 ("Defer close only when lease is enabled.") Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19cifs: Add tracing for the cifs_tcon struct refcountingDavid Howells
Add tracing for the refcounting/lifecycle of the cifs_tcon struct, marking different events with different labels and giving each tcon its own debug ID so that the tracelines corresponding to individual tcons can be distinguished. This can be enabled with: echo 1 >/sys/kernel/debug/tracing/events/cifs/smb3_tcon_ref/enable Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Shyam Prasad N <nspmangalore@gmail.com> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19cifs: Fix reacquisition of volume cookie on still-live connectionDavid Howells
During mount, cifs_mount_get_tcon() gets a tcon resource connection record and then attaches an fscache volume cookie to it. However, it does this irrespective of whether or not the tcon returned from cifs_get_tcon() is a new record or one that's already in use. This leads to a warning about a volume cookie collision and a leaked volume cookie because tcon->fscache gets reset. Fix this be adding a mutex and a "we've already tried this" flag and only doing it once for the lifetime of the tcon. [!] Note: Looking at cifs_mount_get_tcon(), a more general solution may actually be required. Reacquiring the volume cookie isn't the only thing that function does: it also partially reinitialises the tcon record without any locking - which may cause live filesystem ops already using the tcon through a previous mount to malfunction. This can be reproduced simply by something like: mount //example.com/test /xfstest.test -o user=shares,pass=xxx,fsc mount //example.com/test /mnt -o user=shares,pass=xxx,fsc Fixes: 70431bfd825d ("cifs: Support fscache indexing rewrite") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com> cc: Shyam Prasad N <sprasad@microsoft.com> cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2024-04-19Merge tag '9p-fixes-for-6.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs Pull fs/9p fixes from Eric Van Hensbergen: "This contains a reversion of one of the original 6.9 patches which seems to have been the cause of most of the instability. It also incorporates several fixes to legacy support and cache fixes. There are few additional changes to improve stability, but I want another week of testing before sending them upstream" * tag '9p-fixes-for-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: fs/9p: drop inodes immediately on non-.L too fs/9p: Revert "fs/9p: fix dups even in uncached mode" fs/9p: remove erroneous nlink init from legacy stat2inode 9p: explicitly deny setlease attempts fs/9p: fix the cache always being enabled on files with qid flags fs/9p: translate O_TRUNC into OTRUNC fs/9p: only translate RWX permissions for plain 9P2000
2024-04-19Merge tag 'fuse-fixes-6.9-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: - Fix two bugs in the new passthrough mode - Fix a statx bug introduced in v6.6 - Fix code documentation * tag 'fuse-fixes-6.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: cuse: add kernel-doc comments to cuse_process_init_reply() fuse: fix leaked ENOSYS error on first statx call fuse: fix parallel dio write on file open in passthrough mode fuse: fix wrong ff->iomode state changes from parallel dio write
2024-04-19Merge tag 'mm-hotfixes-stable-2024-04-18-14-41' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "15 hotfixes. 9 are cc:stable and the remainder address post-6.8 issues or aren't considered suitable for backporting. There are a significant number of fixups for this cycle's page_owner changes (series "page_owner: print stacks and their outstanding allocations"). Apart from that, singleton changes all over, mainly in MM" * tag 'mm-hotfixes-stable-2024-04-18-14-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: nilfs2: fix OOB in nilfs_set_de_type MAINTAINERS: update Naoya Horiguchi's email address fork: defer linking file vma until vma is fully initialized mm/shmem: inline shmem_is_huge() for disabled transparent hugepages mm,page_owner: defer enablement of static branch Squashfs: check the inode number is not the invalid value of zero mm,swapops: update check in is_pfn_swap_entry for hwpoison entries mm/memory-failure: fix deadlock when hugetlb_optimize_vmemmap is enabled mm/userfaultfd: allow hugetlb change protection upon poison entry mm,page_owner: fix printing of stack records mm,page_owner: fix accounting of pages when migrating mm,page_owner: fix refcount imbalance mm,page_owner: update metadata for tail pages userfaultfd: change src_folio after ensuring it's unpinned in UFFDIO_MOVE mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly
2024-04-18btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range()Qu Wenruo
[BUG] During my extent_map cleanup/refactor, with extra sanity checks, extent-map-tests::test_case_7() would not pass the checks. The problem is, after btrfs_drop_extent_map_range(), the resulted extent_map has a @block_start way too large. Meanwhile my btrfs_file_extent_item based members are returning a correct @disk_bytenr/@offset combination. The extent map layout looks like this: 0 16K 32K 48K | PINNED | | Regular | The regular em at [32K, 48K) also has 32K @block_start. Then drop range [0, 36K), which should shrink the regular one to be [36K, 48K). However the @block_start is incorrect, we expect 32K + 4K, but got 52K. [CAUSE] Inside btrfs_drop_extent_map_range() function, if we hit an extent_map that covers the target range but is still beyond it, we need to split that extent map into half: |<-- drop range -->| |<----- existing extent_map --->| And if the extent map is not compressed, we need to forward extent_map::block_start by the difference between the end of drop range and the extent map start. However in that particular case, the difference is calculated using (start + len - em->start). The problem is @start can be modified if the drop range covers any pinned extent. This leads to wrong calculation, and would be caught by my later extent_map sanity checks, which checks the em::block_start against btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset. This is a regression caused by commit c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range"), which removed the @len update for pinned extents. [FIX] Fix it by avoiding using @start completely, and use @end - em->start instead, which @end is exclusive bytenr number. And update the test case to verify the @block_start to prevent such problem from happening. Thankfully this is not going to lead to any data corruption, as IO path does not utilize btrfs_drop_extent_map_range() with @skip_pinned set. So this fix is only here for the sake of consistency/correctness. CC: stable@vger.kernel.org # 6.5+ Fixes: c962098ca4af ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18btrfs: fix information leak in btrfs_ioctl_logical_to_ino()Johannes Thumshirn
Syzbot reported the following information leak for in btrfs_ioctl_logical_to_ino(): BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline] BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40 instrument_copy_to_user include/linux/instrumented.h:114 [inline] _copy_to_user+0xbc/0x110 lib/usercopy.c:40 copy_to_user include/linux/uaccess.h:191 [inline] btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499 btrfs_ioctl+0x714/0x1260 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:904 [inline] __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890 __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890 x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Uninit was created at: __kmalloc_large_node+0x231/0x370 mm/slub.c:3921 __do_kmalloc_node mm/slub.c:3954 [inline] __kmalloc_node+0xb07/0x1060 mm/slub.c:3973 kmalloc_node include/linux/slab.h:648 [inline] kvmalloc_node+0xc0/0x2d0 mm/util.c:634 kvmalloc include/linux/slab.h:766 [inline] init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779 btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480 btrfs_ioctl+0x714/0x1260 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:904 [inline] __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890 __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890 x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f Bytes 40-65535 of 65536 are uninitialized Memory access of size 65536 starts at ffff888045a40000 This happens, because we're copying a 'struct btrfs_data_container' back to user-space. This btrfs_data_container is allocated in 'init_data_container()' via kvmalloc(), which does not zero-fill the memory. Fix this by using kvzalloc() which zeroes out the memory on allocation. CC: stable@vger.kernel.org # 4.14+ Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-17Merge tag 'for-6.9-rc4-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fixup in zoned mode for out-of-order writes of metadata that are no longer necessary, this used to be tracked in a separate list but now the old locaion needs to be zeroed out, also add assertions - fix bulk page allocation retry, this may stall after first failure for compression read/write * tag 'for-6.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: do not wait for short bulk allocation btrfs: zoned: add ASSERT and WARN for EXTENT_BUFFER_ZONED_ZEROOUT handling btrfs: zoned: do not flag ZEROOUT on non-dirty extent buffer
2024-04-18btrfs: fallback if compressed IO fails for ENOSPCSweet Tea Dorminy
In commit b4ccace878f4 ("btrfs: refactor submit_compressed_extents()"), if an async extent compressed but failed to find enough space, we changed from falling back to an uncompressed write to just failing the write altogether. The principle was that if there's not enough space to write the compressed version of the data, there can't possibly be enough space to write the larger, uncompressed version of the data. However, this isn't necessarily true: due to fragmentation, there could be enough discontiguous free blocks to write the uncompressed version, but not enough contiguous free blocks to write the smaller but unsplittable compressed version. This has occurred to an internal workload which relied on write()'s return value indicating there was space. While rare, it has happened a few times. Thus, in order to prevent early ENOSPC, re-add a fallback to uncompressed writing. Fixes: b4ccace878f4 ("btrfs: refactor submit_compressed_extents()") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Co-developed-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Neal Gompa <neal@gompa.dev> Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18btrfs: scrub: run relocation repair when/only neededNaohiro Aota
When btrfs scrub finds an error, it reads mirrors to find correct data. If all the errors are fixed, sctx->error_bitmap is cleared for the stripe range. However, in the zoned mode, it runs relocation to repair scrub errors when the bitmap is *not* empty, which is a flipped condition. Also, it runs the relocation even if the scrub is read-only. This was missed by a fix in commit 1f2030ff6e49 ("btrfs: scrub: respect the read-only flag during repair"). The repair is only necessary when there is a repaired sector and should be done on read-write scrub. So, tweak the condition for both regular and zoned case. Fixes: 54765392a1b9 ("btrfs: scrub: introduce helper to queue a stripe for scrub") Fixes: 1f2030ff6e49 ("btrfs: scrub: respect the read-only flag during repair") CC: stable@vger.kernel.org # 6.6+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-18btrfs: remove colon from messages with stateDavid Sterba
The message format in syslog is usually made of two parts: prefix ":" message Various tools parse the prefix up to the first ":". When there's an additional status of a btrfs filesystem like [5.199782] BTRFS info (device nvme1n1p1: state M): use zstd compression, level 9 where 'M' is for remount, there's one more ":" that does not conform to the format. Remove it. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-04-17bcachefs: Fix bio alloc in check_extent_checksum()Kent Overstreet
if the buffer is virtually mapped it won't be a single bvec Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17bcachefs: fix leak in bch2_gc_write_reflink_keyKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17bcachefs: KEY_TYPE_error is allowed for reflinkKent Overstreet
KEY_TYPE_error is left behind when we have to delete all pointers in an extent in fsck; it allows errors to be correctly returned by reads later. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-17bcachefs: Fix bch2_dev_btree_bitmap_marked_sectors() shiftKent Overstreet
Fixes: 27c15ed297cb bcachefs: bch_member.btree_allocated_bitmap Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: make sure to release last journal pin in replayKent Overstreet
This fixes a deadlock when journal replay has many keys to insert that were from fsck, not the journal. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: node scan: ignore multiple nodes with same seq if interiorKent Overstreet
Interior nodes are not really needed, when we have to scan - but if this pops up for leaf nodes we'll need a real heuristic. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: Fix format specifier in validate_bset_keys()Nathan Chancellor
When building for 32-bit platforms, for which size_t is 'unsigned int', there is a warning from a format string in validate_bset_keys(): fs/bcachefs/btree_io.c: In function 'validate_bset_keys': fs/bcachefs/btree_io.c:891:34: error: format '%lu' expects argument of type 'long unsigned int', but argument 12 has type 'unsigned int' [-Werror=format=] 891 | "bad k->u64s %u (min %u max %lu)", k->u64s, | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/bcachefs/btree_io.c:603:32: note: in definition of macro 'btree_err' 603 | msg, ##__VA_ARGS__); \ | ^~~ fs/bcachefs/btree_io.c:887:21: note: in expansion of macro 'btree_err_on' 887 | if (btree_err_on(!bkeyp_u64s_valid(&b->format, k), | ^~~~~~~~~~~~ fs/bcachefs/btree_io.c:891:64: note: format string is defined here 891 | "bad k->u64s %u (min %u max %lu)", k->u64s, | ~~^ | | | long unsigned int | %u cc1: all warnings being treated as errors BKEY_U64s is size_t so the entire expression is promoted to size_t. Use the '%zu' specifier so that there is no warning regardless of the width of size_t. Fixes: 031ad9e7dbd1 ("bcachefs: Check for packed bkeys that are too big") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202404130747.wH6Dd23p-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202404131536.HdAMBOVc-lkp@intel.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16bcachefs: Fix null ptr deref in twf from BCH_IOCTL_FSCK_OFFLINEKent Overstreet
We need to initialize the stdio redirects before they're used. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-04-16nilfs2: fix OOB in nilfs_set_de_typeJeongjun Park
The size of the nilfs_type_by_mode array in the fs/nilfs2/dir.c file is defined as "S_IFMT >> S_SHIFT", but the nilfs_set_de_type() function, which uses this array, specifies the index to read from the array in the same way as "(mode & S_IFMT) >> S_SHIFT". static void nilfs_set_de_type(struct nilfs_dir_entry *de, struct inode *inode) { umode_t mode = inode->i_mode; de->file_type = nilfs_type_by_mode[(mode & S_IFMT)>>S_SHIFT]; // oob } However, when the index is determined this way, an out-of-bounds (OOB) error occurs by referring to an index that is 1 larger than the array size when the condition "mode & S_IFMT == S_IFMT" is satisfied. Therefore, a patch to resize the nilfs_type_by_mode array should be applied to prevent OOB errors. Link: https://lkml.kernel.org/r/20240415182048.7144-1-konishi.ryusuke@gmail.com Reported-by: syzbot+2e22057de05b9f3b30d8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=2e22057de05b9f3b30d8 Fixes: 2ba466d74ed7 ("nilfs2: directory entry operations") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-16Squashfs: check the inode number is not the invalid value of zeroPhillip Lougher
Syskiller has produced an out of bounds access in fill_meta_index(). That out of bounds access is ultimately caused because the inode has an inode number with the invalid value of zero, which was not checked. The reason this causes the out of bounds access is due to following sequence of events: 1. Fill_meta_index() is called to allocate (via empty_meta_index()) and fill a metadata index. It however suffers a data read error and aborts, invalidating the newly returned empty metadata index. It does this by setting the inode number of the index to zero, which means unused (zero is not a valid inode number). 2. When fill_meta_index() is subsequently called again on another read operation, locate_meta_index() returns the previous index because it matches the inode number of 0. Because this index has been returned it is expected to have been filled, and because it hasn't been, an out of bounds access is performed. This patch adds a sanity check which checks that the inode number is not zero when the inode is created and returns -EINVAL if it is. [phillip@squashfs.org.uk: whitespace fix] Link: https://lkml.kernel.org/r/20240409204723.446925-1-phillip@squashfs.org.uk Link: https://lkml.kernel.org/r/20240408220206.435788-1-phillip@squashfs.org.uk Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk> Reported-by: "Ubisectech Sirius" <bugreport@ubisectech.com> Closes: https://lore.kernel.org/lkml/87f5c007-b8a5-41ae-8b57-431e924c5915.bugreport@ubisectech.com/ Cc: Christian Brauner <brauner@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>