summaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2021-06-03xfs: refactor per-AG inode tagging functionsinode-walk-cleanups-5.14_2021-06-06inode-walk-cleanups-5.14_2021-06-05inode-walk-cleanups-5.14_2021-06-03Darrick J. Wong
In preparation for adding another incore inode tree tag, refactor the code that sets and clears tags from the per-AG inode tree and the tree of per-AG structures, and remove the open-coded versions used by the blockgc code. Note: For reclaim, we now rely on the radix tree tags instead of the reclaimable inode count more heavily than we used to. The conversion should be fine, but the logic isn't 100% identical. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: merge xfs_reclaim_inodes_ag into xfs_inode_walk_agDarrick J. Wong
Merge these two inode walk loops together, since they're pretty similar now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: pass struct xfs_eofblocks to the inode scan callbackDarrick J. Wong
Pass a pointer to the actual eofb structure around the inode scanner functions instead of a void pointer, now that none of the functions is used as a callback. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: fix radix tree tag signsDarrick J. Wong
Radix tree tags are supposed to be unsigned ints, so fix the callers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: make the icwalk processing functions clean up the grab stateDarrick J. Wong
Soon we're going to be adding two new callers to the incore inode walk code: reclaim of incore inodes, and (later) inactivation of inodes. Both states operate on inodes that no longer have any VFS state, so we need to move the xfs_irele calls into the processing functions. In other words, icwalk processing functions are responsible for cleaning up whatever state changes are made by the corresponding icwalk igrab function that picked the inode for processing. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: clean up inode state flag tests in xfs_blockgc_igrabDarrick J. Wong
Clean up the definition of which inode states are not eligible for speculative preallocation garbage collecting by creating a private #define. The deferred inactivation patchset will add two new entries to the set of flags-to-ignore, so we want the definition not to end up a cluttered mess. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: remove indirect calls from xfs_inode_walk{,_ag}Darrick J. Wong
It turns out that there is a 1:1 mapping between the execute and goal parameters that are passed to xfs_inode_walk_ag: xfs_blockgc_scan_inode <=> XFS_ICWALK_BLOCKGC xfs_dqrele_inode <=> XFS_ICWALK_DQRELE Because of this exact correspondence, we don't need the execute function pointer and can replace it with a direct call. For the price of a forward static declaration, we can eliminate the indirect function call. This likely has a negligible impact on performance (since the execute function runs transactions), but it also simplifies the function signature. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: remove iter_flags parameter from xfs_inode_walk_*Darrick J. Wong
The sole iter_flags is XFS_INODE_WALK_INEW_WAIT, and there are no users. Remove the flag, and the parameter, and all the code that used it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: move xfs_inew_wait call into xfs_dqrele_inodeDarrick J. Wong
Move the INEW wait into xfs_dqrele_inode so that we can drop the iter_flags parameter in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: separate the dqrele_all inode grab logic from xfs_inode_walk_ag_grabDarrick J. Wong
Disentangle the dqrele_all inode grab code from the "generic" inode walk grabbing code, and and use the opportunity to document why the dqrele grab function does what it does. Since xfs_inode_walk_ag_grab is now only used for blockgc, rename it to reflect that. Ultimately, there will be four reasons to perform a walk of incore inodes: quotaoff dquote releasing (dqrele), garbage collection of speculative preallocations (blockgc), reclamation of incore inodes (reclaim), and deferred inactivation (inodegc). Each of these four have their own slightly different criteria for deciding if they want to handle an inode, so it makes more sense to have four cohesive igrab functions than one confusing parameteric grab function like we do now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: pass the goal of the incore inode walk to xfs_inode_walk()Darrick J. Wong
As part of removing the indirect calls and radix tag implementation details from the incore inode walk loop, create an enum to represent the goal of the inode iteration. More immediately, this separate removes the need for the "ICI_NOTAG" define which makes little sense. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: rename xfs_inode_walk functions to xfs_icwalkDarrick J. Wong
Shorten the prefix so that all the incore inode cache walk code has "xfs_icwalk" in the name somewhere. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: move the inode walk functions further downDarrick J. Wong
Move the inode walk functions further down in the file to limit the forward declarations to the two walk functions as we add new code that uses the inode walks. We'll clean them out later (i.e. after the deferred inode inactivation series). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: detach inode dquots at the end of inactivationDarrick J. Wong
Once we're done with inactivating an inode, we're finished updating metadata for that inode. This means that we can detach the dquots at the end and not have to wait for reclaim to do it for us. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-03xfs: move the quotaoff dqrele inode walk into xfs_icache.cDarrick J. Wong
The only external caller of xfs_inode_walk* happens in quotaoff, when we want to walk all the incore inodes to detach the dquots. Move this code to xfs_icache.c so that we can hide xfs_inode_walk as the starting step in more cleanups of inode walks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-06-02xfs: don't take a spinlock unconditionally in the DIO fastpathxfs-merge-5.14_2021-06-02assorted-fixes-5.14-1_2021-06-06assorted-fixes-5.14-1_2021-06-05assorted-fixes-5.14-1_2021-06-03Dave Chinner
Because this happens at high thread counts on high IOPS devices doing mixed read/write AIO-DIO to a single file at about a million iops: 64.09% 0.21% [kernel] [k] io_submit_one - 63.87% io_submit_one - 44.33% aio_write - 42.70% xfs_file_write_iter - 41.32% xfs_file_dio_write_aligned - 25.51% xfs_file_write_checks - 21.60% _raw_spin_lock - 21.59% do_raw_spin_lock - 19.70% __pv_queued_spin_lock_slowpath This also happens of the IO completion IO path: 22.89% 0.69% [kernel] [k] xfs_dio_write_end_io - 22.49% xfs_dio_write_end_io - 21.79% _raw_spin_lock - 20.97% do_raw_spin_lock - 20.10% __pv_queued_spin_lock_slowpath IOWs, fio is burning ~14 whole CPUs on this spin lock. So, do an unlocked check against inode size first, then if we are at/beyond EOF, take the spinlock and recheck. This makes the spinlock disappear from the overwrite fastpath. I'd like to report that fixing this makes things go faster. It doesn't - it just exposes the the XFS_ILOCK as the next severe contention point doing extent mapping lookups, and that now burns all the 14 CPUs this spinlock was burning. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: mark xfs_bmap_set_attrforkoff staticChristoph Hellwig
xfs_bmap_set_attrforkoff is only used inside of xfs_bmap.c, so mark it static. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: Remove redundant assignment to busyJiapeng Chong
Variable busy is set to false, but this value is never read as it is overwritten or not used later on, hence it is a redundant assignment and can be removed. Clean up the following clang-analyzer warning: fs/xfs/libxfs/xfs_alloc.c:1679:2: warning: Value stored to 'busy' is never read [clang-analyzer-deadcode.DeadStores]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: sort variable alphabetically to avoid repeated declarationShaokun Zhang
Variable 'xfs_agf_buf_ops', 'xfs_agi_buf_ops', 'xfs_dquot_buf_ops' and 'xfs_symlink_buf_ops' are declared twice, so sort these variables alphabetically and remove the repeated declaration. Cc: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-01xfs: remove unnecessary shiftsunit-conversion-cleanups-5.14_2021-06-06unit-conversion-cleanups-5.14_2021-06-05unit-conversion-cleanups-5.14_2021-06-03unit-conversion-cleanups-5.14_2021-06-01Darrick J. Wong
The superblock verifier already validates that (1 << blocklog) == blocksize, so use the value directly instead of doing math. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-06-01xfs: clean up open-coded fs block unit conversionsDarrick J. Wong
Replace some open-coded fs block unit conversions with the standard conversion macro. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2021-05-29Merge tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fixes from Darrick Wong: "This week's pile mitigates some decades-old problems in how extent size hints interact with realtime volumes, fixes some failures in online shrink, and fixes a problem where directory and symlink shrinking on extremely fragmented filesystems could fail. The most user-notable change here is to point users at our (new) IRC channel on OFTC. Freedom isn't free, it costs folks like you and me; and if you don't kowtow, they'll expel everyone and take over your channel. (Ok, ok, that didn't fit the song lyrics...) Summary: - Fix a bug where unmapping operations end earlier than expected, which can cause chaos on multi-block directory and symlink shrink operations. - Fix an erroneous assert that can trigger if we try to transition a bmap structure from btree format to extents format with zero extents. This was exposed by xfs/538" * tag 'xfs-5.13-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: bunmapi has unnecessary AG lock ordering issues xfs: btree format inode forks can have zero extents xfs: add new IRC channel to MAINTAINERS xfs: validate extsz hints against rt extent size when rtinherit is set xfs: standardize extent size hint validation xfs: check free AG space when making per-AG reservations
2021-05-29Merge tag 'driver-core-5.13-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are three small driver core / debugfs fixes for 5.13-rc4: - debugfs fix for incorrect "lockdown" mode for selinux accesses - two device link changes, one bugfix and one cleanup All of these have been in linux-next for over a week with no reported problems" * tag 'driver-core-5.13-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: drivers: base: Reduce device link removal code duplication drivers: base: Fix device link removal debugfs: fix security_locked_down() call for SELinux
2021-05-28Merge tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring fixes from Jens Axboe: "A few minor fixes: - Fix an issue with hashed wait removal on exit (Zqiang, Pavel) - Fix a recent data race introduced in this series (Marco)" * tag 'io_uring-5.13-2021-05-28' of git://git.kernel.dk/linux-block: io_uring: fix data race to avoid potential NULL-deref io-wq: Fix UAF when wakeup wqe in hash waitqueue io_uring/io-wq: close io-wq full-stop gap
2021-05-28Merge tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French: "Three SMB3 fixes. Two for stable, and the other fixes a problem pointed out with a recently added ioctl" * tag '5.13-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6: cifs: change format of CIFS_FULL_KEY_DUMP ioctl cifs: fix string declarations and assignments in tracepoints cifs: set server->cipher_type to AES-128-CCM for SMB3.0
2021-05-28Merge tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: "Stable fixes: - Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config - Fix Oops in xs_tcp_send_request() when transport is disconnected - Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() Bugfixes: - Fix instances where signal_pending() should be fatal_signal_pending() - fix an incorrect limit in filelayout_decode_layout() - Fixes for the SUNRPC backlogged RPC queue - Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() - Revert commit 586a0787ce35 ("Clean up rpcrdma_prepare_readch()")" * tag 'nfs-for-5.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: Remove trailing semicolon in macros xprtrdma: Revert 586a0787ce35 NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 config NFS: Clean up reset of the mirror accounting variables NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce() NFS: Fix an Oopsable condition in __nfs_pageio_add_request() SUNRPC: More fixes for backlog congestion SUNRPC: Fix Oops in xs_tcp_send_request() when transport is disconnected NFSv4: Fix a NULL pointer dereference in pnfs_mark_matching_lsegs_return() SUNRPC in case of backlog, hand free slots directly to waiting task pNFS/NFSv4: Remove redundant initialization of 'rd_size' NFS: fix an incorrect limit in filelayout_decode_layout() fs/nfs: Use fatal_signal_pending instead of signal_pending
2021-05-27cifs: change format of CIFS_FULL_KEY_DUMP ioctlAurelien Aptel
Make CIFS_FULL_KEY_DUMP ioctl able to return variable-length keys. * userspace needs to pass the struct size along with optional session_id and some space at the end to store keys * if there is enough space kernel returns keys in the extra space and sets the length of each key via xyz_key_length fields This also fixes the build error for get_user() on ARM. Sample program: #include <stdlib.h> #include <stdio.h> #include <stdint.h> #include <sys/fcntl.h> #include <sys/ioctl.h> struct smb3_full_key_debug_info { uint32_t in_size; uint64_t session_id; uint16_t cipher_type; uint8_t session_key_length; uint8_t server_in_key_length; uint8_t server_out_key_length; uint8_t data[]; /* * return this struct with the keys appended at the end: * uint8_t session_key[session_key_length]; * uint8_t server_in_key[server_in_key_length]; * uint8_t server_out_key[server_out_key_length]; */ } __attribute__((packed)); #define CIFS_IOCTL_MAGIC 0xCF #define CIFS_DUMP_FULL_KEY _IOWR(CIFS_IOCTL_MAGIC, 10, struct smb3_full_key_debug_info) void dump(const void *p, size_t len) { const char *hex = "0123456789ABCDEF"; const uint8_t *b = p; for (int i = 0; i < len; i++) printf("%c%c ", hex[(b[i]>>4)&0xf], hex[b[i]&0xf]); putchar('\n'); } int main(int argc, char **argv) { struct smb3_full_key_debug_info *keys; uint8_t buf[sizeof(*keys)+1024] = {0}; size_t off = 0; int fd, rc; keys = (struct smb3_full_key_debug_info *)&buf; keys->in_size = sizeof(buf); fd = open(argv[1], O_RDONLY); if (fd < 0) perror("open"), exit(1); rc = ioctl(fd, CIFS_DUMP_FULL_KEY, keys); if (rc < 0) perror("ioctl"), exit(1); printf("SessionId "); dump(&keys->session_id, 8); printf("Cipher %04x\n", keys->cipher_type); printf("SessionKey "); dump(keys->data+off, keys->session_key_length); off += keys->session_key_length; printf("ServerIn Key "); dump(keys->data+off, keys->server_in_key_length); off += keys->server_in_key_length; printf("ServerOut Key "); dump(keys->data+off, keys->server_out_key_length); return 0; } Usage: $ gcc -o dumpkeys dumpkeys.c Against Windows Server 2020 preview (with AES-256-GCM support): # mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.0,seal" # ./dumpkeys /mnt/somefile SessionId 0D 00 00 00 00 0C 00 00 Cipher 0002 SessionKey AB CD CC 0D E4 15 05 0C 6F 3C 92 90 19 F3 0D 25 ServerIn Key 73 C6 6A C8 6B 08 CF A2 CB 8E A5 7D 10 D1 5B DC ServerOut Key 6D 7E 2B A1 71 9D D7 2B 94 7B BA C4 F0 A5 A4 F8 # umount /mnt With 256 bit keys: # echo 1 > /sys/module/cifs/parameters/require_gcm_256 # mount.cifs //$ip/test /mnt -o "username=administrator,password=foo,vers=3.11,seal" # ./dumpkeys /mnt/somefile SessionId 09 00 00 00 00 0C 00 00 Cipher 0004 SessionKey 93 F5 82 3B 2F B7 2A 50 0B B9 BA 26 FB 8C 8B 03 ServerIn Key 6C 6A 89 B2 CB 7B 78 E8 04 93 37 DA 22 53 47 DF B3 2C 5F 02 26 70 43 DB 8D 33 7B DC 66 D3 75 A9 ServerOut Key 04 11 AA D7 52 C7 A8 0F ED E3 93 3A 65 FE 03 AD 3F 63 03 01 2B C0 1B D7 D7 E5 52 19 7F CC 46 B4 Signed-off-by: Aurelien Aptel <aaptel@suse.com> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-27cifs: fix string declarations and assignments in tracepointsShyam Prasad N
We missed using the variable length string macros in several tracepoints. Fixed them in this change. There's probably more useful macros that we can use to print others like flags etc. But I'll submit sepawrate patches for those at a future date. Signed-off-by: Shyam Prasad N <sprasad@microsoft.com> Cc: <stable@vger.kernel.org> # v5.12 Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-27cifs: set server->cipher_type to AES-128-CCM for SMB3.0Aurelien Aptel
SMB3.0 doesn't have encryption negotiate context but simply uses the SMB2_GLOBAL_CAP_ENCRYPTION flag. When that flag is present in the neg response cifs.ko uses AES-128-CCM which is the only cipher available in this context. cipher_type was set to the server cipher only when parsing encryption negotiate context (SMB3.1.1). For SMB3.0 it was set to 0. This means cipher_type value can be 0 or 1 for AES-128-CCM. Fix this by checking for SMB3.0 and encryption capability and setting cipher_type appropriately. Signed-off-by: Aurelien Aptel <aaptel@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com>
2021-05-27afs: Fix the nlink handling of dir-over-dir renameDavid Howells
Fix rename of one directory over another such that the nlink on the deleted directory is cleared to 0 rather than being decremented to 1. This was causing the generic/035 xfstest to fail. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/r/162194384460.3999479.7605572278074191079.stgit@warthog.procyon.org.uk/ # v1 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-27xfs: bunmapi has unnecessary AG lock ordering issuesxfs-5.13-fixes-3Dave Chinner
large directory block size operations are assert failing because xfs_bunmapi() is not completely removing fragmented directory blocks like so: XFS: Assertion failed: done, file: fs/xfs/libxfs/xfs_dir2.c, line: 677 .... Call Trace: xfs_dir2_shrink_inode+0x1a8/0x210 xfs_dir2_block_to_sf+0x2ae/0x410 xfs_dir2_block_removename+0x21a/0x280 xfs_dir_removename+0x195/0x1d0 xfs_rename+0xb79/0xc50 ? avc_has_perm+0x8d/0x1a0 ? avc_has_perm_noaudit+0x9a/0x120 xfs_vn_rename+0xdb/0x150 vfs_rename+0x719/0xb50 ? __lookup_hash+0x6a/0xa0 do_renameat2+0x413/0x5e0 __x64_sys_rename+0x45/0x50 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae We are aborting the bunmapi() pass because of this specific chunk of code: /* * Make sure we don't touch multiple AGF headers out of order * in a single transaction, as that could cause AB-BA deadlocks. */ if (!wasdel && !isrt) { agno = XFS_FSB_TO_AGNO(mp, del.br_startblock); if (prev_agno != NULLAGNUMBER && prev_agno > agno) break; prev_agno = agno; } This is designed to prevent deadlocks in AGF locking when freeing multiple extents by ensuring that we only ever lock in increasing AG number order. Unfortunately, this also violates the "bunmapi will always succeed" semantic that some high level callers depend on, such as xfs_dir2_shrink_inode(), xfs_da_shrink_inode() and xfs_inactive_symlink_rmt(). This AG lock ordering was introduced back in 2017 to fix deadlocks triggered by generic/299 as reported here: https://lore.kernel.org/linux-xfs/800468eb-3ded-9166-20a4-047de8018582@gmail.com/ This codebase is old enough that it was before we were defering all AG based extent freeing from within xfs_bunmapi(). THat is, we never actually lock AGs in xfs_bunmapi() any more - every non-rt based extent free is added to the defer ops list, as is all BMBT block freeing. And RT extents are not RT based, so there's no lock ordering issues associated with them. Hence this AGF lock ordering code is both broken and dead. Let's just remove it so that the large directory block code works reliably again. Tested against xfs/538 and generic/299 which is the original test that exposed the deadlocks that this code fixed. Fixes: 5b094d6dac04 ("xfs: fix multi-AG deadlock in xfs_bunmapi") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27xfs: btree format inode forks can have zero extentsDave Chinner
xfs/538 is assert failing with this trace when testing with directory block sizes of 64kB: XFS: Assertion failed: !xfs_need_iread_extents(ifp), file: fs/xfs/libxfs/xfs_bmap.c, line: 608 .... Call Trace: xfs_bmap_btree_to_extents+0x2a9/0x470 ? kmem_cache_alloc+0xe7/0x220 __xfs_bunmapi+0x4ca/0xdf0 xfs_bunmapi+0x1a/0x30 xfs_dir2_shrink_inode+0x71/0x210 xfs_dir2_block_to_sf+0x2ae/0x410 xfs_dir2_block_removename+0x21a/0x280 xfs_dir_removename+0x195/0x1d0 xfs_remove+0x244/0x460 xfs_vn_unlink+0x53/0xa0 ? selinux_inode_unlink+0x13/0x20 vfs_unlink+0x117/0x220 do_unlinkat+0x1a2/0x2d0 __x64_sys_unlink+0x42/0x60 do_syscall_64+0x3a/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xae This is a check to ensure that the extents have been read into memory before we are doing a ifork btree manipulation. This assert is bogus in the above case. We have a fragmented directory block that has more extents in it than can fit in extent format, so the inode data fork is in btree format. xfs_dir2_shrink_inode() asks to remove all remaining 16 filesystem blocks from the inode so it can convert to short form, and __xfs_bunmapi() removes all the extents. We now have a data fork in btree format but have zero extents in the fork. This incorrectly trips the xfs_need_iread_extents() assert because it assumes that an empty extent btree means the extent tree has not been read into memory yet. This is clearly not the case with xfs_bunmapi(), as it has an explicit call to xfs_iread_extents() in it to pull the extents into memory before it starts unmapping. Also, the assert directly after this bogus one is: ASSERT(ifp->if_format == XFS_DINODE_FMT_BTREE); Which covers the context in which it is legal to call xfs_bmap_btree_to_extents just fine. Hence we should just remove the bogus assert as it is clearly wrong and causes a regression. The returns the test behaviour to the pre-existing assert failure in xfs_dir2_shrink_inode() that indicates xfs_bunmapi() has failed to remove all the extents in the range it was asked to unmap. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-05-27io_uring: fix data race to avoid potential NULL-derefio_uring-5.13-2021-05-28Marco Elver
Commit ba5ef6dc8a82 ("io_uring: fortify tctx/io_wq cleanup") introduced setting tctx->io_wq to NULL a bit earlier. This has caused KCSAN to detect a data race between accesses to tctx->io_wq: write to 0xffff88811d8df330 of 8 bytes by task 3709 on cpu 1: io_uring_clean_tctx fs/io_uring.c:9042 [inline] __io_uring_cancel fs/io_uring.c:9136 io_uring_files_cancel include/linux/io_uring.h:16 [inline] do_exit kernel/exit.c:781 do_group_exit kernel/exit.c:923 get_signal kernel/signal.c:2835 arch_do_signal_or_restart arch/x86/kernel/signal.c:789 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] ... read to 0xffff88811d8df330 of 8 bytes by task 6412 on cpu 0: io_uring_try_cancel_iowq fs/io_uring.c:8911 [inline] io_uring_try_cancel_requests fs/io_uring.c:8933 io_ring_exit_work fs/io_uring.c:8736 process_one_work kernel/workqueue.c:2276 ... With the config used, KCSAN only reports data races with value changes: this implies that in the case here we also know that tctx->io_wq was non-NULL. Therefore, depending on interleaving, we may end up with: [CPU 0] | [CPU 1] io_uring_try_cancel_iowq() | io_uring_clean_tctx() if (!tctx->io_wq) // false | ... ... | tctx->io_wq = NULL io_wq_cancel_cb(tctx->io_wq, ...) | ... -> NULL-deref | Note: It is likely that thus far we've gotten lucky and the compiler optimizes the double-read into a single read into a register -- but this is never guaranteed, and can easily change with a different config! Fix the data race by restoring the previous behaviour, where both setting io_wq to NULL and put of the wq are _serialized_ after concurrent io_uring_try_cancel_iowq() via acquisition of the uring_lock and removal of the node in io_uring_del_task_file(). Fixes: ba5ef6dc8a82 ("io_uring: fortify tctx/io_wq cleanup") Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: syzbot+bf2b3d0435b9b728946c@syzkaller.appspotmail.com Signed-off-by: Marco Elver <elver@google.com> Cc: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20210527092547.2656514-1-elver@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-27nfs: Remove trailing semicolon in macrosHuilong Deng
Macros should not use a trailing semicolon. Signed-off-by: Huilong Deng <denghuilong@cdjrlc.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-27NFSv4: Fix v4.0/v4.1 SEEK_DATA return -ENOTSUPP when set NFS_V4_2 configZhang Xiaoxu
Since commit bdcc2cd14e4e ("NFSv4.2: handle NFS-specific llseek errors"), nfs42_proc_llseek would return -EOPNOTSUPP rather than -ENOTSUPP when SEEK_DATA on NFSv4.0/v4.1. This will lead xfstests generic/285 not run on NFSv4.0/v4.1 when set the CONFIG_NFS_V4_2, rather than run failed. Fixes: bdcc2cd14e4e ("NFSv4.2: handle NFS-specific llseek errors") Cc: <stable.vger.kernel.org> # 4.2 Signed-off-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26io-wq: Fix UAF when wakeup wqe in hash waitqueueZqiang
BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 Read of size 8 at addr ffff8880304250d8 by task iou-wrk-28796/28802 Call Trace: __dump_stack [inline] dump_stack+0x141/0x1d7 print_address_description.constprop.0.cold+0x5b/0x2c6 __kasan_report [inline] kasan_report.cold+0x7c/0xd8 __wake_up_common+0x637/0x650 __wake_up_common_lock+0xd0/0x130 io_worker_handle_work+0x9dd/0x1790 io_wqe_worker+0xb2a/0xd40 ret_from_fork+0x1f/0x30 Allocated by task 28798: kzalloc_node [inline] io_wq_create+0x3c4/0xdd0 io_init_wq_offload [inline] io_uring_alloc_task_context+0x1bf/0x6b0 __io_uring_add_task_file+0x29a/0x3c0 io_uring_add_task_file [inline] io_uring_install_fd [inline] io_uring_create [inline] io_uring_setup+0x209a/0x2bd0 do_syscall_64+0x3a/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 28798: kfree+0x106/0x2c0 io_wq_destroy+0x182/0x380 io_wq_put [inline] io_wq_put_and_exit+0x7a/0xa0 io_uring_clean_tctx [inline] __io_uring_cancel+0x428/0x530 io_uring_files_cancel do_exit+0x299/0x2a60 do_group_exit+0x125/0x310 get_signal+0x47f/0x2150 arch_do_signal_or_restart+0x2a8/0x1eb0 handle_signal_work[inline] exit_to_user_mode_loop [inline] exit_to_user_mode_prepare+0x171/0x280 __syscall_exit_to_user_mode_work [inline] syscall_exit_to_user_mode+0x19/0x60 do_syscall_64+0x47/0xb0 entry_SYSCALL_64_after_hwframe There are the following scenarios, hash waitqueue is shared by io-wq1 and io-wq2. (note: wqe is worker) io-wq1:worker2 | locks bit1 io-wq2:worker1 | waits bit1 io-wq1:worker3 | waits bit1 io-wq1:worker2 | completes all wqe bit1 work items io-wq1:worker2 | drop bit1, exit io-wq2:worker1 | locks bit1 io-wq1:worker3 | can not locks bit1, waits bit1 and exit io-wq1 | exit and free io-wq1 io-wq2:worker1 | drops bit1 io-wq1:worker3 | be waked up, even though wqe is freed After all iou-wrk belonging to io-wq1 have exited, remove wqe form hash waitqueue, it is guaranteed that there will be no more wqe belonging to io-wq1 in the hash waitqueue. Reported-by: syzbot+6cb11ade52aa17095297@syzkaller.appspotmail.com Signed-off-by: Zqiang <qiang.zhang@windriver.com> Link: https://lore.kernel.org/r/20210526050826.30500-1-qiang.zhang@windriver.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-26NFS: Clean up reset of the mirror accounting variablesTrond Myklebust
Now that nfs_pageio_do_add_request() resets the pg_count, we don't need these other inlined resets. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26NFS: Don't corrupt the value of pg_bytes_written in nfs_do_recoalesce()Trond Myklebust
The value of mirror->pg_bytes_written should only be updated after a successful attempt to flush out the requests on the list. Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-26NFS: Fix an Oopsable condition in __nfs_pageio_add_request()Trond Myklebust
Ensure that nfs_pageio_error_cleanup() resets the mirror array contents, so that the structure reflects the fact that it is now empty. Also change the test in nfs_pageio_do_add_request() to be more robust by checking whether or not the list is empty rather than relying on the value of pg_count. Fixes: a7d42ddb3099 ("nfs: add mirroring support to pgio layer") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2021-05-25io_uring/io-wq: close io-wq full-stop gapPavel Begunkov
There is an old problem with io-wq cancellation where requests should be killed and are in io-wq but are not discoverable, e.g. in @next_hashed or @linked vars of io_worker_handle_work(). It adds some unreliability to individual request canellation, but also may potentially get __io_uring_cancel() stuck. For instance: 1) An __io_uring_cancel()'s cancellation round have not found any request but there are some as desribed. 2) __io_uring_cancel() goes to sleep 3) Then workers wake up and try to execute those hidden requests that happen to be unbound. As we already cancel all requests of io-wq there, set IO_WQ_BIT_EXIT in advance, so preventing 3) from executing unbound requests. The workers will initially break looping because of getting a signal as they are threads of the dying/exec()'ing user task. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/abfcf8c54cb9e8f7bfbad7e9a0cc5433cc70bdc2.1621781238.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-05-25proc: Check /proc/$pid/attr/ writes against file openerKees Cook
Fix another "confused deputy" weakness[1]. Writes to /proc/$pid/attr/ files need to check the opener credentials, since these fds do not transition state across execve(). Without this, it is possible to trick another process (which may have different credentials) to write to its own /proc/$pid/attr/ files, leading to unexpected and possibly exploitable behaviors. [1] https://www.kernel.org/doc/html/latest/security/credentials.html?highlight=confused#open-file-credentials Fixes: 1da177e4c3f41 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-25Merge tag 'netfs-lib-fixes-20200525' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull netfs fixes from David Howells: "A couple of fixes to the new netfs lib: - Pass the AOP flags through from netfs_write_begin() into grab_cache_page_write_begin(). - Automatically enable in Kconfig netfs lib rather than presenting an option for manual enablement" * tag 'netfs-lib-fixes-20200525' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual netfs: Pass flags through to grab_cache_page_write_begin()
2021-05-25afs: Fix fall-through warnings for ClangGustavo A. R. Silva
In preparation to enable -Wimplicit-fallthrough for Clang, fix multiple warnings by explicitly adding multiple fallthrough pseudo-keywords in places where the code is intended to fall through to the next case. Link: https://github.com/KSPP/linux/issues/115 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: linux-afs@lists.infradead.org cc: linux-hardening@vger.kernel.org Link: https://lore.kernel.org/r/51150b54e0b0431a2c401cd54f2c4e7f50e94601.1605896059.git.gustavoars@kernel.org/ # v1 Link: https://lore.kernel.org/r/20210420211615.GA51432@embeddedor/ # v2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-25netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manualDavid Howells
Make the netfs helper library selected automatically by the things that use it rather than being manually configured, even though it's required[1]. Fixes: 3a5829fefd3b ("netfs: Make a netfs helper module") Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/CAMuHMdXJZ7iNQE964CdBOU=vRKVMFzo=YF_eiwsGgqzuvZ+TuA@mail.gmail.com [1] Link: https://lore.kernel.org/r/162090298141.3166007.2971118149366779916.stgit@warthog.procyon.org.uk # v1
2021-05-25netfs: Pass flags through to grab_cache_page_write_begin()David Howells
In netfs_write_begin(), pass the AOP flags through to grab_cache_page_write_begin() so that a request to use GFP_NOFS is honoured. Fixes: e1b1240c1ff5 ("netfs: Add write_begin helper") Reported-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> cc: linux-mm@kvack.org cc: linux-cachefs@redhat.com cc: linux-afs@lists.infradead.org cc: linux-nfs@vger.kernel.org cc: linux-cifs@vger.kernel.org cc: ceph-devel@vger.kernel.org cc: v9fs-developer@lists.sourceforge.net cc: linux-fsdevel@vger.kernel.org Link: https://lore.kernel.org/r/162090295383.3165945.13595101698295243662.stgit@warthog.procyon.org.uk # v1
2021-05-24xfs: validate extsz hints against rt extent size when rtinherit is setDarrick J. Wong
The RTINHERIT bit can be set on a directory so that newly created regular files will have the REALTIME bit set to store their data on the realtime volume. If an extent size hint (and EXTSZINHERIT) are set on the directory, the hint will also be copied into the new file. As pointed out in previous patches, for realtime files we require the extent size hint be an integer multiple of the realtime extent, but we don't perform the same validation on a directory with both RTINHERIT and EXTSZINHERIT set, even though the only use-case of that combination is to propagate extent size hints into new realtime files. This leads to inode corruption errors when the bad values are propagated. Because there may be existing filesystems with such a configuration, we cannot simply amend the inode verifier to trip on these directories and call it a day because that will cause previously "working" filesystems to start throwing errors abruptly. Note that it's valid to have directories with rtinherit set even if there is no realtime volume, in which case the problem does not manifest because rtinherit is ignored if there's no realtime device; and it's possible that someone set the flag, crashed, repaired the filesystem (which clears the hint on the realtime file) and continued. Therefore, mitigate this issue in several ways: First, if we try to write out an inode with both rtinherit/extszinherit set and an unaligned extent size hint, turn off the hint to correct the error. Second, if someone tries to misconfigure a directory via the fssetxattr ioctl, fail the ioctl. Third, reverify both extent size hint values when we propagate heritable inode attributes from parent to child, to prevent misconfigurations from spreading. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-05-24xfs: standardize extent size hint validationDarrick J. Wong
While chasing a bug involving invalid extent size hints being propagated into newly created realtime files, I noticed that the xfs_ioctl_setattr checks for the extent size hints weren't the same as the ones now encoded in libxfs and used for validation in repair and mkfs. Because the checks in libxfs are more stringent than the ones in the ioctl, it's possible for a live system to set inode flags that immediately result in corruption warnings. Specifically, it's possible to set an extent size hint on an rtinherit directory without checking if the hint is aligned to the realtime extent size, which makes no sense since that combination is used only to seed new realtime files. Replace the open-coded and inadequate checks with the libxfs verifier versions and update the code comments a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-05-24xfs: check free AG space when making per-AG reservationsDarrick J. Wong
The new online shrink code exposed a gap in the per-AG reservation code, which is that we only return ENOSPC to callers if the entire fs doesn't have enough free blocks. Except for debugging mode, the reservation init code doesn't ever check that there's enough free space in that AG to cover the reservation. Not having enough space is not considered an immediate fatal error that requires filesystem offlining because (a) it's shouldn't be possible to wind up in that state through normal file operations and (b) even if one did, freeing data blocks would recover the situation. However, online shrink now needs to know if shrinking would not leave enough space so that it can abort the shrink operation. Hence we need to promote this assertion into an actual error return. Observed by running xfs/168 with a 1k block size, though in theory this could happen with any configuration. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
2021-05-22userfaultfd: hugetlbfs: fix new flag usage in error pathMike Kravetz
In commit d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") the use of PagePrivate to indicate a reservation count should be restored at free time was changed to the hugetlb specific flag HPageRestoreReserve. Changes to a userfaultfd error path as well as a VM_BUG_ON() in remove_inode_hugepages() were overlooked. Users could see incorrect hugetlb reserve counts if they experience an error with a UFFDIO_COPY operation. Specifically, this would be the result of an unlikely copy_huge_page_from_user error. There is not an increased chance of hitting the VM_BUG_ON. Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mina Almasry <almasry.mina@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-05-22Merge tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: - Fix BLKRRPART and deletion race (Gulam, Christoph) - NVMe pull request (Christoph): - nvme-tcp corruption and timeout fixes (Sagi Grimberg, Keith Busch) - nvme-fc teardown fix (James Smart) - nvmet/nvme-loop memory leak fixes (Wu Bo)" * tag 'block-5.13-2021-05-22' of git://git.kernel.dk/linux-block: block: fix a race between del_gendisk and BLKRRPART block: prevent block device lookups at the beginning of del_gendisk nvme-fc: clear q_live at beginning of association teardown nvme-tcp: rerun io_work if req_list is not empty nvme-tcp: fix possible use-after-completion nvme-loop: fix memory leak in nvme_loop_create_ctrl() nvmet: fix memory leak in nvmet_alloc_ctrl()