summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_trans.c
AgeCommit message (Collapse)Author
2021-06-08xfs: convert log vector chain to use list headsDave Chinner
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily with a single linked list. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-04xfs: log tickets don't need log client idDave Chinner
We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount records, and so anything using a high level transaction allocated through xfs_trans_alloc() does not need a log ticket client ID to be set. For the CIL checkpoint, the client ID written to the journal is always XFS_TRANSACTION, and for the unmount record it is always XFS_LOG, and nothing else writes to the log. All of these operations tell xlog_write() exactly what they need to write to the log (the optype) and build their own opheaders for start, commit and unmount records. Hence we no longer need to set the client id in either the log ticket or the xfs_trans. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-04xfs: AIL needs asynchronous CIL forcingDave Chinner
The AIL pushing is stalling on log forces when it comes across pinned items. This is happening on removal workloads where the AIL is dominated by stale items that are removed from AIL when the checkpoint that marks the items stale is committed to the journal. This results is relatively few items in the AIL, but those that are are often pinned as directories items are being removed from are still being logged. As a result, many push cycles through the CIL will first issue a blocking log force to unpin the items. This can take some time to complete, with tracing regularly showing push delays of half a second and sometimes up into the range of several seconds. Sequences like this aren't uncommon: .... 399.829437: xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 270ms delay> 400.099622: xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0 400.099623: xfsaild: first lsn 0x11002f3600 400.099679: xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50 <wanted 50ms, got 500ms delay> 400.589348: xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0 400.589349: xfsaild: first lsn 0x1100305000 400.589595: xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50 <wanted 50ms, got 460ms delay> 400.950341: xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0 400.950343: xfsaild: first lsn 0x1100317c00 400.950436: xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20 <wanted 20ms, got 200ms delay> 401.142333: xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0 401.142334: xfsaild: first lsn 0x110032e600 401.142535: xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10 <wanted 10ms, got 10ms delay> 401.154323: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000 401.154328: xfsaild: first lsn 0x1100353000 401.154389: xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20 <wanted 20ms, got 300ms delay> 401.451525: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 401.451526: xfsaild: first lsn 0x1100353000 401.451804: xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50 <wanted 50ms, got 500ms delay> 401.933581: xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0 .... In each of these cases, every AIL pass saw 101 log items stuck on the AIL (pinned) with very few other items being found. Each pass, a log force was issued, and delay between last/first is the sleep time + the sync log force time. Some of these 101 items pinned the tail of the log. The tail of the log does slowly creep forward (first lsn), but the problem is that the log is actually out of reservation space because it's been running so many transactions that stale items that never reach the AIL but consume log space. Hence we have a largely empty AIL, with long term pins on items that pin the tail of the log that don't get pushed frequently enough to keep log space available. The problem is the hundreds of milliseconds that we block in the log force pushing the CIL out to disk. The AIL should not be stalled like this - it needs to run and flush items that are at the tail of the log with minimal latency. What we really need to do is trigger a log flush, but then not wait for it at all - we've already done our waiting for stuff to complete when we backed off prior to the log force being issued. Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we still do a blocking flush of the CIL and that is what is causing the issue. Hence we need a new interface for the CIL to trigger an immediate background push of the CIL to get it moving faster but not to wait on that to occur. While the CIL is pushing, the AIL can also be pushing. We already have an internal interface to do this - xlog_cil_push_now() - but we need a wrapper for it to be used externally. xlog_cil_force_seq() can easily be extended to do what we need as it already implements the synchronous CIL push via xlog_cil_push_now(). Add the necessary flags and "push current sequence" semantics to xlog_cil_force_seq() and convert the AIL pushing to use it. One of the complexities here is that the CIL push does not guarantee that the commit record for the CIL checkpoint is written to disk. The current log force ensures this by submitting the current ACTIVE iclog that the commit record was written to. We need the CIL to actually write this commit record to disk for an async push to ensure that the checkpoint actually makes it to disk and unpins the pinned items in the checkpoint on completion. Hence we need to pass down to the CIL push that we are doing an async flush so that it can switch out the commit_iclog if necessary to get written to disk when the commit iclog is finally released. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-06-04xfs: xfs_log_force_lsn isn't passed a LSNDave Chinner
In doing an investigation into AIL push stalls, I was looking at the log force code to see if an async CIL push could be done instead. This lead me to xfs_log_force_lsn() and looking at how it works. xfs_log_force_lsn() is only called from inode synchronisation contexts such as fsync(), and it takes the ip->i_itemp->ili_last_lsn value as the LSN to sync the log to. This gets passed to xlog_cil_force_lsn() via xfs_log_force_lsn() to flush the CIL to the journal, and then used by xfs_log_force_lsn() to flush the iclogs to the journal. The problem is that ip->i_itemp->ili_last_lsn does not store a log sequence number. What it stores is passed to it from the ->iop_committing method, which is called by xfs_log_commit_cil(). The value this passes to the iop_committing method is the CIL context sequence number that the item was committed to. As it turns out, xlog_cil_force_lsn() converts the sequence to an actual commit LSN for the related context and returns that to xfs_log_force_lsn(). xfs_log_force_lsn() overwrites it's "lsn" variable that contained a sequence with an actual LSN and then uses that to sync the iclogs. This caused me some confusion for a while, even though I originally wrote all this code a decade ago. ->iop_committing is only used by a couple of log item types, and only inode items use the sequence number it is passed. Let's clean up the API, CIL structures and inode log item to call it a sequence number, and make it clear that the high level code is using CIL sequence numbers and not on-disk LSNs for integrity synchronisation purposes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2021-04-29xfs: update superblock counters correctly for !lazysbcountDave Chinner
Keep the mount superblock counters up to date for !lazysbcount filesystems so that when we log the superblock they do not need updating in any way because they are already correct. It's found by what Zorro reported: 1. mkfs.xfs -f -l lazy-count=0 -m crc=0 $dev 2. mount $dev $mnt 3. fsstress -d $mnt -p 100 -n 1000 (maybe need more or less io load) 4. umount $mnt 5. xfs_repair -n $dev and I've seen no problem with this patch. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reported-by: Zorro Lang <zlang@redhat.com> Reviewed-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-29xfs: remove obsolete AGF counter debuggingDarrick J. Wong
In commit f8f2835a9cf3 we changed the behavior of XFS to use EFIs to remove blocks from an overfilled AGFL because there were complaints about transaction overruns that stemmed from trying to free multiple blocks in a single transaction. Unfortunately, that commit missed a subtlety in the debug-mode transaction accounting when a realtime volume is attached. If a realtime file undergoes a data fork mapping change such that realtime extents are allocated (or freed) in the same transaction that a data device block is also allocated (or freed), we can trip a debugging assertion. This can happen (for example) if a realtime extent is allocated and it is necessary to reshape the bmbt to hold the new mapping. When we go to allocate a bmbt block from an AG, the first thing the data device block allocator does is ensure that the freelist is the proper length. If the freelist is too long, it will trim the freelist to the proper length. In debug mode, trimming the freelist calls xfs_trans_agflist_delta() to record the decrement in the AG free list count. Prior to f8f28 we would put the free block back in the free space btrees in the same transaction, which calls xfs_trans_agblocks_delta() to record the increment in the AG free block count. Since AGFL blocks are included in the global free block count (fdblocks), there is no corresponding fdblocks update, so the AGFL free satisfies the following condition in xfs_trans_apply_sb_deltas: /* * Check that superblock mods match the mods made to AGF counters. */ ASSERT((tp->t_fdblocks_delta + tp->t_res_fdblocks_delta) == (tp->t_ag_freeblks_delta + tp->t_ag_flist_delta + tp->t_ag_btree_delta)); The comparison here used to be: (X + 0) == ((X+1) + -1 + 0), where X is the number blocks that were allocated. After commit f8f28 we defer the block freeing to the next chained transaction, which means that the calls to xfs_trans_agflist_delta and xfs_trans_agblocks_delta occur in separate transactions. The (first) transaction that shortens the free list trips on the comparison, which has now become: (X + 0) == ((X) + -1 + 0) because we haven't freed the AGFL block yet; we've only logged an intention to free it. When the second transaction (the deferred free) commits, it will evaluate the expression as: (0 + 0) == (1 + 0 + 0) and trip over that in turn. At this point, the astute reader may note that the two commits tagged by this patch have been in the kernel for a long time but haven't generated any bug reports. How is it that the author became aware of this bug? This originally surfaced as an intermittent failure when I was testing realtime rmap, but a different bug report by Zorro Lang reveals the same assertion occuring on !lazysbcount filesystems. The common factor to both reports (and why this problem wasn't previously reported) becomes apparent if we consider when xfs_trans_apply_sb_deltas is called by __xfs_trans_commit(): if (tp->t_flags & XFS_TRANS_SB_DIRTY) xfs_trans_apply_sb_deltas(tp); With a modern lazysbcount filesystem, transactions update only the percpu counters, so they don't need to set XFS_TRANS_SB_DIRTY, hence xfs_trans_apply_sb_deltas is rarely called. However, updates to the count of free realtime extents are not part of lazysbcount, so XFS_TRANS_SB_DIRTY will be set on transactions adding or removing data fork mappings to realtime files; similarly, XFS_TRANS_SB_DIRTY is always set on !lazysbcount filesystems. Dave mentioned in response to an earlier version of this patch: "IIUC, what you are saying is that this debug code is simply not exercised in normal testing and hasn't been for the past decade? And it still won't be exercised on anything other than realtime device testing? "...it was debugging code from 1994 that was largely turned into dead code when lazysbcounters were introduced in 2007. Hence I'm not sure it holds any value anymore." This debugging code isn't especially helpful - you can modify the flcount on one AG and the freeblks of another AG, and it won't trigger. Add the fact that nobody noticed for a decade, and let's just get rid of it (and start testing realtime :P). This bug was found by running generic/051 on either a V4 filesystem lacking lazysbcount; or a V5 filesystem with a realtime volume. Cc: bfoster@redhat.com, zlang@redhat.com Fixes: f8f2835a9cf3 ("xfs: defer agfl block frees when dfops is available") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-04-07xfs: move the di_nblocks field to struct xfs_inodeChristoph Hellwig
In preparation of removing the historic icinode struct, move the nblocks field into the containing xfs_inode structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: support shrinking unused space in the last AGGao Xiang
As the first step of shrinking, this attempts to enable shrinking unused space in the last allocation group by fixing up freespace btree, agi, agf and adjusting super block and use a helper xfs_ag_shrink_space() to fixup the last AG. This can be all done in one transaction for now, so I think no additional protection is needed. Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: __percpu_counter_compare() inode count debug too expensiveDave Chinner
- 21.92% __xfs_trans_commit - 21.62% xfs_log_commit_cil - 11.69% xfs_trans_unreserve_and_mod_sb - 11.58% __percpu_counter_compare - 11.45% __percpu_counter_sum - 10.29% _raw_spin_lock_irqsave - 10.28% do_raw_spin_lock __pv_queued_spin_lock_slowpath We debated just getting rid of it last time this came up and there was no real objection to removing it. Now it's the biggest scalability limitation for debug kernels even on smallish machines, so let's just get rid of it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-02-25xfs: use current->journal_info for detecting transaction recursionxfs-5.12-merge-6Dave Chinner
Because the iomap code using PF_MEMALLOC_NOFS to detect transaction recursion in XFS is just wrong. Remove it from the iomap code and replace it with XFS specific internal checks using current->journal_info instead. [djwong: This change also realigns the lifetime of NOFS flag changes to match the incore transaction, instead of the inconsistent scheme we have now.] Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS") Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-25xfs: don't nest transactions when scanning for eofblocksDarrick J. Wong
Brian Foster reported a lockdep warning on xfs/167: ============================================ WARNING: possible recursive locking detected 5.11.0-rc4 #35 Tainted: G W I -------------------------------------------- fsstress/17733 is trying to acquire lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs] but task is already holding lock: ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs] stack backtrace: CPU: 38 PID: 17733 Comm: fsstress Tainted: G W I 5.11.0-rc4 #35 Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018 Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x159/0x2ab lock_acquire+0x116/0x370 xfs_trans_alloc+0x1ad/0x310 [xfs] xfs_free_eofblocks+0x104/0x1d0 [xfs] xfs_blockgc_scan_inode+0x24/0x60 [xfs] xfs_inode_walk_ag+0x202/0x4b0 [xfs] xfs_inode_walk+0x66/0xc0 [xfs] xfs_trans_alloc+0x160/0x310 [xfs] xfs_trans_alloc_inode+0x5f/0x160 [xfs] xfs_alloc_file_space+0x105/0x300 [xfs] xfs_file_fallocate+0x270/0x460 [xfs] vfs_fallocate+0x14d/0x3d0 __x64_sys_fallocate+0x3e/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The cause of this is the new code that spurs a scan to garbage collect speculative preallocations if we fail to reserve enough blocks while allocating a transaction. While the warning itself is a fairly benign lockdep complaint, it does expose a potential livelock if the rwsem behavior ever changes with regards to nesting read locks when someone's waiting for a write lock. Fix this by freeing the transaction and jumping back to xfs_trans_alloc like this patch in the V4 submission[1]. [1] https://lore.kernel.org/linux-xfs/161142798066.2171939.9311024588681972086.stgit@magnolia/ Fixes: a1a7d05a0576 ("xfs: flush speculative space allocations when we run out of space") Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: flush speculative space allocations when we run out of spaceDarrick J. Wong
If a fs modification (creation, file write, reflink, etc.) is unable to reserve enough space to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: flush eof/cowblocks if we can't reserve quota for chownDarrick J. Wong
If a file user, group, or project change is unable to reserve enough quota to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: flush eof/cowblocks if we can't reserve quota for inode creationDarrick J. Wong
If an inode creation is unable to reserve enough quota to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: flush eof/cowblocks if we can't reserve quota for file blocksDarrick J. Wong
If a fs modification (data write, reflink, xattr set, fallocate, etc.) is unable to reserve enough quota to handle the modification, try clearing whatever space the filesystem might have been hanging onto in the hopes of speeding up the filesystem. The flushing behavior will become particularly important when we add deferred inode inactivation because that will increase the amount of space that isn't actively tied to user data. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: remove xfs_qm_vop_chown_reserveDarrick J. Wong
Now that the only caller of this function is xfs_trans_alloc_ichange, just open-code the meat of _chown_reserve in that caller. Drop the (redundant) [ugp]id checks because xfs has a 1:1 relationship between quota ids and incore dquots. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: refactor inode ownership change transaction/inode/quota allocation idiomDarrick J. Wong
For file ownership (uid, gid, prid) changes, create a new helper xfs_trans_alloc_ichange that allocates a transaction and reserves the appropriate amount of quota against that transction in preparation for a change of user, group, or project id. Replace all the open-coded idioms with a single call to this helper so that we can contain the retry loops in the next patchset. This changes the locking behavior for ichange transactions slightly. Since tr_ichange does not have a permanent reservation and cannot roll, we pass XFS_ILOCK_EXCL to ijoin so that the inode will be unlocked automatically at commit time. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: refactor inode creation transaction/inode/quota allocation idiomDarrick J. Wong
For file creation, create a new helper xfs_trans_alloc_icreate that allocates a transaction and reserves the appropriate amount of quota against that transction. Replace all the open-coded idioms with a single call to this helper so that we can contain the retry loops in the next patchset. This changes the locking behavior for non-tempfile creation slightly, in that we now make the quota reservation without holding the directory ILOCK. While the dquots chosen for inode creation are based on the directory state at a given point in time, the directory ILOCK was released as soon as the dquot references are picked up. Hence it was never necessary to hold the directory ILOCK for the quota reservation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: allow reservation of rtblocks with xfs_trans_alloc_inodeDarrick J. Wong
Make it so that we can reserve rt blocks with the xfs_trans_alloc_inode wrapper function, then convert a few more callsites. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2021-02-03xfs: refactor common transaction/inode/quota allocation idiomDarrick J. Wong
Create a new helper xfs_trans_alloc_inode that allocates a transaction, locks and joins an inode to it, and then reserves the appropriate amount of quota against that transction. Then replace all the open-coded idioms with a single call to this helper. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-12-16xfs: remove xfs_buf_t typedefxfs-5.11-merge-4Dave Chinner
Prepare for kernel xfs_buf alignment by getting rid of the xfs_buf_t typedef from userspace. [darrick: This patch is a port of a userspace patch removing the xfs_buf_t typedef in preparation to make the userspace xfs_buf code behave more like its kernel counterpart.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-25xfs: do the assert for all the log done items in xfs_trans_cancelKaixu Xia
We should do the assert for all the log intent-done items if they appear here. This patch detect intent-done items by the fact that their item ops don't have iop_unpin and iop_push methods and also move the helper xlog_item_is_intent to xfs_trans.h. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15xfs: simplify xfs_trans_getsbChristoph Hellwig
Remove the mp argument as this function is only called in transaction context, and open code xfs_getsb given that the function already accesses the buffer pointer in the mount point directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-28xfs: Remove kmem_zone_zalloc() usageCarlos Maiolino
Use kmem_cache_zalloc() directly. With the exception of xlog_ticket_alloc() which will be dealt on the next patch for readability. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-07-06xfs: preserve rmapbt swapext block reservation from freed blocksBrian Foster
The rmapbt extent swap algorithm remaps individual extents between the source inode and the target to trigger reverse mapping metadata updates. If either inode straddles a format or other bmap allocation boundary, the individual unmap and map cycles can trigger repeated bmap block allocations and frees as the extent count bounces back and forth across the boundary. While net block usage is bound across the swap operation, this behavior can prematurely exhaust the transaction block reservation because it continuously drains as the transaction rolls. Each allocation accounts against the reservation and each free returns to global free space on transaction roll. The previous workaround to this problem attempted to detect this boundary condition and provide surplus block reservation to acommodate it. This is insufficient because more remaps can occur than implied by the extent counts; if start offset boundaries are not aligned between the two inodes, for example. To address this problem more generically and dynamically, add a transaction accounting mode that returns freed blocks to the transaction reservation instead of the superblock counters on transaction roll and use it when the rmapbt based algorithm is active. This allows the chain of remap transactions to preserve the block reservation based own its own frees and prevent premature exhaustion regardless of the remap pattern. Note that this is only safe for superblocks with lazy sb accounting, but the latter is required for v5 supers and the rmap feature depends on v5. Fixes: b3fed434822d0 ("xfs: account format bouncing into rmapbt swapext tx reservation") Root-caused-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: remove the m_active_trans counterDave Chinner
It's a global atomic counter, and we are hitting it at a rate of half a million transactions a second, so it's bouncing the counter cacheline all over the place on large machines. We don't actually need it anymore - it used to be required because the VFS freeze code could not track/prevent filesystem transactions that were running, but that problem no longer exists. Hence to remove the counter, we simply have to ensure that nothing calls xfs_sync_sb() while we are trying to quiesce the filesytem. That only happens if the log worker is still running when we call xfs_quiesce_attr(). The log worker is cancelled at the end of xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it early here and then we can remove the counter altogether. Concurrent create, 50 million inodes, identical 16p/16GB virtual machines on different physical hosts. Machine A has twice the CPU cores per socket of machine B: unpatched patched machine A: 3m16s 2m00s machine B: 4m04s 4m05s Create rates: unpatched patched machine A: 282k+/-31k 468k+/-21k machine B: 231k+/-8k 233k+/-11k Concurrent rm of same 50 million inodes: unpatched patched machine A: 6m42s 2m33s machine B: 4m47s 4m47s The transaction rate on the fast machine went from just under 300k/sec to 700k/sec, which indicates just how much of a bottleneck this atomic counter was. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: reduce free inode accounting overheadDave Chinner
Shaokun Zhang reported that XFS was using substantial CPU time in percpu_count_sum() when running a single threaded benchmark on a high CPU count (128p) machine from xfs_mod_ifree(). The issue is that the filesystem is empty when the benchmark runs, so inode allocation is running with a very low inode free count. With the percpu counter batching, this means comparisons when the counter is less that 128 * 256 = 32768 use the slow path of adding up all the counters across the CPUs, and this is expensive on high CPU count machines. The summing in xfs_mod_ifree() is only used to fire an assert if an underrun occurs. The error is ignored by the higher level code. Hence this is really just debug code and we don't need to run it on production kernels, nor do we need such debug checks to return error values just to trigger an assert. Finally, xfs_mod_icount/xfs_mod_ifree are only called from xfs_trans_unreserve_and_mod_sb(), so get rid of them and just directly call the percpu_counter_add/percpu_counter_compare functions. The compare functions are now run only on debug builds as they are internal to ASSERT() checks and so only compiled in when ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y). Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: gut error handling in xfs_trans_unreserve_and_mod_sb()Dave Chinner
xfs: gut error handling in xfs_trans_unreserve_and_mod_sb() From: Dave Chinner <dchinner@redhat.com> The error handling in xfs_trans_unreserve_and_mod_sb() is largely incorrect - rolling back the changes in the transaction if only one counter underruns makes all the other counters incorrect. We still allow the change to proceed and committing the transaction, except now we have multiple incorrect counters instead of a single underflow. Further, we don't actually report the error to the caller, so this is completely silent except on debug kernels that will assert on failure before we even get to the rollback code. Hence this error handling is broken, untested, and largely unnecessary complexity. Just remove it. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-27xfs: split xlog_ticket_doneChristoph Hellwig
Remove xlog_ticket_done and just call the renamed low-level helpers for ungranting or regranting log space directly. To make that a little the reference put on the ticket and all tracing is moved into the actual helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-27xfs: refactor and split xfs_log_done()Dave Chinner
xfs_log_done() does two separate things. Firstly, it triggers commit records to be written for permanent transactions, and secondly it releases or regrants transaction reservation space. Since delayed logging was introduced, transactions no longer write directly to the log, hence they never have the XLOG_TIC_INITED flag cleared on them. Hence transactions never write commit records to the log and only need to modify reservation space. Split up xfs_log_done into two parts, and only call the parts of the operation needed for the context xfs_log_done() is currently being called from. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-03-26xfs: prohibit fs freezing when using empty transactionsxfs-5.7-merge-8Darrick J. Wong
I noticed that fsfreeze can take a very long time to freeze an XFS if there happens to be a GETFSMAP caller running in the background. I also happened to notice the following in dmesg: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs] Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs] CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs] Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de <0f> 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54 RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000 R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8 FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0 Call Trace: xfs_fs_freeze+0x25/0x40 [xfs] freeze_super+0xc8/0x180 do_vfs_ioctl+0x70b/0x750 ? __fget_files+0x135/0x210 ksys_ioctl+0x3a/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x1a0 entry_SYSCALL_64_after_hwframe+0x49/0xbe These two things appear to be related. The assertion trips when another thread initiates a fsmap request (which uses an empty transaction) after the freezer waited for m_active_trans to hit zero but before the the freezer executes the WARN_ON just prior to calling xfs_log_quiesce. The lengthy delays in freezing happen because the freezer calls xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the GETFSMAP caller is continuing to grab and release buffers, which means that it can take a very long time for the buffer lru list to empty out. We fix both of these races by calling sb_start_write to obtain freeze protection while using empty transactions for GETFSMAP and for metadata scrubbing. The other two users occur during mount, during which time we cannot fs freeze. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-03-11xfs: remove XFS_BUF_TO_SBPxfs-5.7-merge-2Christoph Hellwig
Just dereference bp->b_addr directly and make the code a little simpler and more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-18xfs: Remove kmem_zone_free() wrapperCarlos Maiolino
We can remove it now, without needing to rework the KM_ flags. Use kmem_cache_free() directly. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26fs: xfs: Remove KM_NOSLEEP and KM_SLEEP.Tetsuo Handa
Since no caller is using KM_NOSLEEP and no callee branches on KM_SLEEP, we can remove KM_NOSLEEP and replace KM_SLEEP with 0. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-30xfs: remove XFS_TRANS_NOFSxfs-5.3-merge-6Christoph Hellwig
Instead of a magic flag for xfs_trans_alloc, just ensure all callers that can't relclaim through the file system use memalloc_nofs_save to set the per-task nofs flag. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: remove unused header filesEric Sandeen
There are many, many xfs header files which are included but unneeded (or included twice) in the xfs code, so remove them. nb: xfs_linux.h includes about 9 headers for everyone, so those explicit includes get removed by this. I'm not sure what the preference is, but if we wanted explicit includes everywhere, a followup patch could remove those xfs_*.h includes from xfs_linux.h and move them into the files that need them. Or it could be left as-is. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: add a flag to release log items on commitChristoph Hellwig
We have various items that are released from ->iop_comitting. Add a flag to just call ->iop_release from the commit path to avoid tons of boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: split iop_unlockChristoph Hellwig
The iop_unlock method is called when comitting or cancelling a transaction. In the latter case, the transaction may or may not be aborted. While there is no known problem with the current code in practice, this implementation is limited in that any log item implementation that might want to differentiate between a commit and a cancellation must rely on the aborted state. The aborted bit is only set when the cancelled transaction is dirty, however. This means that there is no way to distinguish between a commit and a clean transaction cancellation. For example, intent log items currently rely on this distinction. The log item is either transferred to the CIL on commit or released on transaction cancel. There is currently no possibility for a clean intent log item in a transaction, but if that state is ever introduced a cancel of such a transaction will immediately result in memory leaks of the associated log item(s). This is an interface deficiency and landmine. To clean this up, replace the iop_unlock method with an iop_release method that is specific to transaction cancel. The existing iop_committing method occurs at the same time as iop_unlock in the commit path and there is no need for two separate callbacks here. Overload the iop_committing method with the current commit time iop_unlock implementations to eliminate the need for the latter and further simplify the interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: don't use xfs_trans_free_items in the commit pathChristoph Hellwig
While commiting items looks very similar to freeing them on error it is a different operation, and they will diverge a bit soon. Split out the commit case from xfs_trans_free_items, inline it into xfs_log_commit_cil and give it a separate trace point. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: don't require log items to implement optional methodsChristoph Hellwig
Just check if they are present first. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: stop using XFS_LI_ABORTED as a parameter flagChristoph Hellwig
Just pass a straight bool aborted instead of abusing XFS_LI_ABORTED as a flag in function parameters. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: fix a trivial comment typo in xfs_trans_committed_bulkChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-12xfs: remove unused flags arg from getsb interfacesEric Sandeen
The flags value is always passed as 0 so remove the argument. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-09-29xfs: avoid lockdep false positives in xfs_trans_allocDave Chinner
We've had a few reports of lockdep tripping over memory reclaim context vs filesystem freeze "deadlocks". They all have looked to be false positives on analysis, but it seems that they are being tripped because we take freeze references before we run a GFP_KERNEL allocation for the struct xfs_trans. We can avoid this false positive vector just by re-ordering the operations in xfs_trans_alloc(). That is. we need allocate the structure before we take the freeze reference and enter the GFP_NOFS allocation context that follows the xfs_trans around. This prevents lockdep from seeing the GFP_KERNEL allocation inside the transaction context, and that prevents it from triggering the freeze level vs alloc context vs reclaim warnings. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2018-08-02xfs: fold dfops into the transactionBrian Foster
struct xfs_defer_ops has now been reduced to a single list_head. The external dfops mechanism is unused and thus everywhere a (permanent) transaction is accessible the associated dfops structure is as well. Remove the xfs_defer_ops structure and fold the list_head into the transaction. Also remove the last remnant of external dfops in xfs_trans_dup(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02xfs: replace xfs_defer_ops ->dop_pending with on-stack listBrian Foster
The xfs_defer_ops ->dop_pending list is used to track active deferred operations once intents are logged. These items must be aborted in the event of an error. The list is populated as intents are logged and items are removed as they complete (or are aborted). Now that xfs_defer_finish() cancels on error, there is no need to ever access ->dop_pending outside of xfs_defer_finish(). The list is only ever populated after xfs_defer_finish() begins and is either completed or cancelled before it returns. Remove ->dop_pending from xfs_defer_ops and replace it with a local list in the xfs_defer_finish() path. Pass the local list to the various helpers now that it is not accessible via dfops. Note that we have to check for NULL in the abort case as the final tx roll occurs outside of the scope of the new local list (once the dfops has completed and thus drained the list). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02xfs: cancel dfops on xfs_defer_finish() errorBrian Foster
The current semantics of xfs_defer_finish() require the caller to call xfs_defer_cancel() on error. This is slightly inconsistent with transaction commit error handling where a failed commit cleans up the transaction before returning. More significantly, the only requirement for exposure of ->dop_pending outside of xfs_defer_finish() is so that xfs_defer_cancel() can drain it on error. Since the only recourse of xfs_defer_finish() errors is cancellation, mirror the transaction logic and cancel remaining dfops before returning from xfs_defer_finish() with an error. Beside simplifying xfs_defer_finish() semantics, this ensures that xfs_defer_finish() always returns with an empty ->dop_pending and thus facilitates removal of the list from xfs_defer_ops. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02xfs: clean out superfluous dfops dop params/varsBrian Foster
The dfops code still passes around the xfs_defer_ops pointer superfluously in a few places. Clean this up wherever the transaction will suffice. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02xfs: pass transaction to dfops reset/move helpersBrian Foster
All callers pass ->t_dfops of the associated transactions. Refactor the helpers to receive the transactions and facilitate further cleanups between xfs_defer_ops and xfs_trans. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2018-08-02xfs: remove unused __xfs_defer_cancel() internal helperBrian Foster
With no more external dfops users, there is no need for an xfs_defer_ops cancel wrapper. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>