summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-10-07xfs: fix the indent in xfs_trans_mod_dquotxfs-5.10-merge_2020-10-10xfs-5.10-merge_2020-10-08xfs-5.10-merge_2020-10-07xfs-5.10-merge-3Kaixu Xia
The formatting is strange in xfs_trans_mod_dquot, so do a reindent. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-07xfs: do the ASSERT for the arguments O_{u,g,p}dqppKaixu Xia
If we pass in XFS_QMOPT_{U,G,P}QUOTA flags and different uid/gid/prid than them currently associated with the inode, the arguments O_{u,g,p}dqpp shouldn't be NULL, so add the ASSERT for them. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-10-07xfs: fix deadlock and streamline xfs_getfsmap performanceDarrick J. Wong
Refactor xfs_getfsmap to improve its performance: instead of indirectly calling a function that copies one record to userspace at a time, create a shadow buffer in the kernel and copy the whole array once at the end. On the author's computer, this reduces the runtime on his /home by ~20%. This also eliminates a deadlock when running GETFSMAP against the realtime device. The current code locks the rtbitmap to create fsmappings and copies them into userspace, having not released the rtbitmap lock. If the userspace buffer is an mmap of a sparse file that itself resides on the realtime device, the write page fault will recurse into the fs for allocation, which will deadlock on the rtbitmap lock. Fixes: 4c934c7dd60c ("xfs: report realtime space information via the rtbitmap") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-07xfs: limit entries returned when counting fsmap recordsDarrick J. Wong
If userspace asked fsmap to count the number of entries, we cannot return more than UINT_MAX entries because fmh_entries is u32. Therefore, stop counting if we hit this limit or else we will waste time to return truncated results. Fixes: e89c041338ed ("xfs: implement the GETFSMAP ioctl") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2020-10-07xfs: only relog deferred intent items if free space in the log gets lowDarrick J. Wong
Now that we have the ability to ask the log how far the tail needs to be pushed to maintain its free space targets, augment the decision to relog an intent item so that we only do it if the log has hit the 75% full threshold. There's no point in relogging an intent into the same checkpoint, and there's no need to relog if there's plenty of free space in the log. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: expose the log push thresholdDarrick J. Wong
Separate the computation of the log push threshold and the push logic in xlog_grant_push_ail. This enables higher level code to determine (for example) that it is holding on to a logged intent item and the log is so busy that it is more than 75% full. In that case, it would be desirable to move the log item towards the head to release the tail, which we will cover in the next patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: periodically relog deferred intent itemsDarrick J. Wong
There's a subtle design flaw in the deferred log item code that can lead to pinning the log tail. Taking up the defer ops chain examples from the previous commit, we can get trapped in sequences like this: Caller hands us a transaction t0 with D0-D3 attached. The defer ops chain will look like the following if the transaction rolls succeed: t1: D0(t0), D1(t0), D2(t0), D3(t0) t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) In transaction 9, we finish d9 and try to roll to t10 while holding onto an intent item for D3 that we logged in t0. The previous commit changed the order in which we place new defer ops in the defer ops processing chain to reduce the maximum chain length. Now make xfs_defer_finish_noroll capable of relogging the entire chain periodically so that we can always move the log tail forward. Most chains will never get relogged, except for operations that generate very long chains (large extents containing many blocks with different sharing levels) or are on filesystems with small logs and a lot of ongoing metadata updates. Callers are now required to ensure that the transaction reservation is large enough to handle logging done items and new intent items for the maximum possible chain length. Most callers are careful to keep the chain lengths low, so the overhead should be minimal. The decision to relog an intent item is made based on whether the intent was logged in a previous checkpoint, since there's no point in relogging an intent into the same checkpoint. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: change the order in which child and parent defer ops are finishedDarrick J. Wong
The defer ops code has been finishing items in the wrong order -- if a top level defer op creates items A and B, and finishing item A creates more defer ops A1 and A2, we'll put the new items on the end of the chain and process them in the order A B A1 A2. This is kind of weird, since it's convenient for programmers to be able to think of A and B as an ordered sequence where all the sub-tasks for A must finish before we move on to B, e.g. A A1 A2 D. Right now, our log intent items are not so complex that this matters, but this will become important for the atomic extent swapping patchset. In order to maintain correct reference counting of extents, we have to unmap and remap extents in that order, and we want to complete that work before moving on to the next range that the user wants to swap. This patch fixes defer ops to satsify that requirement. The primary symptom of the incorrect order was noticed in an early performance analysis of the atomic extent swap code. An astonishingly large number of deferred work items accumulated when userspace requested an atomic update of two very fragmented files. The cause of this was traced to the same ordering bug in the inner loop of xfs_defer_finish_noroll. If the ->finish_item method of a deferred operation queues new deferred operations, those new deferred ops are appended to the tail of the pending work list. To illustrate, say that a caller creates a transaction t0 with four deferred operations D0-D3. The first thing defer ops does is roll the transaction to t1, leaving us with: t1: D0(t0), D1(t0), D2(t0), D3(t0) Let's say that finishing each of D0-D3 will create two new deferred ops. After finish D0 and roll, we'll have the following chain: t2: D1(t0), D2(t0), D3(t0), d4(t1), d5(t1) d4 and d5 were logged to t1. Notice that while we're about to start work on D1, we haven't actually completed all the work implied by D0 being finished. So far we've been careful (or lucky) to structure the dfops callers such that D1 doesn't depend on d4 or d5 being finished, but this is a potential logic bomb. There's a second problem lurking. Let's see what happens as we finish D1-D3: t3: D2(t0), D3(t0), d4(t1), d5(t1), d6(t2), d7(t2) t4: D3(t0), d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3) t5: d4(t1), d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) Let's say that d4-d11 are simple work items that don't queue any other operations, which means that we can complete each d4 and roll to t6: t6: d5(t1), d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) t7: d6(t2), d7(t2), d8(t3), d9(t3), d10(t4), d11(t4) ... t11: d10(t4), d11(t4) t12: d11(t4) <done> When we try to roll to transaction #12, we're holding defer op d11, which we logged way back in t4. This means that the tail of the log is pinned at t4. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot get roll to t11 because there isn't enough space left before we'd run into t4. Let's shift back to the original failure. I mentioned before that I discovered this flaw while developing the atomic file update code. In that scenario, we have a defer op (D0) that finds a range of file blocks to remap, creates a handful of new defer ops to do that, and then asks to be continued with however much work remains. So, D0 is the original swapext deferred op. The first thing defer ops does is rolls to t1: t1: D0(t0) We try to finish D0, logging d1 and d2 in the process, but can't get all the work done. We log a done item and a new intent item for the work that D0 still has to do, and roll to t2: t2: D0'(t1), d1(t1), d2(t1) We roll and try to finish D0', but still can't get all the work done, so we log a done item and a new intent item for it, requeue D0 a second time, and roll to t3: t3: D0''(t2), d1(t1), d2(t1), d3(t2), d4(t2) If it takes 48 more rolls to complete D0, then we'll finally dispense with D0 in t50: t50: D<fifty primes>(t49), d1(t1), ..., d102(t50) We then try to roll again to get a chain like this: t51: d1(t1), d2(t1), ..., d101(t50), d102(t50) ... t152: d102(t50) <done> Notice that in rolling to transaction #51, we're holding on to a log intent item for d1 that was logged in transaction #1. This means that the tail of the log is pinned at t1. If the log is very small or there are a lot of other threads updating metadata, this means that we might have wrapped the log and cannot roll to t51 because there isn't enough space left before we'd run into t1. This is of course problem #2 again. But notice the third problem with this scenario: we have 102 defer ops tied to this transaction! Each of these items are backed by pinned kernel memory, which means that we risk OOM if the chains get too long. Yikes. Problem #1 is a subtle logic bomb that could hit someone in the future; problem #2 applies (rarely) to the current upstream, and problem #3 applies to work under development. This is not how incremental deferred operations were supposed to work. The dfops design of logging in the same transaction an intent-done item and a new intent item for the work remaining was to make it so that we only have to juggle enough deferred work items to finish that one small piece of work. Deferred log item recovery will find that first unfinished work item and restart it, no matter how many other intent items might follow it in the log. Therefore, it's ok to put the new intents at the start of the dfops chain. For the first example, the chains look like this: t2: d4(t1), d5(t1), D1(t0), D2(t0), D3(t0) t3: d5(t1), D1(t0), D2(t0), D3(t0) ... t9: d9(t7), D3(t0) t10: D3(t0) t11: d10(t10), d11(t10) t12: d11(t10) For the second example, the chains look like this: t1: D0(t0) t2: d1(t1), d2(t1), D0'(t1) t3: d2(t1), D0'(t1) t4: D0'(t1) t5: d1(t4), d2(t4), D0''(t4) ... t148: D0<50 primes>(t147) t149: d101(t148), d102(t148) t150: d102(t148) <done> This actually sucks more for pinning the log tail (we try to roll to t10 while holding an intent item that was logged in t1) but we've solved problem #1. We've also reduced the maximum chain length from: sum(all the new items) + nr_original_items to: max(new items that each original item creates) + nr_original_items This solves problem #3 by sharply reducing the number of defer ops that can be attached to a transaction at any given time. The change makes the problem of log tail pinning worse, but is improvement we need to solve problem #2. Actually solving #2, however, is left to the next patch. Note that a subsequent analysis of some hard-to-trigger reflink and COW livelocks on extremely fragmented filesystems (or systems running a lot of IO threads) showed the same symptoms -- uncomfortably large numbers of incore deferred work items and occasional stalls in the transaction grant code while waiting for log reservations. I think this patch and the next one will also solve these problems. As originally written, the code used list_splice_tail_init instead of list_splice_init, so change that, and leave a short comment explaining our actions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: fix an incore inode UAF in xfs_bui_recoverDarrick J. Wong
In xfs_bui_item_recover, there exists a use-after-free bug with regards to the inode that is involved in the bmap replay operation. If the mapping operation does not complete, we call xfs_bmap_unmap_extent to create a deferred op to finish the unmapping work, and we retain a pointer to the incore inode. Unfortunately, the very next thing we do is commit the transaction and drop the inode. If reclaim tears down the inode before we try to finish the defer ops, we dereference garbage and blow up. Therefore, create a way to join inodes to the defer ops freezer so that we can maintain the xfs_inode reference until we're done with the inode. Note: This imposes the requirement that there be enough memory to keep every incore inode in memory throughout recovery. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07xfs: clean up xfs_bui_item_recover iget/trans_alloc/ilock orderingDarrick J. Wong
In most places in XFS, we have a specific order in which we gather resources: grab the inode, allocate a transaction, then lock the inode. xfs_bui_item_recover doesn't do it in that order, so fix it to be more consistent. This also makes the error bailout code a bit less weird. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: clean up bmap intent item recovery checkingDarrick J. Wong
The bmap intent item checking code in xfs_bui_item_recover is spread all over the function. We should check the recovered log item at the top before we allocate any resources or do anything else, so do that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07xfs: xfs_defer_capture should absorb remaining transaction reservationDarrick J. Wong
When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the transaction reservation type from the old transaction so that when we continue the dfops chain, we still use the same reservation parameters. Doing this means that the log item recovery functions get to determine the transaction reservation instead of abusing tr_itruncate in yet another part of xfs. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07xfs: xfs_defer_capture should absorb remaining block reservationsDarrick J. Wong
When xfs_defer_capture extracts the deferred ops and transaction state from a transaction, it should record the remaining block reservations so that when we continue the dfops chain, we can reserve the same number of blocks to use. We capture the reservations for both data and realtime volumes. This adds the requirement that every log intent item recovery function must be careful to reserve enough blocks to handle both itself and all defer ops that it can queue. On the other hand, this enables us to do away with the handwaving block estimation nonsense that was going on in xlog_finish_defer_ops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-10-07xfs: proper replay of deferred ops queued during log recoveryDarrick J. Wong
When we replay unfinished intent items that have been recovered from the log, it's possible that the replay will cause the creation of more deferred work items. As outlined in commit 509955823cc9c ("xfs: log recovery should replay deferred ops in order"), later work items have an implicit ordering dependency on earlier work items. Therefore, recovery must replay the items (both recovered and created) in the same order that they would have been during normal operation. For log recovery, we enforce this ordering by using an empty transaction to collect deferred ops that get created in the process of recovering a log intent item to prevent them from being committed before the rest of the recovered intent items. After we finish committing all the recovered log items, we allocate a transaction with an enormous block reservation, splice our huge list of created deferred ops into that transaction, and commit it, thereby finishing all those ops. This is /really/ hokey -- it's the one place in XFS where we allow nested transactions; the splicing of the defer ops list is is inelegant and has to be done twice per recovery function; and the broken way we handle inode pointers and block reservations cause subtle use-after-free and allocator problems that will be fixed by this patch and the two patches after it. Therefore, replace the hokey empty transaction with a structure designed to capture each chain of deferred ops that are created as part of recovering a single unfinished log intent. Finally, refactor the loop that replays those chains to do so using one transaction per chain. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07xfs: remove XFS_LI_RECOVEREDDarrick J. Wong
The ->iop_recover method of a log intent item removes the recovered intent item from the AIL by logging an intent done item and committing the transaction, so it's superfluous to have this flag check. Nothing else uses it, so get rid of the flag entirely. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-10-07xfs: remove xfs_defer_resetDarrick J. Wong
Remove this one-line helper since the assert is trivially true in one call site and the rest obscures a bitmask operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-09-30xfs: fix finobt btree block recovery orderingDave Chinner
Nathan popped up on #xfs and pointed out that we fail to handle finobt btree blocks in xlog_recover_get_buf_lsn(). This means they always fall through the entire magic number matching code to "recover immediately". Whilst most of the time this is the correct behaviour, occasionally it will be incorrect and could potentially overwrite more recent metadata because we don't check the LSN in the on disk metadata at all. This bug has been present since the finobt was first introduced, and is a potential cause of the occasional xfs_iget_check_free_state() failures we see that indicate that the inode btree state does not match the on disk inode state. Fixes: aafc3c246529 ("xfs: support the XFS_BTNUM_FINOBT free inode btree type") Reported-by: Nathan Scott <nathans@redhat.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-09-25xfs: remove deprecated sysctl optionsxfs-5.10-merge_2020-10-05xfs-5.10-merge_2020-10-04xfs-5.10-merge_2020-10-01xfs-5.10-merge_2020-09-29xfs-5.10-merge_2020-09-27Pavel Reichl
These optionr were for Irix compatibility, probably for clustered XFS clients in a heterogenous cluster which contained both Irix & Linux machines, so that behavior would be consistent. That doesn't exist anymore and it's no longer needed. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: actually state when the sysctls go away] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-25xfs: remove deprecated mount optionsPavel Reichl
ikeep/noikeep was a workaround for old DMAPI code which is no longer relevant. attr2/noattr2 - is for controlling upgrade behaviour from fixed attribute fork sizes in the inode (attr1) and dynamic attribute fork sizes (attr2). mkfs has defaulted to setting attr2 since 2007, hence just about every XFS filesystem out there in production right now uses attr2. Signed-off-by: Pavel Reichl <preichl@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix minor typos] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-25xfs: directly call xfs_generic_create() for ->create() and ->mkdir()Kaixu Xia
The current create and mkdir handlers both call the xfs_vn_mknod() which is a wrapper routine around xfs_generic_create() function. Actually the create and mkdir handlers can directly call xfs_generic_create() function and reduce the call chain. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-25xfs: avoid shared rmap operations for attr fork extentsDarrick J. Wong
During code review, I noticed that the rmap code uses the (slower) shared mappings rmap functions for any extent of a reflinked file, even if those extents are for the attr fork, which doesn't support sharing. We can speed up rmap a tiny bit by optimizing out this case. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: drop the obsolete comment on filestream lockingGao Xiang
Since commit 1c1c6ebcf52 ("xfs: Replace per-ag array with a radix tree"), there is no m_peraglock anymore, so it's hard to understand the described situation since per-ag is no longer an array and no need to reallocate, call xfs_filestream_flush() in growfs. In addition, the race condition for shrink feature is quite confusing to me currently as well. Get rid of it instead. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: code cleanup in xfs_attr_leaf_entsize_{remote,local}Kaixu Xia
Cleanup the typedef usage, the unnecessary parentheses, the unnecessary backslash and use the open-coded round_up call in xfs_attr_leaf_entsize_{remote,local}. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: do the assert for all the log done items in xfs_trans_cancelKaixu Xia
We should do the assert for all the log intent-done items if they appear here. This patch detect intent-done items by the fact that their item ops don't have iop_unpin and iop_push methods and also move the helper xlog_item_is_intent to xfs_trans.h. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-25xfs: remove the unused parameter id from xfs_qm_dqattach_oneKaixu Xia
Since we never use the second parameter id, so remove it from xfs_qm_dqattach_one() function. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: remove the redundant crc feature check in xfs_attr3_rmt_verifyKaixu Xia
We already check whether the crc feature is enabled before calling xfs_attr3_rmt_verify(), so remove the redundant feature check in that function. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: fix some commentsKaixu Xia
Fix the comments to help people understand the code. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> [darrick: fix the indenting problems too] Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: remove the unnecessary xfs_dqid_t type castKaixu Xia
Since the type prid_t and xfs_dqid_t both are uint32_t, seems the type cast is unnecessary, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: use the existing type definition for di_projidKaixu Xia
We have already defined the project ID type prid_t, so maybe should use it here. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-25xfs: remove the unused SYNCHRONIZE macroKaixu Xia
There are no callers of the SYNCHRONIZE() macro, so remove it. Signed-off-by: Kaixu Xia <kaixuxia@tencent.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-23xfs: clean up calculation of LR header blocksGao Xiang
Let's use DIV_ROUND_UP() to calculate log record header blocks as what did in xlog_get_iclog_buffer_size() and wrap up a common helper for log recovery. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-23xfs: avoid LR buffer overrun due to crafted h_lenGao Xiang
Currently, crafted h_len has been blocked for the log header of the tail block in commit a70f9fe52daa ("xfs: detect and handle invalid iclog size set by mkfs"). However, each log record could still have crafted h_len and cause log record buffer overrun. So let's check h_len vs buffer size for each log record as well. Signed-off-by: Gao Xiang <hsiangkao@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-09-23xfs: don't release log intent items when recovery failsDarrick J. Wong
Nowadays, log recovery will call ->release on the recovered intent items if recovery fails. Therefore, it's redundant to release them from inside the ->recover functions when they're about to return an error. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-23xfs: attach inode to dquot in xfs_bui_item_recoverDarrick J. Wong
In the bmap intent item recovery code, we must be careful to attach the inode to its dquots (if quotas are enabled) so that a change in the shape of the bmap btree doesn't cause the quota counters to be incorrect. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-23xfs: log new intent items created as part of finishing recovered intent itemsDarrick J. Wong
During a code inspection, I found a serious bug in the log intent item recovery code when an intent item cannot complete all the work and decides to requeue itself to get that done. When this happens, the item recovery creates a new incore deferred op representing the remaining work and attaches it to the transaction that it allocated. At the end of _item_recover, it moves the entire chain of deferred ops to the dummy parent_tp that xlog_recover_process_intents passed to it, but fail to log a new intent item for the remaining work before committing the transaction for the single unit of work. xlog_finish_defer_ops logs those new intent items once recovery has finished dealing with the intent items that it recovered, but this isn't sufficient. If the log is forced to disk after a recovered log item decides to requeue itself and the system goes down before we call xlog_finish_defer_ops, the second log recovery will never see the new intent item and therefore has no idea that there was more work to do. It will finish recovery leaving the filesystem in a corrupted state. The same logic applies to /any/ deferred ops added during intent item recovery, not just the one handling the remaining work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-23xfs: check dabtree node hash values when loading child blocksDarrick J. Wong
When xchk_da_btree_block is loading a non-root dabtree block, we know that the parent block had to have a (hashval, address) pointer to the block that we just loaded. Check that the hashval in the parent matches the block we just loaded. This was found by fuzzing nbtree[3].hashval = ones in xfs/394. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-23xfs: don't free rt blocks when we're doing a REMAP bunmapi callDarrick J. Wong
When callers pass XFS_BMAPI_REMAP into xfs_bunmapi, they want the extent to be unmapped from the given file fork without the extent being freed. We do this for non-rt files, but we forgot to do this for realtime files. So far this isn't a big deal since nobody makes a bunmapi call to a rt file with the REMAP flag set, but don't leave a logic bomb. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2020-09-23xfs: Set xfs_buf's b_ops member when zeroing bitmap/summary filesChandan Babu R
In xfs_growfs_rt(), we enlarge bitmap and summary files by allocating new blocks for both files. For each of the new blocks allocated, we allocate an xfs_buf, zero the payload, log the contents and commit the transaction. Hence these buffers will eventually find themselves appended to list at xfs_ail->ail_buf_list. Later, xfs_growfs_rt() loops across all of the new blocks belonging to the bitmap inode to set the bitmap values to 1. In doing so, it allocates a new transaction and invokes the following sequence of functions, - xfs_rtfree_range() - xfs_rtmodify_range() - xfs_rtbuf_get() We pass '&xfs_rtbuf_ops' as the ops pointer to xfs_trans_read_buf(). - xfs_trans_read_buf() We find the xfs_buf of interest in per-ag hash table, invoke xfs_buf_reverify() which ends up assigning '&xfs_rtbuf_ops' to xfs_buf->b_ops. On the other hand, if xfs_growfs_rt_alloc() had allocated a few blocks for the bitmap inode and returned with an error, all the xfs_bufs corresponding to the new bitmap blocks that have been allocated would continue to be on xfs_ail->ail_buf_list list without ever having a non-NULL value assigned to their b_ops members. An AIL flush operation would then trigger the following warning message to be printed on the console, XFS (loop0): _xfs_buf_ioapply: no buf ops on daddr 0x58 len 8 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ CPU: 3 PID: 449 Comm: xfsaild/loop0 Not tainted 5.8.0-rc4-chandan-00038-g4d8c2b9de9ab-dirty #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0x57/0x70 _xfs_buf_ioapply+0x37c/0x3b0 ? xfs_rw_bdev+0x1e0/0x1e0 ? xfs_buf_delwri_submit_buffers+0xd4/0x210 __xfs_buf_submit+0x6d/0x1f0 xfs_buf_delwri_submit_buffers+0xd4/0x210 xfsaild+0x2c8/0x9e0 ? __switch_to_asm+0x42/0x70 ? xfs_trans_ail_cursor_first+0x80/0x80 kthread+0xfe/0x140 ? kthread_park+0x90/0x90 ret_from_fork+0x22/0x30 This message indicates that the xfs_buf had its b_ops member set to NULL. This commit fixes the issue by assigning "&xfs_rtbuf_ops" to b_ops member of each of the xfs_bufs logged by xfs_growfs_rt_alloc(). Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-21xfs: Set xfs_buf type flag when growing summary/bitmap filesChandan Babu R
The following sequence of commands, mkfs.xfs -f -m reflink=0 -r rtdev=/dev/loop1,size=10M /dev/loop0 mount -o rtdev=/dev/loop1 /dev/loop0 /mnt xfs_growfs /mnt ... causes the following call trace to be printed on the console, XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 331 Call Trace: xfs_buf_item_format+0x632/0x680 ? kmem_alloc_large+0x29/0x90 ? kmem_alloc+0x70/0x120 ? xfs_log_commit_cil+0x132/0x940 xfs_log_commit_cil+0x26f/0x940 ? xfs_buf_item_init+0x1ad/0x240 ? xfs_growfs_rt_alloc+0x1fc/0x280 __xfs_trans_commit+0xac/0x370 xfs_growfs_rt_alloc+0x1fc/0x280 xfs_growfs_rt+0x1a0/0x5e0 xfs_file_ioctl+0x3fd/0xc70 ? selinux_file_ioctl+0x174/0x220 ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x3e/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This occurs because the buffer being formatted has the value of XFS_BLFT_UNKNOWN_BUF assigned to the 'type' subfield of bip->bli_formats->blf_flags. This commit fixes the issue by assigning one of XFS_BLFT_RTSUMMARY_BUF and XFS_BLFT_RTBITMAP_BUF to the 'type' subfield of bip->bli_formats->blf_flags before committing the corresponding transaction. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-21xfs: drop extra transaction roll from inode extent truncateBrian Foster
The inode extent truncate path unmaps extents from the inode block mapping, finishes deferred ops to free the associated extents and then explicitly rolls the transaction before processing the next extent. The latter extent roll is spurious as xfs_defer_finish() always returns a clean transaction and automatically relogs inodes attached to the transaction (with lock_flags == 0). This can unnecessarily increase the number of log ticket regrants that occur during a long running truncate operation. Remove the explicit transaction roll. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15xfs: deprecate the V4 formatDarrick J. Wong
The V4 filesystem format contains known weaknesses in the on-disk format that make metadata verification diffiult. In addition, the format does not support dates past 2038 and will not be upgraded to do so. We should start the process of retiring the old format to close off attack surfaces and to encourage users to migrate onto V5. Therefore, make XFS V4 support a configurable option. For the first period it will be default Y in case some distributors want to withdraw support early; for the second period it will be default N so that anyone who wishes to continue support can do so; and after that, support will be removed from the kernel. Dates for these events have been added to the upstream kernel. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2020-09-15xfs: don't propagate RTINHERIT -> REALTIME when there is no rtdevDarrick J. Wong
While running generic/042 with -drtinherit=1 set in MKFS_OPTIONS, I observed that the kernel will gladly set the realtime flag on any file created on the loopback filesystem even though that filesystem doesn't actually have a realtime device attached. This leads to verifier failures and doesn't make any sense, so be smarter about this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-15xfs: refactor inode flags propagation codeDarrick J. Wong
Hoist the code that propagates di_flags and di_flags2 from a parent to a new child into separate functions. No functional changes. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-15xfs: ensure that fpunch, fcollapse, and finsert operations are aligned to rt ↵xfs-5.10-merge-2Darrick J. Wong
extent size Make sure that any fallocate operation that requires the range to be block-aligned also checks that the range is aligned to the realtime extent size. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-15xfs: make sure the rt allocator doesn't run off the endDarrick J. Wong
There's an overflow bug in the realtime allocator. If the rt volume is large enough to handle a single allocation request that is larger than the maximum bmap extent length and the rt bitmap ends exactly on a bitmap block boundary, it's possible that the near allocator will try to check the freeness of a range that extends past the end of the bitmap. This fails with a corruption error and shuts down the fs. Therefore, constrain maxlen so that the range scan cannot run off the end of the rt bitmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-15xfs: Remove unneeded semicolonZheng Bin
Fixes coccicheck warning: fs/xfs/xfs_icache.c:1214:2-3: Unneeded semicolon Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15xfs: force the log after remapping a synchronous-writes fileDarrick J. Wong
Commit 5833112df7e9 tried to make it so that a remap operation would force the log out to disk if the filesystem is mounted with mandatory synchronous writes. Unfortunately, that commit failed to handle the case where the inode or the file descriptor require mandatory synchronous writes. Refactor the check into into a helper that will look for all three conditions, and now we can treat reflink just like any other synchronous write. Fixes: 5833112df7e9 ("xfs: reflink should force the log out if mounted with wsync") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2020-09-15xfs: Convert xfs_attr_sf macros to inline functionsCarlos Maiolino
xfs_attr_sf_totsize() requires access to xfs_inode structure, so, once xfs_attr_shortform_addname() is its only user, move it to xfs_attr.c instead of playing with more #includes. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15xfs: Use variable-size array for nameval in xfs_attr_sf_entryCarlos Maiolino
nameval is a variable-size array, so, define it as it, and remove all the -1 magic number subtractions Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-09-15xfs: Remove typedef xfs_attr_shortform_tCarlos Maiolino
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>