summaryrefslogtreecommitdiff
path: root/fs/xfs
AgeCommit message (Collapse)Author
2017-06-27xfs: grab dquots without taking the ilockxfs-4.13-merge-1Darrick J. Wong
Add a new dqget flag that grabs the dquot without taking the ilock. This will be used by the scrubber (which will have already grabbed the ilock) to perform basic sanity checking of the quota data. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-27xfs: fix semicolon.cocci warningskbuild test robot
fs/xfs/xfs_log.c:2092:38-39: Unneeded semicolon Remove unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Fixes: d4ca1d550d05 ("xfs: dump transaction usage details on log reservation overrun") CC: Brian Foster <bfoster@redhat.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-27xfs: Don't clear SGID when inheriting ACLsJan Kara
When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by calling __xfs_set_acl() instead of xfs_set_acl() when setting up inode in xfs_generic_create(). That prevents SGID bit clearing and mode is properly set by posix_acl_create() anyway. We also reorder arguments of __xfs_set_acl() to match the ordering of xfs_set_acl() to make things consistent. Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: stable@vger.kernel.org CC: Darrick J. Wong <darrick.wong@oracle.com> CC: linux-xfs@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-27xfs: free cowblocks and retry on buffered write ENOSPCBrian Foster
XFS runs an eofblocks reclaim scan before returning an ENOSPC error to userspace for buffered writes. This facilitates aggressive speculative preallocation without causing user visible side effects such as premature ENOSPC. Run a cowblocks scan in the same situation to reclaim lingering COW fork preallocation throughout the filesystem. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-27xfs: replace log_badcrc_factor knob with error injection tagBrian Foster
Now that error injection tags support dynamic frequency adjustment, replace the debug mode sysfs knob that controls log record CRC error injection with an error injection tag. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-27xfs: convert drop_writes to use the errortag mechanismDarrick J. Wong
We now have enhanced error injection that can control the frequency with which errors happen, so convert drop_writes to use this. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27xfs: remove unneeded parameter from XFS_TEST_ERRORDarrick J. Wong
Since we moved the injected error frequency controls to the mountpoint, we can get rid of the last argument to XFS_TEST_ERROR. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27xfs: expose errortag knobs via sysfsDarrick J. Wong
Creates a /sys/fs/xfs/$dev/errortag/ directory to control the errortag values directly. This enables us to control the randomness values, rather than having to accept the defaults. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-27xfs: make errortag a per-mountpoint structureDarrick J. Wong
Remove the xfs_etest structure in favor of a per-mountpoint structure. This will give us the flexibility to set as many error injection points as we want, and later enable us to set up sysfs knobs to set the trigger frequency as we wish. This comes at a cost of higher memory use, but unti we hit 1024 injection points (we're at 29) or a lot of mounts this shouldn't be a huge issue. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
2017-06-24xfs: free uncommitted transactions during log recoveryBrian Foster
Log recovery allocates in-core transaction and member item data structures on-demand as it processes the on-disk log. Transactions are allocated on first encounter on-disk and stored in a hash table structure where they are easily accessible for subsequent lookups. Transaction items are also allocated on demand and are attached to the associated transactions. When a commit record is encountered in the log, the transaction is committed to the fs and the in-core structures are freed. If a filesystem crashes or shuts down before all in-core log buffers are flushed to the log, however, not all transactions may have commit records in the log. As expected, the modifications in such an incomplete transaction are not replayed to the fs. The in-core data structures for the partial transaction are never freed, however, resulting in a memory leak. Update xlog_do_recovery_pass() to first correctly initialize the hash table array so empty lists can be distinguished from populated lists on function exit. Update xlog_recover_free_trans() to always remove the transaction from the list prior to freeing the associated memory. Finally, walk the hash table of transaction lists as the last step before it goes out of scope and free any transactions that may remain on the lists. This prevents a memory leak of partial transactions in the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-20xfs: don't allow bmap on rt filesDarrick J. Wong
bmap returns a dumb LBA address but not the block device that goes with that LBA. Swapfiles don't care about this and will blindly assume that the data volume is the correct blockdev, which is totally bogus for files on the rt subvolume. This results in the swap code doing IOs to arbitrary locations on the data device(!) if the passed in mapping is a realtime file, so just turn off bmap for rt files. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-20xfs: allow reading of already-locked remote symbolic linkDarrick J. Wong
Expose the readlink variant that doesn't take the inode lock so that the scrubber can inspect symlink contents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20xfs: pass along transaction context when reading xattr block buffersDarrick J. Wong
Teach the extended attribute reading functions to pass along a transaction context if one was supplied. The extended attribute scrub code will use transactions to lock buffers and avoid deadlocking with itself in the case of loops; since it will already have the inode locked, also create xattr get/list helpers that don't take locks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20xfs: pass along transaction context when reading directory block buffersDarrick J. Wong
Teach the directory reading functions to pass along a transaction context if one was supplied. The directory scrub code will use transactions to lock buffers and avoid deadlocking with itself in the case of loops. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20xfs: return the hash value of a leaf1 directory blockDarrick J. Wong
Modify the existing dir leafn lasthash function to enable us to calculate the highest hash value of a leaf1 block. This will be used by the directory scrubbing code to check the sanity of hashes in leaf1 directory blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20xfs: refactor the ifork block counting functionDarrick J. Wong
Refactor the inode fork block counting function to count extents for us at the same time. This will be used by the bmbt scrubber function. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-20xfs: make _bmap_count_blocks consistent wrt delalloc extent behaviorDarrick J. Wong
There is an inconsistency in the way that _bmap_count_blocks deals with delalloc reservations -- if the specified fork is in extents format, *count is set to the total number of blocks referenced by the in-core fork, including delalloc extents. However, if the fork is in btree format, *count is set to the number of blocks referenced by the on-disk fork, which does /not/ include delalloc extents. For the lone existing caller of _bmap_count_blocks this hasn't been an issue because the function is only used to count xattr fork blocks (where there aren't any delalloc reservations). However, when scrub comes along it will use this same function to check di_nblocks against both on-disk extent maps, so we need this behavior to be consistent. Therefore, fix _bmap_count_leaves not to include delalloc extents and remove unnecessary parameters. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: separate function to check if inode shares extentsDarrick J. Wong
Separate the "clear reflink flag" function into one function that checks if the flag is needed, and a second function that checks and clears the flag. The inode scrub code will want to check the necessity of the flag without clearing it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: reflink find shared should take a transactionDarrick J. Wong
Adapt _reflink_find_shared to take an optional transaction pointer. The inode scrubber code will need to decide (within transaction context) if a file has shared blocks. To avoid buffer deadlocks, we must pass the tp through to this function's utility calls. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: check if an inode is cached and allocatedDarrick J. Wong
Check the inode cache for a particular inode number. If it's in the cache, check that it's not currently being reclaimed. If it's not being reclaimed, return zero if the inode is allocated. This function will be used by various scrubbers to decide if the cache is more up to date than the disk in terms of checking if an inode is allocated. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: export _inobt_btrec_to_irec and _ialloc_cluster_alignment for scrubDarrick J. Wong
Create a function to extract an in-core inobt record from a generic btree_rec union so that scrub will be able to check inobt records and check inode block alignment. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: plumb in needed functions for range querying of various btreesDarrick J. Wong
Plumb in the pieces (init_high_key, diff_two_keys) necessary to call query_range on the inode space and block mapping btrees and to extract raw btree records. This will eventually be used by the inobt and bmbt scrubbers. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: export various function for the online scrubberDarrick J. Wong
Export various internal functions so that the online scrubber can use them to check the state of metadata. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: always compile the btree inorder check functionsDarrick J. Wong
The btree record and key inorder check functions will be used by the btree scrubber code, so make sure they're always built. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: remove double-underscore integer typesDarrick J. Wong
This is a purely mechanical patch that removes the private __{u,}int{8,16,32,64}_t typedefs in favor of using the system {u,}int{8,16,32,64}_t typedefs. This is the sed script used to perform the transformation and fix the resulting whitespace and indentation errors: s/typedef\t__uint8_t/typedef __uint8_t\t/g s/typedef\t__uint/typedef __uint/g s/typedef\t__int\([0-9]*\)_t/typedef int\1_t\t/g s/__uint8_t\t/__uint8_t\t\t/g s/__uint/uint/g s/__int\([0-9]*\)_t\t/__int\1_t\t\t/g s/__int/int/g /^typedef.*int[0-9]*_t;$/d Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19xfs: optimize _btree_query_allDarrick J. Wong
Don't bother wandering our way through the leaf nodes when the caller issues a query_all; just zoom down the left side of the tree and walk rightwards along level zero. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-06-19xfs: remove bli from AIL before release on transaction abortBrian Foster
When a buffer is modified, logged and committed, it ultimately ends up sitting on the AIL with a dirty bli waiting for metadata writeback. If another transaction locks and invalidates the buffer (freeing an inode chunk, for example) in the meantime, the bli is flagged as stale, the dirty state is cleared and the bli remains in the AIL. If a shutdown occurs before the transaction that has invalidated the buffer is committed, the transaction is ultimately aborted. The log items are flagged as such and ->iop_unlock() handles the aborted items. Because the bli is clean (due to the invalidation), ->iop_unlock() unconditionally releases it. The log item may still reside in the AIL, however, which means the I/O completion handler may still run and attempt to access it. This results in assert failure due to the release of the bli while still present in the AIL and a subsequent NULL dereference and panic in the buffer I/O completion handling. This can be reproduced by running generic/388 in repetition. To avoid this problem, update xfs_buf_item_unlock() to first check whether the bli is aborted and if so, remove it from the AIL before it is released. This ensures that the bli is no longer accessed during the shutdown sequence after it has been freed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: release bli from transaction properly on fs shutdownBrian Foster
If a filesystem shutdown occurs with a buffer log item in the CIL and a log force occurs, the ->iop_unpin() handler is generally expected to tear down the bli properly. This entails freeing the bli memory and releasing the associated hold on the buffer so it can be released and the filesystem unmounted. If this sequence occurs while ->bli_refcount is elevated (i.e., another transaction is open and attempting to modify the buffer), however, ->iop_unpin() may not be responsible for releasing the bli. Instead, the transaction may release the final ->bli_refcount reference and thus xfs_trans_brelse() is responsible for tearing down the bli. While xfs_trans_brelse() does drop the reference count, it only attempts to release the bli if it is clean (i.e., not in the CIL/AIL). If the filesystem is shutdown and the bli is sitting dirty in the CIL as noted above, this ends up skipping the last opportunity to release the bli. In turn, this leaves the hold on the buffer and causes an unmount hang. This can be reproduced by running generic/388 in repetition. Update xfs_trans_brelse() to handle this shutdown corner case correctly. If the final bli reference is dropped and the filesystem is shutdown, remove the bli from the AIL (if necessary) and release the bli to drop the buffer hold and ensure an unmount does not hang. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: avoid harmless gcc-7 warningsArnd Bergmann
gcc-7 flags the use of integer math inside of a condition as a potential bug: fs/xfs/xfs_bmap_util.c: In function 'xfs_swap_extents_check_format': fs/xfs/xfs_bmap_util.c:1619:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context] fs/xfs/xfs_bmap_util.c:1629:8: error: '<<' in boolean context, did you mean '<' ? [-Werror=int-in-bool-context] There is already a helper function for testing the di_forkoff field for zero, so let's use that instead to shut up the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: remove lsn relevant fields from xfs_trans structure and its usersShan Hai
The t_lsn is not used anymore and the t_commit_lsn is used as a tmp storage for the checkpoint sequence number only in the current code. And the start/commit lsn are tracked as a transaction group tag in the xfs_cil_ctx instead of a single transaction, so remove them from the xfs_trans structure and their users to match with the design. Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: remove XFS_HSIZEChristoph Hellwig
XFS_HSIZE is an extremly confusing way to calculate the size of handle_t. Given that handle_t always only had two sizes, and one of them isn't even covered by XFS_HSIZE to start with just remove the macro and use a constant sizeof expression. Note that XFS_HSIZE isn't used in xfsprogs, xfsdump or xfstests either. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: dump transaction usage details on log reservation overrunBrian Foster
If a transaction log reservation overrun occurs, the ticket data associated with the reservation is dumped in xfs_log_commit_cil(). This occurs long after the transaction items and details have been removed from the transaction and effectively lost. This limited set of ticket data provides very little information to support debugging transaction overruns based on the typical report. To improve transaction log reservation overrun reporting, create a helper to dump transaction details such as log items, log vector data, etc., as well as the underlying ticket data for the transaction. Move the overrun detection from xfs_log_commit_cil() to xlog_cil_insert_items() so it occurs prior to migration of the logged items to the CIL. Call the new helper such that it is able to dump this transaction data before it is lost. Also, warn on overrun to provide callstack context for the offending transaction and include a few additional messages from xlog_cil_insert_items() to display the reservation consumed locally for overhead such as log vector headers, split region headers and the context ticket. This provides a complete general breakdown of the reservation consumption of a transaction when/if it happens to overrun the reservation. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: refactor xlog_cil_insert_items() to facilitate transaction dumpBrian Foster
Transaction reservation overrun detection currently occurs too late to print useful information about the offending transaction. Ideally, the transaction data is printed before the associated log items are moved from the transaction to the CIL, which occurs in xlog_cil_insert_items(), such that details of the items logged by the transaction are available for analysis. Refactor xlog_cil_insert_items() to facilitate moving tx overrun detection to this function. Update the function to track each bit of extra log reservation stolen from the transaction (i.e., such as for the CIL context ticket) and perform the log item migration as the last operation before the CIL lock is released. This creates a context where the transaction reservation consumption has been fully calculated when the log items are moved to the CIL. This patch makes no functional changes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: separate shutdown from ticket reservation print helperBrian Foster
xlog_print_tic_res() pre-dates delayed logging and the committed items list (CIL) and thus retains some factoring warts, such as hard coded function names in the output and the fact that it induces a shutdown. In preparation for more detailed logging of regular transaction overrun situations, refactor xlog_print_tic_res() to be slightly more generic. Reword some of the warning messages and pull the shutdown into the callers. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: define fatal assert build time tunableBrian Foster
While configurable at runtime, the DEBUG mode assert failure behavior is usually either desired or not for a particular situation. For example, developers using kernel modules may prefer for fatal asserts to remain disabled across module reloads while QE engineers doing broad regression testing may prefer to have fatal asserts enabled on boot to facilitate data collection for bug reports. To provide a compromise/convenience for developers, create a Kconfig option that sets the default value of the DEBUG mode 'bug_on_assert' sysfs tunable. The default behavior remains to trigger kernel BUGs on assert failures to preserve existing behavior across kernel configuration updates with DEBUG mode enabled. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: define bug_on_assert debug mode sysfs tunableBrian Foster
In DEBUG mode, assert failures unconditionally trigger a kernel BUG. This is useful in diagnostic situations to panic a system and collect detailed state information at the time of a failure. This can also cause problems in cases where DEBUG mode code is desired but it is preferable not trigger kernel BUGs on assert failure. For example, during development of new code or during certain xfstests tests that intentionally cause corruption and test the kernel for survival (but otherwise may expect to trigger assert failures). To provide additional flexibility, create the <sysfs>/fs/xfs/debug/bug_on_assert tunable to configure assert failure behavior at runtime. This tunable is only available in DEBUG mode and is enabled by default to preserve existing default behavior. When disabled, assert failures in DEBUG mode result in kernel warnings. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-19xfs: try to avoid blowing out the transaction reservation when bunmaping a ↵Darrick J. Wong
shared extent In a pathological scenario where we are trying to bunmapi a single extent in which every other block is shared, it's possible that trying to unmap the entire large extent in a single transaction can generate so many EFIs that we overflow the transaction reservation. Therefore, use a heuristic to guess at the number of blocks we can safely unmap from a reflink file's data fork in an single transaction. This should prevent problems such as the log head slamming into the tail and ASSERTs that trigger because we've exceeded the transaction reservation. Note that since bunmapi can fail to unmap the entire range, we must also teach the deferred unmap code to roll into a new transaction whenever we get low on reservation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> [hch: random edits, all bugs are my fault] Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-19xfs: refactor dir2 leaf readahead shadow buffer clevernessDarrick J. Wong
Currently, the dir2 leaf block getdents function uses a complex state tracking mechanism to create a shadow copy of the block mappings and then uses the shadow copy to schedule readahead. Since the read and readahead functions are perfectly capable of reading the mappings themselves, we can tear all that out in favor of a simpler function that simply keeps pushing the readahead window further out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-06-19xfs: push buffer of flush locked dquot to avoid quotacheck deadlockBrian Foster
Reclaim during quotacheck can lead to deadlocks on the dquot flush lock: - Quotacheck populates a local delwri queue with the physical dquot buffers. - Quotacheck performs the xfs_qm_dqusage_adjust() bulkstat and dirties all of the dquots. - Reclaim kicks in and attempts to flush a dquot whose buffer is already queud on the quotacheck queue. The flush succeeds but queueing to the reclaim delwri queue fails as the backing buffer is already queued. The flush unlock is now deferred to I/O completion of the buffer from the quotacheck queue. - The dqadjust bulkstat continues and dirties the recently flushed dquot once again. - Quotacheck proceeds to the xfs_qm_flush_one() walk which requires the flush lock to update the backing buffers with the in-core recalculated values. It deadlocks on the redirtied dquot as the flush lock was already acquired by reclaim, but the buffer resides on the local delwri queue which isn't submitted until the end of quotacheck. This is reproduced by running quotacheck on a filesystem with a couple million inodes in low memory (512MB-1GB) situations. This is a regression as of commit 43ff2122e6 ("xfs: on-stack delayed write buffer lists"), which removed a trylock and buffer I/O submission from the quotacheck dquot flush sequence. Quotacheck first resets and collects the physical dquot buffers in a delwri queue. Then, it traverses the filesystem inodes via bulkstat, updates the in-core dquots, flushes the corrected dquots to the backing buffers and finally submits the delwri queue for I/O. Since the backing buffers are queued across the entire quotacheck operation, dquot reclaim cannot possibly complete a dquot flush before quotacheck completes. Therefore, quotacheck must submit the buffer for I/O in order to cycle the flush lock and flush the dirty in-core dquot to the buffer. Add a delwri queue buffer push mechanism to submit an individual buffer for I/O without losing the delwri queue status and use it from quotacheck to avoid the deadlock. This restores quotacheck behavior to as before the regression was introduced. Reported-by: Martin Svec <martin.svec@zoner.cz> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-17Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull xfs fix from Darrick Wong: "One more bugfix for you for 4.12-rc6 to fix something that came up in an earlier rc: - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y" * tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-08xfs: fix spurious spin_is_locked() assert failures on non-smp kernelsxfs-4.12-fixes-4Brian Foster
The 0-day kernel test robot reports assertion failures on !CONFIG_SMP kernels due to failed spin_is_locked() checks. As it turns out, spin_is_locked() is hardcoded to return zero on !CONFIG_SMP kernels and so this function cannot be relied on to verify spinlock state in this configuration. To avoid this problem, replace the associated asserts with lockdep variants that do the right thing regardless of kernel configuration. Drop the one assert that checks for an unlocked lock as there is no suitable lockdep variant for that case. This moves the spinlock checks from XFS debug code to lockdep, but generally provides the same level of protection. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-06-02Merge tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull XFS fix from Darrick Wong: "I've one more bugfix for you for 4.12-rc4: Fix an unmount hang due to a race in io buffer accounting" * tag 'xfs-4.12-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: use ->b_state to fix buffer I/O accounting release race
2017-05-31xfs: use ->b_state to fix buffer I/O accounting release racexfs-4.12-fixes-3Brian Foster
We've had user reports of unmount hangs in xfs_wait_buftarg() that analysis shows is due to btp->bt_io_count == -1. bt_io_count represents the count of in-flight asynchronous buffers and thus should always be >= 0. xfs_wait_buftarg() waits for this value to stabilize to zero in order to ensure that all untracked (with respect to the lru) buffers have completed I/O processing before unmount proceeds to tear down in-core data structures. The value of -1 implies an I/O accounting decrement race. Indeed, the fact that xfs_buf_ioacct_dec() is called from xfs_buf_rele() (where the buffer lock is no longer held) means that bp->b_flags can be updated from an unsafe context. While a user-level reproducer is currently not available, some intrusive hacks to run racing buffer lookups/ioacct/releases from multiple threads was used to successfully manufacture this problem. Existing callers do not expect to acquire the buffer lock from xfs_buf_rele(). Therefore, we can not safely update ->b_flags from this context. It turns out that we already have separate buffer state bits and associated serialization for dealing with buffer LRU state in the form of ->b_state and ->b_lock. Therefore, replace the _XBF_IN_FLIGHT flag with a ->b_state variant, update the I/O accounting wrappers appropriately and make sure they are used with the correct locking. This ensures that buffer in-flight state can be modified at buffer release time without racing with modifications from a buffer lock holder. Fixes: 9c7504aa72b6 ("xfs: track and serialize in-flight async buffers against unmount") Cc: <stable@vger.kernel.org> # v4.8+ Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Libor Pechacek <lpechacek@suse.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-26Merge tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull XFS fixes from Darrick Wong: "A few miscellaneous bug fixes & cleanups: - Fix indlen block reservation accounting bug when splitting delalloc extent - Fix warnings about unused variables that appeared in -rc1. - Don't spew errors when bmapping a local format directory - Fix an off-by-one error in a delalloc eof assertion - Make fsmap only return inode information for CAP_SYS_ADMIN - Fix a potential mount time deadlock recovering cow extents - Fix unaligned memory access in _btree_visit_blocks - Fix various SEEK_HOLE/SEEK_DATA bugs" * tag 'xfs-4.12-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Move handling of missing page into one place in xfs_find_get_desired_pgoff() xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff() xfs: Fix missed holes in SEEK_HOLE implementation xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff() xfs: fix unaligned access in xfs_btree_visit_blocks xfs: avoid mount-time deadlock in CoW extent recovery xfs: only return detailed fsmap info if the caller has CAP_SYS_ADMIN xfs: bad assertion for delalloc an extent that start at i_size xfs: fix warnings about unused stack variables xfs: BMAPX shouldn't barf on inline-format directories xfs: fix indlen accounting error on partial delalloc conversion
2017-05-25xfs: Move handling of missing page into one place in ↵xfs-4.12-fixes-2Jan Kara
xfs_find_get_desired_pgoff() Currently several places in xfs_find_get_desired_pgoff() handle the case of a missing page. Make them all handled in one place after the loop has terminated. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25xfs: Fix off-by-in in loop termination in xfs_find_get_desired_pgoff()Jan Kara
There is an off-by-one error in loop termination conditions in xfs_find_get_desired_pgoff() since 'end' may index a page beyond end of desired range if 'endoff' is page aligned. It doesn't have any visible effects but still it is good to fix it. Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25xfs: Fix missed holes in SEEK_HOLE implementationJan Kara
XFS SEEK_HOLE implementation could miss a hole in an unwritten extent as can be seen by the following command: xfs_io -c "falloc 0 256k" -c "pwrite 0 56k" -c "pwrite 128k 8k" -c "seek -h 0" file wrote 57344/57344 bytes at offset 0 56 KiB, 14 ops; 0.0000 sec (49.312 MiB/sec and 12623.9856 ops/sec) wrote 8192/8192 bytes at offset 131072 8 KiB, 2 ops; 0.0000 sec (70.383 MiB/sec and 18018.0180 ops/sec) Whence Result HOLE 139264 Where we can see that hole at offset 56k was just ignored by SEEK_HOLE implementation. The bug is in xfs_find_get_desired_pgoff() which does not properly detect the case when pages are not contiguous. Fix the problem by properly detecting when found page has larger offset than expected. CC: stable@vger.kernel.org Fixes: d126d43f631f996daeee5006714fed914be32368 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25xfs: fix off-by-one on max nr_pages in xfs_find_get_desired_pgoff()Eryu Guan
xfs_find_get_desired_pgoff() is used to search for offset of hole or data in page range [index, end] (both inclusive), and the max number of pages to search should be at least one, if end == index. Otherwise the only page is missed and no hole or data is found, which is not correct. When block size is smaller than page size, this can be demonstrated by preallocating a file with size smaller than page size and writing data to the last block. E.g. run this xfs_io command on a 1k block size XFS on x86_64 host. # xfs_io -fc "falloc 0 3k" -c "pwrite 2k 1k" \ -c "seek -d 0" /mnt/xfs/testfile wrote 1024/1024 bytes at offset 2048 1 KiB, 1 ops; 0.0000 sec (33.675 MiB/sec and 34482.7586 ops/sec) Whence Result DATA EOF Data at offset 2k was missed, and lseek(2) returned ENXIO. This is uncovered by generic/285 subtest 07 and 08 on ppc64 host, where pagesize is 64k. Because a recent change to generic/285 reduced the preallocated file size to smaller than 64k. Cc: stable@vger.kernel.org # v3.7+ Signed-off-by: Eryu Guan <eguan@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-25xfs: fix unaligned access in xfs_btree_visit_blocksEric Sandeen
This structure copy was throwing unaligned access warnings on sparc64: Kernel unaligned access at TPC[1043c088] xfs_btree_visit_blocks+0x88/0xe0 [xfs] xfs_btree_copy_ptrs does a memcpy, which avoids it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-05-19xfs: avoid mount-time deadlock in CoW extent recoveryxfs-4.12-fixes-1Darrick J. Wong
If a malicious user corrupts the refcount btree to cause a cycle between different levels of the tree, the next mount attempt will deadlock in the CoW recovery routine while grabbing buffer locks. We can use the ability to re-grab a buffer that was previous locked to a transaction to avoid deadlocks, so do that here. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>