summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_mount.h
AgeCommit message (Collapse)Author
2021-10-19xfs: compute maximum AG btree height for critical reservation calculationDarrick J. Wong
Compute the actual maximum AG btree height for deciding if a per-AG block reservation is critically low. This only affects the sanity check condition, since we /generally/ will trigger on the 10% threshold. This is a long-winded way of saying that we're removing one more usage of XFS_BTREE_MAXLEVELS. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-10-19xfs: rename m_ag_maxlevels to m_allocbt_maxlevelsDarrick J. Wong
Years ago when XFS was thought to be much more simple, we introduced m_ag_maxlevels to specify the maximum btree height of per-AG btrees for a given filesystem mount. Then we observed that inode btrees don't actually have the same height and split that off; and now we have rmap and refcount btrees with much different geometries and separate maxlevels variables. The 'ag' part of the name doesn't make much sense anymore, so rename this to m_alloc_maxlevels to reinforce that this is the maximum height of the *free space* btrees. This sets us up for the next patch, which will add a variable to track the maximum height of all AG btrees. (Also take the opportunity to improve adjacent comments and fix minor style problems.) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-19xfs: replace XFS_FORCED_SHUTDOWN with xfs_is_shutdownDave Chinner
Remove the shouty macro and instead use the inline function that matches other state/feature check wrapper naming. This conversion was done with sed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert remaining mount flags to state flagsDave Chinner
The remaining mount flags kept in m_flags are actually runtime state flags. These change dynamically, so they really should be updated atomically so we don't potentially lose an update due to racing modifications. Convert these remaining flags to be stored in m_opstate and use atomic bitops to set and clear the flags. This also adds a couple of simple wrappers for common state checks - read only and shutdown. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: convert mount flags to featuresDave Chinner
Replace m_flags feature checks with xfs_has_<feature>() calls and rework the setup code to set flags in m_features. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: consolidate mount option features in m_featuresDave Chinner
This provides separation of mount time feature flags from runtime mount flags and mount option state. It also makes the feature checks use the same interface as the superblock features. i.e. we don't care if the feature is enabled by superblock flags or mount options, we just care if it's enabled or not. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: replace xfs_sb_version checks with feature flag checksDave Chinner
Convert the xfs_sb_version_hasfoo() to checks against mp->m_features. Checks of the superblock itself during disk operations (e.g. in the read/write verifiers and the to/from disk formatters) are not converted - they operate purely on the superblock state. Everything else should use the mount features. Large parts of this conversion were done with sed with commands like this: for f in `git grep -l xfs_sb_version_has fs/xfs/*.c`; do sed -i -e 's/xfs_sb_version_has\(.*\)(&\(.*\)->m_sb)/xfs_has_\1(\2)/' $f done With manual cleanups for things like "xfs_has_extflgbit" and other little inconsistencies in naming. The result is ia lot less typing to check features and an XFS binary size reduced by a bit over 3kB: $ size -t fs/xfs/built-in.a text data bss dec hex filenam before 1130866 311352 484 1442702 16038e (TOTALS) after 1127727 311352 484 1439563 15f74b (TOTALS) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-19xfs: reflect sb features in xfs_mountDave Chinner
Currently on-disk feature checks require decoding the superblock fileds and so can be non-trivial. We have almost 400 hundred individual feature checks in the XFS code, so this is a significant amount of code. To reduce runtime check overhead, pre-process all the version flags into a features field in the xfs_mount at mount time so we can convert all the feature checks to a simple flag check. There is also a need to convert the dynamic feature flags to update the m_features field. This is required for attr, attr2 and quota features. New xfs_mount based wrappers are added for this. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-18xfs: add trace point for fs shutdownDarrick J. Wong
Add a tracepoint for fs shutdowns so we can capture that in ftrace output. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-08-16xfs: move the CIL workqueue to the CILDave Chinner
We only use the CIL workqueue in the CIL, so it makes no sense to hang it off the xfs_mount and have to walk multiple pointers back up to the mount when we have the CIL structures right there. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-09xfs: allow setting and clearing of log incompat feature flagsDarrick J. Wong
Log incompat feature flags in the superblock exist for one purpose: to protect the contents of a dirty log from replay on a kernel that isn't prepared to handle those dirty contents. This means that they can be cleared if (a) we know the log is clean and (b) we know that there aren't any other threads in the system that might be setting or relying upon a log incompat flag. Therefore, clear the log incompat flags when we've finished recovering the log, when we're unmounting cleanly, remounting read-only, or freezing; and provide a function so that subsequent patches can start using this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
2021-08-09xfs: throttle inode inactivation queuing on memory reclaimDarrick J. Wong
Now that we defer inode inactivation, we've decoupled the process of unlinking or closing an inode from the process of inactivating it. In theory this should lead to better throughput since we now inactivate the queued inodes in batches instead of one at a time. Unfortunately, one of the primary risks with this decoupling is the loss of rate control feedback between the frontend and background threads. In other words, a rm -rf /* thread can run the system out of memory if it can queue inodes for inactivation and jump to a new CPU faster than the background threads can actually clear the deferred work. The workers can get scheduled off the CPU if they have to do IO, etc. To solve this problem, we configure a shrinker so that it will activate the /second/ time the shrinkers are called. The custom shrinker will queue all percpu deferred inactivation workers immediately and set a flag to force frontend callers who are releasing a vfs inode to wait for the inactivation workers. On my test VM with 560M of RAM and a 2TB filesystem, this seems to solve most of the OOMing problem when deleting 10 million inodes. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-09xfs: don't run speculative preallocation gc when fs is frozenDarrick J. Wong
Now that we have the infrastructure to switch background workers on and off at will, fix the block gc worker code so that we don't actually run the worker when the filesystem is frozen, same as we do for deferred inactivation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-09xfs: queue inactivation immediately when free realtime extents are tightDarrick J. Wong
Now that we have made the inactivation of unlinked inodes a background task to increase the throughput of file deletions, we need to be a little more careful about how long of a delay we can tolerate. Similar to the patch doing this for free space on the data device, if the file being inactivated is a realtime file and the realtime volume is running low on free extents, we want to run the worker ASAP so that the realtime allocator can make better decisions. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-09xfs: queue inactivation immediately when free space is tightDarrick J. Wong
Now that we have made the inactivation of unlinked inodes a background task to increase the throughput of file deletions, we need to be a little more careful about how long of a delay we can tolerate. On a mostly empty filesystem, the risk of the allocator making poor decisions due to fragmentation of the free space on account a lengthy delay in background updates is minimal because there's plenty of space. However, if free space is tight, we want to deallocate unlinked inodes as quickly as possible to avoid fallocate ENOSPC and to give the allocator the best shot at optimal allocations for new writes. Therefore, queue the percpu worker immediately if the filesystem is more than 95% full. This follows the same principle that XFS becomes less aggressive about speculative allocations and lazy cleanup (and more precise about accounting) when nearing full. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2021-08-06xfs: per-cpu deferred inode inactivation queuesDave Chinner
Move inode inactivation to background work contexts so that it no longer runs in the context that releases the final reference to an inode. This will allow process work that ends up blocking on inactivation to continue doing work while the filesytem processes the inactivation in the background. A typical demonstration of this is unlinking an inode with lots of extents. The extents are removed during inactivation, so this blocks the process that unlinked the inode from the directory structure. By moving the inactivation to the background process, the userspace applicaiton can keep working (e.g. unlinking the next inode in the directory) while the inactivation work on the previous inode is done by a different CPU. The implementation of the queue is relatively simple. We use a per-cpu lockless linked list (llist) to queue inodes for inactivation without requiring serialisation mechanisms, and a work item to allow the queue to be processed by a CPU bound worker thread. We also keep a count of the queue depth so that we can trigger work after a number of deferred inactivations have been queued. The use of a bound workqueue with a single work depth allows the workqueue to run one work item per CPU. We queue the work item on the CPU we are currently running on, and so this essentially gives us affine per-cpu worker threads for the per-cpu queues. THis maintains the effective CPU affinity that occurs within XFS at the AG level due to all objects in a directory being local to an AG. Hence inactivation work tends to run on the same CPU that last accessed all the objects that inactivation accesses and this maintains hot CPU caches for unlink workloads. A depth of 32 inodes was chosen to match the number of inodes in an inode cluster buffer. This hopefully allows sequential allocation/unlink behaviours to defering inactivation of all the inodes in a single cluster buffer at a time, further helping maintain hot CPU and buffer cache accesses while running inactivations. A hard per-cpu queue throttle of 256 inode has been set to avoid runaway queuing when inodes that take a long to time inactivate are being processed. For example, when unlinking inodes with large numbers of extents that can take a lot of processing to free. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: tweak comments and tracepoints, convert opflags to state bits] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-08-06xfs: introduce all-mounts list for cpu hotplug notificationsDave Chinner
The inode inactivation and CIL tracking percpu structures are per-xfs_mount structures. That means when we get a CPU dead notification, we need to then iterate all the per-cpu structure instances to process them. Rather than keeping linked lists of per-cpu structures in each subsystem, add a list of all xfs_mounts that the generic xfs_cpu_dead() function will iterate and call into each subsystem appropriately. This allows us to handle both per-mount and global XFS percpu state from xfs_cpu_dead(), and avoids the need to link subsystem structures that can be easily found from the xfs_mount into their own global lists. Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: expand some comments about mount list setup ordering rules] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: move perag structure and setup to libxfs/xfs_ag.[ch]Dave Chinner
Move the xfs_perag infrastructure to the libxfs files that contain all the per AG infrastructure. This helps set up for passing perags around all the code instead of bare agnos with minimal extra includes for existing files. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-06-02xfs: prepare for moving perag definitions and support to libxfsDave Chinner
The perag structures really need to be defined with the rest of the AG support infrastructure. The struct xfs_perag and init/teardown has been placed in xfs_mount.[ch] because there are differences in the structure between kernel and userspace. Mainly that userspace doesn't have a lot of the internal stuff that the kernel has for caches and discard and other such structures. However, it makes more sense to move this to libxfs than to keep this separation because we are now moving to use struct perags everywhere in the code instead of passing raw agnumber_t values about. Hence we shoudl really move the support infrastructure to libxfs/xfs_ag.[ch]. To do this without breaking userspace, first we need to rearrange the structures and code so that all the kernel specific code is located together. This makes it simple for userspace to ifdef out the all the parts it does not need, minimising the code differences between kernel and userspace. The next commit will do the move... Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
2021-04-29xfs: introduce in-core global counter of allocbt blocksBrian Foster
Introduce an in-core counter to track the sum of all allocbt blocks used by the filesystem. This value is currently tracked per-ag via the ->agf_btreeblks field in the AGF, which also happens to include rmapbt blocks. A global, in-core count of allocbt blocks is required to identify the subset of global ->m_fdblocks that consists of unavailable blocks currently used for allocation btrees. To support this calculation at block reservation time, construct a similar global counter for allocbt blocks, populate it on first read of each AGF and update it as allocbt blocks are used and released. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2021-03-25xfs: rename the blockgc workqueueDarrick J. Wong
Since we're about to start using the blockgc workqueue to dispose of inactivated inodes, strip the "block" prefix from the name; now it's merely the general garbage collection (gc) workqueue. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: parallelize block preallocation garbage collectionDarrick J. Wong
Split the block preallocation garbage collection work into per-AG work items so that we can take advantage of parallelization. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-02-03xfs: consolidate the eofblocks and cowblocks workersDarrick J. Wong
Remove the separate cowblocks work items and knob so that we can control and run everything from a single blockgc work queue. Note that the speculative_prealloc_lifetime sysfs knob retains its historical name even though the functions move to prefix xfs_blockgc_*. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>
2021-01-22xfs: fold sbcount quiesce logging into log coveringBrian Foster
xfs_log_sbcount() calls xfs_sync_sb() to sync superblock counters to disk when lazy superblock accounting is enabled. This occurs on unmount, freeze, and read-only (re)mount and ensures the final values are calculated and persisted to disk before each form of quiesce completes. Now that log covering occurs in all of these contexts and uses the same xfs_sync_sb() mechanism to update log state, there is no need to log the superblock separately for any reason. Update the log quiesce path to sync the superblock at least once for any mount where lazy superblock accounting is enabled. If the log is already covered, it will remain in the covered state. Otherwise, the next sync as part of the normal covering sequence will carry the associated superblock update with it. Remove xfs_log_sbcount() now that it is no longer needed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
2020-09-15xfs: remove xfs_getsbChristoph Hellwig
Merge xfs_getsb into its only caller, and clean that one up a little bit as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-07-07xfs: allow multiple reclaimers per AGDave Chinner
Inode reclaim will still throttle direct reclaim on the per-ag reclaim locks. This is no longer necessary as reclaim can run non-blocking now. Hence we can remove these locks so that we don't arbitrarily block reclaimers just because there are more direct reclaimers than there are AGs. This can result in multiple reclaimers working on the same range of an AG, but this doesn't cause any apparent issues. Optimising the spread of concurrent reclaimers for best efficiency can be done in a future patchset. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: remove the m_active_trans counterDave Chinner
It's a global atomic counter, and we are hitting it at a rate of half a million transactions a second, so it's bouncing the counter cacheline all over the place on large machines. We don't actually need it anymore - it used to be required because the VFS freeze code could not track/prevent filesystem transactions that were running, but that problem no longer exists. Hence to remove the counter, we simply have to ensure that nothing calls xfs_sync_sb() while we are trying to quiesce the filesytem. That only happens if the log worker is still running when we call xfs_quiesce_attr(). The log worker is cancelled at the end of xfs_quiesce_attr() by calling xfs_log_quiesce(), so just call it early here and then we can remove the counter altogether. Concurrent create, 50 million inodes, identical 16p/16GB virtual machines on different physical hosts. Machine A has twice the CPU cores per socket of machine B: unpatched patched machine A: 3m16s 2m00s machine B: 4m04s 4m05s Create rates: unpatched patched machine A: 282k+/-31k 468k+/-21k machine B: 231k+/-8k 233k+/-11k Concurrent rm of same 50 million inodes: unpatched patched machine A: 6m42s 2m33s machine B: 4m47s 4m47s The transaction rate on the fast machine went from just under 300k/sec to 700k/sec, which indicates just how much of a bottleneck this atomic counter was. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: separate read-only variables in struct xfs_mountDave Chinner
Seeing massive cpu usage from xfs_agino_range() on one machine; instruction level profiles look similar to another machine running the same workload, only one machine is consuming 10x as much CPU as the other and going much slower. The only real difference between the two machines is core count per socket. Both are running identical 16p/16GB virtual machine configurations Machine A: 25.83% [k] xfs_agino_range 12.68% [k] __xfs_dir3_data_check 6.95% [k] xfs_verify_ino 6.78% [k] xfs_dir2_data_entry_tag_p 3.56% [k] xfs_buf_find 2.31% [k] xfs_verify_dir_ino 2.02% [k] xfs_dabuf_map.constprop.0 1.65% [k] xfs_ag_block_count And takes around 13 minutes to remove 50 million inodes. Machine B: 13.90% [k] __pv_queued_spin_lock_slowpath 3.76% [k] do_raw_spin_lock 2.83% [k] xfs_dir3_leaf_check_int 2.75% [k] xfs_agino_range 2.51% [k] __raw_callee_save___pv_queued_spin_unlock 2.18% [k] __xfs_dir3_data_check 2.02% [k] xfs_log_commit_cil And takes around 5m30s to remove 50 million inodes. Suspect is cacheline contention on m_sectbb_log which is used in one of the macros in xfs_agino_range. This is a read-only variable but shares a cacheline with m_active_trans which is a global atomic that gets bounced all around the machine. The workload is trying to run hundreds of thousands of transactions per second and hence cacheline contention will be occurring on this atomic counter. Hence xfs_agino_range() is likely just be an innocent bystander as the cache coherency protocol fights over the cacheline between CPU cores and sockets. On machine A, this rearrangement of the struct xfs_mount results in the profile changing to: 9.77% [kernel] [k] xfs_agino_range 6.27% [kernel] [k] __xfs_dir3_data_check 5.31% [kernel] [k] __pv_queued_spin_lock_slowpath 4.54% [kernel] [k] xfs_buf_find 3.79% [kernel] [k] do_raw_spin_lock 3.39% [kernel] [k] xfs_verify_ino 2.73% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock Vastly less CPU usage in xfs_agino_range(), but still 3x the amount of machine B and still runs substantially slower than it should. Current rm -rf of 50 million files: vanilla patched machine A 13m20s 6m42s machine B 5m30s 5m02s It's an improvement, hence indicating that separation and further optimisation of read-only global filesystem data is worthwhile, but it clearly isn't the underlying issue causing this specific performance degradation. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-27xfs: reduce free inode accounting overheadDave Chinner
Shaokun Zhang reported that XFS was using substantial CPU time in percpu_count_sum() when running a single threaded benchmark on a high CPU count (128p) machine from xfs_mod_ifree(). The issue is that the filesystem is empty when the benchmark runs, so inode allocation is running with a very low inode free count. With the percpu counter batching, this means comparisons when the counter is less that 128 * 256 = 32768 use the slow path of adding up all the counters across the CPUs, and this is expensive on high CPU count machines. The summing in xfs_mod_ifree() is only used to fire an assert if an underrun occurs. The error is ignored by the higher level code. Hence this is really just debug code and we don't need to run it on production kernels, nor do we need such debug checks to return error values just to trigger an assert. Finally, xfs_mod_icount/xfs_mod_ifree are only called from xfs_trans_unreserve_and_mod_sb(), so get rid of them and just directly call the percpu_counter_add/percpu_counter_compare functions. The compare functions are now run only on debug builds as they are internal to ASSERT() checks and so only compiled in when ASSERTs are active (CONFIG_XFS_DEBUG=y or CONFIG_XFS_WARN=y). Reported-by: Shaokun Zhang <zhangshaokun@hisilicon.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-07xfs: remove unused shutdown typesBrian Foster
Both types control shutdown messaging and neither is used in the current codebase. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04fs/xfs: Make DAX mount option a tri-stateIra Weiny
As agreed upon[1]. We make the dax mount option a tri-state. '-o dax' continues to operate the same. We add 'always', 'never', and 'inode' (default). [1] https://lore.kernel.org/lkml/20200405061945.GA94792@iweiny-DESK2.sc.intel.com/ Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-05-04fs/xfs: Change XFS_MOUNT_DAX to XFS_MOUNT_DAX_ALWAYSIra Weiny
In prep for the new tri-state mount option which then introduces XFS_MOUNT_DAX_NEVER. Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2020-04-16xfs: move inode flush to the sync workqueuexfs-5.7-fixes-3Darrick J. Wong
Move the inode dirty data flushing to a workqueue so that multiple threads can take advantage of a single thread's flushing work. The ratelimiting technique used in bdd4ee4 was not successful, because threads that skipped the inode flush scan due to ratelimiting would ENOSPC early, which caused occasional (but noticeable) changes in behavior and sporadic fstest regressions. Therefore, make all the writer threads wait on a single inode flush, which eliminates both the stampeding hordes of flushers and the small window in which a write could fail with ENOSPC because it lost the ratelimit race after even another thread freed space. Fixes: c6425702f21e ("xfs: ratelimit inode flush on buffered write ENOSPC") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2020-03-31xfs: ratelimit inode flush on buffered write ENOSPCDarrick J. Wong
A customer reported rcu stalls and softlockup warnings on a computer with many CPU cores and many many more IO threads trying to write to a filesystem that is totally out of space. Subsequent analysis pointed to the many many IO threads calling xfs_flush_inodes -> sync_inodes_sb, which causes a lot of wb_writeback_work to be queued. The writeback worker spends so much time trying to wake the many many threads waiting for writeback completion that it trips the softlockup detector, and (in this case) the system automatically reboots. In addition, they complain that the lengthy xfs_flush_inodes scan traps all of those threads in uninterruptible sleep, which hampers their ability to kill the program or do anything else to escape the situation. If there's thousands of threads trying to write to files on a full filesystem, each of those threads will start separate copies of the inode flush scan. This is kind of pointless since we only need one scan, so rate limit the inode flush. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
2019-11-13xfs: remove unused structure members & simple typedefsEric Sandeen
Remove some unused typedef'd simple types, and some unused structure members. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13xfs: devirtualize ->m_dirnameopsChristoph Hellwig
Instead of causing a relatively expensive indirect call for each hashing and comparism of a file name in a directory just use an inline function and a simple branch on the ASCII CI bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: fix unused variable warning] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-13xfs: remove the unused m_chsize fieldChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10xfs: remove the now unused dir ops infrastructureChristoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-10xfs: move the node header size to struct xfs_da_geometryChristoph Hellwig
Move the node header size field to struct xfs_da_geometry, and remove the now unused non-directory dir ops infrastructure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05xfs: use super s_id instead of struct xfs_mount m_fsnameIan Kent
Eliminate struct xfs_mount field m_fsname by using the super block s_id field directly. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-11-05xfs: remove unused struct xfs_mount field m_fsname_lenIan Kent
The struct xfs_mount field m_fsname_len is not used anywhere, remove it. Signed-off-by: Ian Kent <raven@themaw.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: reverse the polarity of XFS_MOUNT_COMPAT_IOSIZEChristoph Hellwig
Replace XFS_MOUNT_COMPAT_IOSIZE with an inverted XFS_MOUNT_LARGEIO flag that makes the usage more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: rename the XFS_MOUNT_DFLT_IOSIZE option toChristoph Hellwig
Make the flag match the mount option and usage. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: simplify parsing of allocsize mount optionChristoph Hellwig
Rework xfs_parseargs to fill out the default value and then parse the option directly into the mount structure, similar to what we do for other updates, and open code the now trivial updates based on on the on-disk superblock directly into xfs_mountfs. Note that this change rejects the allocsize=0 mount option that has been documented as invalid for a long time instead of just ignoring it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: rename the m_writeio_* fields in struct xfs_mountChristoph Hellwig
Use the allocsize name to match the mount option and usage instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: remove the m_readio_* fields in struct xfs_mountChristoph Hellwig
m_readio_blocks is entirely unused, and m_readio_blocks is only used in xfs_stat_blksize in a max statements that is a no-op as it always has the same value as m_writeio_log. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: don't use a different allocsice for -o wsyncChristoph Hellwig
The -o wsync allocsize overwrite overwrite was part of a special hack for NFSv2 servers in IRIX and has no real purpose in modern Linux, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-10-29xfs: cleanup calculating the stat optimal I/O sizeChristoph Hellwig
Move xfs_preferred_iosize to xfs_iops.c, unobsfucate it and also handle the realtime special case in the helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-08-26xfs: add kmem allocation trace pointsDave Chinner
When trying to correlate XFS kernel allocations to memory reclaim behaviour, it is useful to know what allocations XFS is actually attempting. This information is not directly available from tracepoints in the generic memory allocation and reclaim tracepoints, so these new trace points provide a high level indication of what the XFS memory demand actually is. There is no per-filesystem context in this code, so we just trace the type of allocation, the size and the allocation constraints. The kmem code also doesn't include much of the common XFS headers, so there are a few definitions that need to be added to the trace headers and a couple of types that need to be made common to avoid needing to include the whole world in the kmem code. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2019-06-28xfs: move the log ioend workqueue to struct xlogChristoph Hellwig
Move the workqueue used for log I/O completions from struct xfs_mount to struct xlog to keep it self contained in the log code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: destroy the log workqueue after ensuring log ios are done] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>