summaryrefslogtreecommitdiff
path: root/fs/xfs/xfs_trace.h
AgeCommit message (Collapse)Author
2017-02-24mm,fs,dax: change ->pmd_fault to ->huge_faultDave Jiang
Patch series "1G transparent hugepage support for device dax", v2. The following series implements support for 1G trasparent hugepage on x86 for device dax. The bulk of the code was written by Mathew Wilcox a while back supporting transparent 1G hugepage for fs DAX. I have forward ported the relevant bits to 4.10-rc. The current submission has only the necessary code to support device DAX. Comments from Dan Williams: So the motivation and intended user of this functionality mirrors the motivation and users of 1GB page support in hugetlbfs. Given expected capacities of persistent memory devices an in-memory database may want to reduce tlb pressure beyond what they can already achieve with 2MB mappings of a device-dax file. We have customer feedback to that effect as Willy mentioned in his previous version of these patches [1]. [1]: https://lkml.org/lkml/2016/1/31/52 Comments from Nilesh @ Oracle: There are applications which have a process model; and if you assume 10,000 processes attempting to mmap all the 6TB memory available on a server; we are looking at the following: processes : 10,000 memory : 6TB pte @ 4k page size: 8 bytes / 4K of memory * #processes = 6TB / 4k * 8 * 10000 = 1.5GB * 80000 = 120,000GB pmd @ 2M page size: 120,000 / 512 = ~240GB pud @ 1G page size: 240GB / 512 = ~480MB As you can see with 2M pages, this system will use up an exorbitant amount of DRAM to hold the page tables; but the 1G pages finally brings it down to a reasonable level. Memory sizes will keep increasing; so this number will keep increasing. An argument can be made to convert the applications from process model to thread model, but in the real world that may not be always practical. Hopefully this helps explain the use case where this is valuable. This patch (of 3): In preparation for adding the ability to handle PUD pages, convert vm_operations_struct.pmd_fault to vm_operations_struct.huge_fault. The vm_fault structure is extended to include a union of the different page table pointers that may be needed, and three flag bits are reserved to indicate which type of pointer is in the union. [ross.zwisler@linux.intel.com: remove unused function ext4_dax_huge_fault()] Link: http://lkml.kernel.org/r/1485813172-7284-1-git-send-email-ross.zwisler@linux.intel.com [dave.jiang@intel.com: clear PMD or PUD size flags when in fall through path] Link: http://lkml.kernel.org/r/148589842696.5820.16078080610311444794.stgit@djiang5-desk3.ch.intel.com Link: http://lkml.kernel.org/r/148545058784.17912.6353162518188733642.stgit@djiang5-desk3.ch.intel.com Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com> Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Nilesh Choudhury <nilesh.choudhury@oracle.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-06xfs: allocate direct I/O COW blocks in iomap_beginChristoph Hellwig
Instead of preallocating all the required COW blocks in the high-level write code do it inside the iomap code, like we do for all other I/O. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-06xfs: reject all unaligned direct writes to reflinked filesChristoph Hellwig
We currently fall back from direct to buffered writes if we detect a remaining shared extent in the iomap_begin callback. But by the time iomap_begin is called for the potentially unaligned end block we might have already written most of the data to disk, which we'd now write again using buffered I/O. To avoid this reject all writes to reflinked files before starting I/O so that we are guaranteed to only write the data once. The alternative would be to unshare the unaligned start and/or end block before doing the I/O. I think that's doable, and will actually be required to support reflinks on DAX file system. But it will take a little more time and I'd rather get rid of the double write ASAP. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-02-02xfs: mark speculative prealloc CoW fork extents unwrittenDarrick J. Wong
Christoph Hellwig pointed out that there's a potentially nasty race when performing simultaneous nearby directio cow writes: "Thread 1 writes a range from B to c " B --------- C p "a little later thread 2 writes from A to B " A --------- B p [editor's note: the 'p' denote cowextsize boundaries, which I added to make this more clear] "but the code preallocates beyond B into the range where thread "1 has just written, but ->end_io hasn't been called yet. "But once ->end_io is called thread 2 has already allocated "up to the extent size hint into the write range of thread 1, "so the end_io handler will splice the unintialized blocks from "that preallocation back into the file right after B." We can avoid this race by ensuring that thread 1 cannot accidentally remap the blocks that thread 2 allocated (as part of speculative preallocation) as part of t2's write preparation in t1's end_io handler. The way we make this happen is by taking advantage of the unwritten extent flag as an intermediate step. Recall that when we begin the process of writing data to shared blocks, we create a delayed allocation extent in the CoW fork: D: --RRRRRRSSSRRRRRRRR--- C: ------DDDDDDD--------- When a thread prepares to CoW some dirty data out to disk, it will now convert the delalloc reservation into an /unwritten/ allocated extent in the cow fork. The da conversion code tries to opportunistically allocate as much of a (speculatively prealloc'd) extent as possible, so we may end up allocating a larger extent than we're actually writing out: D: --RRRRRRSSSRRRRRRRR--- U: ------UUUUUUU--------- Next, we convert only the part of the extent that we're actively planning to write to normal (i.e. not unwritten) status: D: --RRRRRRSSSRRRRRRRR--- U: ------UURRUUU--------- If the write succeeds, the end_cow function will now scan the relevant range of the CoW fork for real extents and remap only the real extents into the data fork: D: --RRRRRRRRSRRRRRRRR--- U: ------UU--UUU--------- This ensures that we never obliterate valid data fork extents with unwritten blocks from the CoW fork. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-01-30xfs: remove unused struct declarationsEric Sandeen
After scratching my head looking for "xfs_busy_extent" I realized it's not used; it's xfs_extent_busy, and the declaration for the other name is bogus. Remove that and a few others as well. (struct xfs_log_callback is used, but the 2nd declaration is unnecessary). Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2016-12-09xfs: nuke unused tracepoint definitionsEric Sandeen
This is all unused code, so remove it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20xfs: optimize xfs_reflink_end_cowChristoph Hellwig
Instead of doing a full extent list search for each extent that is to be deleted using xfs_bmapi_read and then doing another one inside of xfs_bunmapi_cow use the same scheme that xfs_bumapi uses: look up the last extent to be deleted and then use the extent index to walk downward until we are outside the range to be deleted. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-20xfs: optimize writes to reflink filesChristoph Hellwig
Instead of reserving space as the first thing in write_begin move it past reading the extent in the data fork. That way we only have to read from the data fork once and can reuse that information for trimming the extent to the shared/unshared boundary. Additionally this allows to easily limit the actual write size to said boundary, and avoid a roundtrip on the ilock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-10-13Merge tag 'xfs-reflink-for-linus-4.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs < XFS has gained super CoW powers! > ---------------------------------- \ ^__^ \ (oo)\_______ (__)\ )\/\ ||----w | || || Pull XFS support for shared data extents from Dave Chinner: "This is the second part of the XFS updates for this merge cycle. This pullreq contains the new shared data extents feature for XFS. Given the complexity and size of this change I am expecting - like the addition of reverse mapping last cycle - that there will be some follow-up bug fixes and cleanups around the -rc3 stage for issues that I'm sure will show up once the code hits a wider userbase. What it is: At the most basic level we are simply adding shared data extents to XFS - i.e. a single extent on disk can now have multiple owners. To do this we have to add new on-disk features to both track the shared extents and the number of times they've been shared. This is done by the new "refcount" btree that sits in every allocation group. When we share or unshare an extent, this tree gets updated. Along with this new tree, the reverse mapping tree needs to be updated to track each owner or a shared extent. This also needs to be updated ever share/unshare operation. These interactions at extent allocation and freeing time have complex ordering and recovery constraints, so there's a significant amount of new intent-based transaction code to ensure that operations are performed atomically from both the runtime and integrity/crash recovery perspectives. We also need to break sharing when writes hit a shared extent - this is where the new copy-on-write implementation comes in. We allocate new storage and copy the original data along with the overwrite data into the new location. We only do this for data as we don't share metadata at all - each inode has it's own metadata that tracks the shared data extents, the extents undergoing CoW and it's own private extents. Of course, being XFS, nothing is simple - we use delayed allocation for CoW similar to how we use it for normal writes. ENOSPC is a significant issue here - we build on the reservation code added in 4.8-rc1 with the reverse mapping feature to ensure we don't get spurious ENOSPC issues part way through a CoW operation. These mechanisms also help minimise fragmentation due to repeated CoW operations. To further reduce fragmentation overhead, we've also introduced a CoW extent size hint, which indicates how large a region we should allocate when we execute a CoW operation. With all this functionality in place, we can hook up .copy_file_range, .clone_file_range and .dedupe_file_range and we gain all the capabilities of reflink and other vfs provided functionality that enable manipulation to shared extents. We also added a fallocate mode that explicitly unshares a range of a file, which we implemented as an explicit CoW of all the shared extents in a file. As such, it's a huge chunk of new functionality with new on-disk format features and internal infrastructure. It warns at mount time as an experimental feature and that it may eat data (as we do with all new on-disk features until they stabilise). We have not released userspace suport for it yet - userspace support currently requires download from Darrick's xfsprogs repo and build from source, so the access to this feature is really developer/tester only at this point. Initial userspace support will be released at the same time the kernel with this code in it is released. The new code causes 5-6 new failures with xfstests - these aren't serious functional failures but things the output of tests changing slightly due to perturbations in layouts, space usage, etc. OTOH, we've added 150+ new tests to xfstests that specifically exercise this new functionality so it's got far better test coverage than any functionality we've previously added to XFS. Darrick has done a pretty amazing job getting us to this stage, and special mention also needs to go to Christoph (review, testing, improvements and bug fixes) and Brian (caught several intricate bugs during review) for the effort they've also put in. Summary: - unshare range (FALLOC_FL_UNSHARE) support for fallocate - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface - shared extent support for XFS - copy-on-write support for shared extents - copy_file_range support - clone_file_range support (implements reflink) - dedupe_file_range support - defrag support for reverse mapping enabled filesystems" * tag 'xfs-reflink-for-linus-4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (71 commits) xfs: convert COW blocks to real blocks before unwritten extent conversion xfs: rework refcount cow recovery error handling xfs: clear reflink flag if setting realtime flag xfs: fix error initialization xfs: fix label inaccuracies xfs: remove isize check from unshare operation xfs: reduce stack usage of _reflink_clear_inode_flag xfs: check inode reflink flag before calling reflink functions xfs: implement swapext for rmap filesystems xfs: refactor swapext code xfs: various swapext cleanups xfs: recognize the reflink feature bit xfs: simulate per-AG reservations being critically low xfs: don't mix reflink and DAX mode for now xfs: check for invalid inode reflink flags xfs: set a default CoW extent size of 32 blocks xfs: convert unwritten status of reverse mappings for shared files xfs: use interval query for rmap alloc operations on shared files xfs: add shared rmap map/unmap/convert log item types xfs: increase log reservations for reflink ...
2016-10-07Merge branch 'work.splice_read' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS splice updates from Al Viro: "There's a bunch of branches this cycle, both mine and from other folks and I'd rather send pull requests separately. This one is the conversion of ->splice_read() to ITER_PIPE iov_iter (and introduction of such). Gets rid of a lot of code in fs/splice.c and elsewhere; there will be followups, but these are for the next cycle... Some pipe/splice-related cleanups from Miklos in the same branch as well" * 'work.splice_read' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: pipe: fix comment in pipe_buf_operations pipe: add pipe_buf_steal() helper pipe: add pipe_buf_confirm() helper pipe: add pipe_buf_release() helper pipe: add pipe_buf_get() helper relay: simplify relay_file_read() switch default_file_splice_read() to use of pipe-backed iov_iter switch generic_file_splice_read() to use of ->read_iter() new iov_iter flavour: pipe-backed fuse_dev_splice_read(): switch to add_to_pipe() skb_splice_bits(): get rid of callback new helper: add_to_pipe() splice: lift pipe_lock out of splice_to_pipe() splice: switch get_iovec_page_array() to iov_iter splice_to_pipe(): don't open-code wakeup_pipe_readers() consistent treatment of EFAULT on O_DIRECT read/write
2016-10-05xfs: implement swapext for rmap filesystemsDarrick J. Wong
Implement swapext for filesystems that have reverse mapping. Back in the reflink patches, we augmented the bmap code with a 'REMAP' flag that updates only the bmbt and doesn't touch the allocator and implemented log redo items for those two operations. Now we can rewrite extent swapping as a (looong) series of remap operations. This is far less efficient than the fork swapping method implemented in the past, so we only switch this on for rmap. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: use interval query for rmap alloc operations on shared filesDarrick J. Wong
When it's possible for reverse mappings to overlap (data fork extents of files on reflink filesystems), use the interval query function to find the left neighbor of an extent we're trying to add; and be careful to use the lookup functions to update the neighbors and/or add new extents. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: garbage collect old cowextsz reservationsDarrick J. Wong
Trim CoW reservations made on behalf of a cowextsz hint if they get too old or we run low on quota, so long as we don't have dirty data awaiting writeback or directio operations in progress. Garbage collection of the cowextsize extents are kept separate from prealloc extent reaping because setting the CoW prealloc lifetime to a (much) higher value than the regular prealloc extent lifetime has been useful for combatting CoW fragmentation on VM hosts where the VMs experience bursty write behaviors and we can keep the utilization ratios low enough that we don't start to run out of space. IOWs, it benefits us to keep the CoW fork reservations around for as long as we can unless we run out of blocks or hit inode reclaim. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: store in-progress CoW allocations in the refcount btreeDarrick J. Wong
Due to the way the CoW algorithm in XFS works, there's an interval during which blocks allocated to handle a CoW can be lost -- if the FS goes down after the blocks are allocated but before the block remapping takes place. This is exacerbated by the cowextsz hint -- allocated reservations can sit around for a while, waiting to get used. Since the refcount btree doesn't normally store records with refcount of 1, we can use it to record these in-progress extents. In-progress blocks cannot be shared because they're not user-visible, so there shouldn't be any conflicts with other programs. This is a better solution than holding EFIs during writeback because (a) EFIs can't be relogged currently, (b) even if they could, EFIs are bound by available log space, which puts an unnecessary upper bound on how much CoW we can have in flight, and (c) we already have a mechanism to track blocks. At mount time, read the refcount records and free anything we find with a refcount of 1 because those were in-progress when the FS went down. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-05xfs: implement CoW for directio writesDarrick J. Wong
For O_DIRECT writes to shared blocks, we have to CoW them just like we would with buffered writes. For writes that are not block-aligned, just bounce them to the page cache. For block-aligned writes, however, we can do better than that. Use the same mechanisms that we employ for buffered CoW to set up a delalloc reservation, allocate all the blocks at once, issue the writes against the new blocks and use the same ioend functions to remap the blocks after the write. This should be fairly performant. Christoph discovered that xfs_reflink_allocate_cow_range may stumble over invalid entries in the extent array given that it drops the ilock but still expects the index to be stable. Simple fixing it to a new lookup for every iteration still isn't correct given that xfs_bmapi_allocate will trigger a BUG_ON() if hitting a hole, and there is nothing preventing a xfs_bunmapi_cow call removing extents once we dropped the ilock either. This patch duplicates the inner loop of xfs_bmapi_allocate into a helper for xfs_reflink_allocate_cow_range so that it can be done under the same ilock critical section as our CoW fork delayed allocation. The directio CoW warts will be revisited in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2016-10-05switch generic_file_splice_read() to use of ->read_iter()Al Viro
... and kill the ->splice_read() instances that can be switched to it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-10-04xfs: support bmapping delalloc extents in the CoW forkDarrick J. Wong
Allow the creation of delayed allocation extents in the CoW fork. In a subsequent patch we'll wire up iomap_begin to actually do this via reflink helper functions. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: introduce the CoW forkDarrick J. Wong
Introduce a new in-core fork for storing copy-on-write delalloc reservations and allocated extents that are in the process of being written out. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: define tracepoints for reflink activitiesDarrick J. Wong
Define all the tracepoints we need to inspect the runtime operation of reflink/dedupe/copy-on-write. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: implement deferred bmbt map/unmap operationsDarrick J. Wong
Implement deferred versions of the inode block map/unmap functions. These will be used in subsequent patches to make reflink operations atomic. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-04xfs: map an inode's offset to an exact physical blockDarrick J. Wong
Teach the bmap routine to know how to map a range of file blocks to a specific range of physical blocks, instead of simply allocating fresh blocks. This enables reflink to map a file to blocks that are already in use. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: connect refcount adjust functions to upper layersDarrick J. Wong
Plumb in the upper level interface to schedule and finish deferred refcount operations via the deferred ops mechanism. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: log refcount intent itemsDarrick J. Wong
Provide a mechanism for higher levels to create CUI/CUD items, submit them to the log, and a stub function to deal with recovered CUI items. These parts will be connected to the refcountbt in a later patch. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: define the on-disk refcount btree formatDarrick J. Wong
Start constructing the refcount btree implementation by establishing the on-disk format and everything needed to read, write, and manipulate the refcount btree blocks. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03xfs: define tracepoints for refcount btree activitiesDarrick J. Wong
Define all the tracepoints we need to inspect the refcount btree runtime operation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
2016-10-03Merge branch 'xfs-4.9-log-recovery-fixes' into for-nextxfs-for-linus-4.9-rc1Dave Chinner
2016-09-26xfs: log recovery tracepoints to track current lsn and buffer submissionBrian Foster
Log recovery has particular rules around buffer submission along with tricky corner cases where independent transactions can share an LSN. As such, it can be difficult to follow when/why buffers are submitted during recovery. Add a couple tracepoints to post the current LSN of a record when a new record is being processed and when a buffer is being skipped due to LSN ordering. Also, update the recover item class to include the LSN of the current transaction for the item being processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-26xfs: remote attribute blocks aren't really userdataDave Chinner
When adding a new remote attribute, we write the attribute to the new extent before the allocation transaction is committed. This means we cannot reuse busy extents as that violates crash consistency semantics. Hence we currently treat remote attribute extent allocation like userdata because it has the same overwrite ordering constraints as userdata. Unfortunately, this also allows the allocator to incorrectly apply extent size hints to the remote attribute extent allocation. This results in interesting failures, such as transaction block reservation overruns and in-memory inode attribute fork corruption. To fix this, we need to separate the busy extent reuse configuration from the userdata configuration. This changes the definition of XFS_BMAPI_METADATA slightly - it now means that allocation is metadata and reuse of busy extents is acceptible due to the metadata ordering semantics of the journal. If this flag is not set, it means the allocation is that has unordered data writeback, and hence busy extent reuse is not allowed. It no longer implies the allocation is for user data, just that the data write will not be strictly ordered. This matches the semantics for both user data and remote attribute block allocation. As such, This patch changes the "userdata" field to a "datatype" field, and adds a "no busy reuse" flag to the field. When we detect an unordered data extent allocation, we immediately set the no reuse flag. We then set the "user data" flags based on the inode fork we are allocating the extent to. Hence we only set userdata flags on data fork allocations now and consider attribute fork remote extents to be an unordered metadata extent. The result is that remote attribute extents now have the expected allocation semantics, and the data fork allocation behaviour is completely unchanged. It should be noted that there may be other ways to fix this (e.g. use ordered metadata buffers for the remote attribute extent data write) but they are more invasive and difficult to validate both from a design and implementation POV. Hence this patch takes the simple, obvious route to fixing the problem... Reported-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-09-19xfs: set up per-AG free space reservationsDarrick J. Wong
One unfortunate quirk of the reference count and reverse mapping btrees -- they can expand in size when blocks are written to *other* allocation groups if, say, one large extent becomes a lot of tiny extents. Since we don't want to start throwing errors in the middle of CoWing, we need to reserve some blocks to handle future expansion. The transaction block reservation counters aren't sufficient here because we have to have a reserve of blocks in every AG, not just somewhere in the filesystem. Therefore, create two per-AG block reservation pools. One feeds the AGFL so that rmapbt expansion always succeeds, and the other feeds all other metadata so that refcountbt expansion never fails. Use the count of how many reserved blocks we need to have on hand to create a virtual reservation in the AG. Through selective clamping of the maximum length of allocation requests and of the length of the longest free extent, we can make it look like there's less free space in the AG unless the reservation owner is asking for blocks. In other words, play some accounting tricks in-core to make sure that we always have blocks available. On the plus side, there's nothing to clean up if we crash, which is contrast to the strategy that the rough draft used (actually removing extents from the freespace btrees). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-30xfs: track log done items directly in the deferred pending work itemxfs-iomap-for-linus-4.8-rc5Darrick J. Wong
Christoph reports slab corruption when a deferred refcount update aborts during _defer_finish(). The cause of this was broken log item state tracking in xfs_defer_pending -- upon an abort, _defer_trans_abort() will call abort_intent on all intent items, including the ones that have already had a done item attached. This is incorrect because each intent item has 2 refcount: the first is released when the intent item is committed to the log; and the second is released when the _done_ item is committed to the log, or by the intent creator if there is no done item. In other words, once we log the done item, responsibility for releasing the intent item's second refcount is transferred to the done item and /must not/ be performed by anything else. The dfp_committed flag should have been tracking whether or not we had a done item so that _defer_trans_abort could decide if it needs to abort the intent item, but due to a thinko this was not the case. Rip it out and track the done item directly so that we do the right thing w.r.t. intent item freeing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reported-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-17xfs: simplify xfs_file_iomap_beginChristoph Hellwig
We'll never get nimap == 0 for a successful return from xfs_bmapi_read, so don't try to handle it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: convert unwritten status of reverse mappingsDarrick J. Wong
Provide a function to convert an unwritten rmap extent to a real one and vice versa. [ dchinner: Note that this algorithm and code was derived from the existing bmapbt unwritten extent conversion code in xfs_bmap_add_extent_unwritten_real(). ] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add an extent to the rmap btreeDarrick J. Wong
Originally-From: Dave Chinner <dchinner@redhat.com> Now all the btree, free space and transaction infrastructure is in place, we can finally add the code to insert reverse mappings to the rmap btree. Freeing will be done in a separate patch, so just the addition operation can be focussed on here. [darrick: handle owner offsets when adding rmaps] [dchinner: remove remaining debug printk statements] [darrick: move unwritten bit to rm_offset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add tracepoints for the rmap functionsDarrick J. Wong
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add rmap btree operationsDarrick J. Wong
Originally-From: Dave Chinner <dchinner@redhat.com> Implement the generic btree operations needed to manipulate rmap btree blocks. This is very similar to the per-ag freespace btree implementation, and uses the AGFL for allocation and freeing of blocks. Adapt the rmap btree to store owner offsets within each rmap record, and to handle the primary key being redefined as the tuple [agblk, owner, offset]. The expansion of the primary key is crucial to allowing multiple owners per extent. [darrick: adapt the btree ops to deal with offsets] [darrick: remove init_rec_from_key] [darrick: move unwritten bit to rm_offset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: define the on-disk rmap btree formatDarrick J. Wong
Originally-From: Dave Chinner <dchinner@redhat.com> Now we have all the surrounding call infrastructure in place, we can start filling out the rmap btree implementation. Start with the on-disk btree format; add everything needed to read, write and manipulate rmap btree blocks. This prepares the way for adding the btree operations implementation. [darrick: record owner and offset info in rmap btree] [darrick: fork, bmbt and unwritten state in rmap btree] [darrick: flags are a separate field in xfs_rmap_irec] [darrick: calculate maxlevels separately] [darrick: move the 'unwritten' bit into unused parts of rm_offset] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: introduce rmap extent operation stubsDarrick J. Wong
Originally-From: Dave Chinner <dchinner@redhat.com> Add the stubs into the extent allocation and freeing paths that the rmap btree implementation will hook into. While doing this, add the trace points that will be used to track rmap btree extent manipulations. [darrick.wong@oracle.com: Extend the stubs to take full owner info.] Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add tracepoints and error injection for deferred extent freeingDarrick J. Wong
Add a couple of tracepoints for the deferred extent free operation and a site for injecting errors while finishing the operation. This makes it easier to debug deferred ops and test log redo. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: add tracepoints for the deferred ops mechanismDarrick J. Wong
Add tracepoints for the internals of the deferred ops mechanism and tracepoint classes for clients of the dops, to make debugging easier. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: introduce interval queries on btreesDarrick J. Wong
Create a function to enable querying of btree records mapping to a range of keys. This will be used in subsequent patches to allow querying the reverse mapping btree to find the extents mapped to a range of physical blocks, though the generic code can be used for any range query. The overlapped query range function needs to use the btree get_block helper because the root block could be an inode, in which case bc_bufs[nlevels-1] will be NULL. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-08-03xfs: support btrees with overlapping intervals for keysDarrick J. Wong
On a filesystem with both reflink and reverse mapping enabled, it's possible to have multiple rmap records referring to the same blocks on disk. When overlapping intervals are possible, querying a classic btree to find all records intersecting a given interval is inefficient because we cannot use the left side of the search interval to filter out non-matching records the same way that we can use the existing btree key to filter out records coming after the right side of the search interval. This will become important once we want to use the rmap btree to rebuild BMBTs, or implement the (future) fsmap ioctl. (For the non-overlapping case, we can perform such queries trivially by starting at the left side of the interval and walking the tree until we pass the right side.) Therefore, extend the btree code to come closer to supporting intervals as a first-class record attribute. This involves widening the btree node's key space to store both the lowest key reachable via the node pointer (as the btree does now) and the highest key reachable via the same pointer and teaching the btree modifying functions to keep the highest-key records up to date. This behavior can be turned on via a new btree ops flag so that btrees that cannot store overlapping intervals don't pay the overhead costs in terms of extra code and disk format changes. When we're deleting a record in a btree that supports overlapped interval records and the deletion results in two btree blocks being joined, we defer updating the high/low keys until after all possible joining (at higher levels in the tree) have finished. At this point, the btree pointers at all levels have been updated to remove the empty blocks and we can update the low and high keys. When we're doing this, we must be careful to update the keys of all node pointers up to the root instead of stopping at the first set of keys that don't need updating. This is because it's possible for a single deletion to cause joining of multiple levels of tree, and so we need to update everything going back to the root. The diff_two_keys functions return < 0, 0, or > 0 if key1 is less than, equal to, or greater than key2, respectively. This is consistent with the rest of the kernel and the C library. In btree_updkeys(), we need to evaluate the force_all parameter before running the key diff to avoid reading uninitialized memory when we're forcing a key update. This happens when we've allocated an empty slot at level N + 1 to point to a new block at level N and we're in the process of filling out the new keys. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20Merge branch 'xfs-4.8-split-dax-dio' into for-nextDave Chinner
2016-07-20xfs: split direct I/O and DAX pathChristoph Hellwig
So far the DAX code overloaded the direct I/O code path. There is very little in common between the two, and untangling them allows to clean up both variants. As a side effect we also get separate trace points for both I/O types. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-07-20xfs: kill ioflagsChristoph Hellwig
Now that we have the direct I/O kiocb flag there is no real need to sample the value inside of XFS, and the invis flag was always just partially used and isn't worth keeping this infrastructure around for. This also splits the read tracepoint into buffered vs direct as we've done for writes a long time ago. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21Merge branch 'xfs-4.8-misc-fixes-2' into for-nextDave Chinner
2016-06-21xfs: enable buffer deadlock postmortem diagnosis via ftraceDarrick J. Wong
Create a second buf_trylock tracepoint so that we can distinguish between a successful and a failed trylock. With this piece, we can use a script to look at the ftrace output to detect buffer deadlocks. [dchinner: update to if/else as per hch's suggestion] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-06-21xfs: implement iomap based buffered write pathChristoph Hellwig
Convert XFS to use the new iomap based multipage write path. This involves implementing the ->iomap_begin and ->iomap_end methods, and switching the buffered file write, page_mkwrite and xfs_iozero paths to the new iomap helpers. With this change __xfs_get_blocks will never be used for buffered writes, and the code handling them can be removed. Based on earlier code from Dave Chinner. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bob Peterson <rpeterso@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-05-20Merge branch 'xfs-4.7-error-cfg' into for-nextDave Chinner
2016-05-20Merge branch 'xfs-4.7-misc-fixes' into for-nextDave Chinner
2016-05-18xfs: add configurable error support to metadata buffersCarlos Maiolino
With the error configuration handle for async metadata write errors in place, we can now add initial support to the IO error processing in xfs_buf_iodone_error(). Add an infrastructure function to look up the configuration handle, and rearrange the error handling to prepare the way for different error handling conigurations to be used. Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>