summaryrefslogtreecommitdiff
path: root/fs/xfs/libxfs
AgeCommit message (Collapse)Author
2016-01-19Merge branch 'xfs-setxattr-promotion' into for-nextDave Chinner
2016-01-12Merge branch 'xfs-misc-fixes-for-4.5-2' into for-nextxfs-for-linus-4.5Dave Chinner
2016-01-12xfs: handle dquot buffer readahead in log recovery correctlyDave Chinner
When we do dquot readahead in log recovery, we do not use a verifier as the underlying buffer may not have dquots in it. e.g. the allocation operation hasn't yet been replayed. Hence we do not want to fail recovery because we detect an operation to be replayed has not been run yet. This problem was addressed for inodes in commit d891400 ("xfs: inode buffers may not be valid during recovery readahead") but the problem was not recognised to exist for dquots and their buffers as the dquot readahead did not have a verifier. The result of not using a verifier is that when the buffer is then next read to replay a dquot modification, the dquot buffer verifier will only be attached to the buffer if *readahead is not complete*. Hence we can read the buffer, replay the dquot changes and then add it to the delwri submission list without it having a verifier attached to it. This then generates warnings in xfs_buf_ioapply(), which catches and warns about this case. Fix this and make it handle the same readahead verifier error cases as for inode buffers by adding a new readahead verifier that has a write operation as well as a read operation that marks the buffer as not done if any corruption is detected. Also make sure we don't run readahead if the dquot buffer has been marked as cancelled by recovery. This will result in readahead either succeeding and the buffer having a valid write verifier, or readahead failing and the buffer state requiring the subsequent read to resubmit the IO with the new verifier. In either case, this will result in the buffer always ending up with a valid write verifier on it. Note: we also need to fix the inode buffer readahead error handling to mark the buffer with EIO. Brian noticed the code I copied from there wrong during review, so fix it at the same time. Add comments linking the two functions that handle readahead verifier errors together so we don't forget this behavioural link in future. cc: <stable@vger.kernel.org> # 3.12 - current Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-12xfs: inode recovery readahead can race with inode buffer creationDave Chinner
When we do inode readahead in log recovery, we do can do the readahead before we've replayed the icreate transaction that stamps the buffer with inode cores. The inode readahead verifier catches this and marks the buffer as !done to indicate that it doesn't yet contain valid inodes. In adding buffer error notification (i.e. setting b_error = -EIO at the same time as as we clear the done flag) to such a readahead verifier failure, we can then get subsequent inode recovery failing with this error: XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32 This occurs when readahead completion races with icreate item replay such as: inode readahead find buffer lock buffer submit RA io .... icreate recovery xfs_trans_get_buffer find buffer lock buffer <blocks on RA completion> ..... <ra completion> fails verifier clear XBF_DONE set bp->b_error = -EIO release and unlock buffer <icreate gains lock> icreate initialises buffer marks buffer as done adds buffer to delayed write queue releases buffer At this point, we have an initialised inode buffer that is up to date but has an -EIO state registered against it. When we finally get to recovering an inode in that buffer: inode item recovery xfs_trans_read_buffer find buffer lock buffer sees XBF_DONE is set, returns buffer sees bp->b_error is set fail log recovery! Essentially, we need xfs_trans_get_buf_map() to clear the error status of the buffer when doing a lookup. This function returns uninitialised buffers, so the buffer returned can not be in an error state and none of the code that uses this function expects b_error to be set on return. Indeed, there is an ASSERT(!bp->b_error); in the transaction case in xfs_trans_get_buf_map() that would have caught this if log recovery used transactions.... This patch firstly changes the inode readahead failure to set -EIO on the buffer, and secondly changes xfs_buf_get_map() to never return a buffer with an error state set so this first change doesn't cause unexpected log recovery failures. cc: <stable@vger.kernel.org> # 3.12 - current Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-11xfs: eliminate committed arg from xfs_bmap_finishEric Sandeen
Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the associated comments were replicated several times across the attribute code, all dealing with what to do if the transaction was or wasn't committed. And in that replicated code, an ASSERT() test of an uninitialized variable occurs in several locations: error = xfs_attr_thing(&args); if (!error) { error = xfs_bmap_finish(&args.trans, args.flist, &committed); } if (error) { ASSERT(committed); If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish, never set "committed", and then test it in the ASSERT. Fix this up by moving the committed state internal to xfs_bmap_finish, and add a new inode argument. If an inode is passed in, it is passed through to __xfs_trans_roll() and joined to the transaction there if the transaction was committed. xfs_qm_dqalloc() was a little unique in that it called bjoin rather than ijoin, but as Dave points out we can detect the committed state but checking whether (*tpp != tp). Addresses-Coverity-Id: 102360 Addresses-Coverity-Id: 102361 Addresses-Coverity-Id: 102363 Addresses-Coverity-Id: 102364 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-08xfs: bmapbt checking on debug kernels too expensiveDave Chinner
For large sparse or fragmented files, checking every single entry in the bmapbt on every operation is prohibitively expensive. Especially as such checks rarely discover problems during normal operations on high extent coutn files. Our regression tests don't tend to exercise files with hundreds of thousands to millions of extents, so mostly this isn't noticed. However, trying to run things like xfs_mdrestore of large filesystem dumps on a debug kernel quickly becomes impossible as the CPU is completely burnt up repeatedly walking the sparse file bmapbt that is generated for every allocation that is made. Hence, if the file has more than 10,000 extents, just don't bother with walking the tree to check it exhaustively. The btree code has checks that ensure that the newly inserted/removed/modified record is correctly ordered, so the entrie tree walk in thses cases has limited additional value. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-05Merge branch 'xfs-misc-fixes-for-4.5' into for-nextDave Chinner
2016-01-04xfs: introduce per-inode DAX enablementDave Chinner
Rather than just being able to turn DAX on and off via a mount option, some applications may only want to enable DAX for certain performance critical files in a filesystem. This patch introduces a new inode flag to enable DAX in the v3 inode di_flags2 field. It adds support for setting and clearing flags in the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the S_DAX inode flag appropriately when it is seen. When this flag is set on a directory, it acts as an "inherit flag". That is, inodes created in the directory will automatically inherit the on-disk inode DAX flag, enabling administrators to set up directory heirarchies that automatically use DAX. Setting this flag on an empty root directory will make the entire filesystem use DAX by default. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-01-04xfs: use FS_XFLAG definitions directlyDave Chinner
Now that the ioctls have been hoisted up to the VFS level, use the VFs definitions directly and remove the XFS specific definitions completely. Userspace is going to have to handle the change of this interface separately, so removing the definitions from xfs_fs.h is not an issue here at all. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-01-04fs: XFS_IOC_FS[SG]SETXATTR to FS_IOC_FS[SG]ETXATTR promotionDave Chinner
Hoist the ioctl definitions for the XFS_IOC_FS[SG]SETXATTR API from fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls can be used by all filesystems, not just XFS. This enables (initially) ext4 to use the ioctl to set project IDs on inodes. Based-on-patch-from: Li Xi <lixi@ddn.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-01-04libxfs: refactor short btree block verificationDarrick J. Wong
Create xfs_btree_sblock_verify() to verify short-format btree blocks (i.e. the per-AG btrees with 32-bit block pointers) instead of open-coding them. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correctDarrick J. Wong
Because struct xfs_agfl is 36 bytes long and has a 64-bit integer inside it, gcc will quietly round the structure size up to the nearest 64 bits -- in this case, 40 bytes. This results in the XFS_AGFL_SIZE macro returning incorrect results for v5 filesystems on 64-bit machines (118 items instead of 119). As a result, a 32-bit xfs_repair will see garbage in AGFL item 119 and complain. Therefore, tell gcc not to pad the structure so that the AGFL size calculation is correct. cc: <stable@vger.kernel.org> # 3.10 - 4.4 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04libxfs: use a convenience variable instead of open-coding the forkDarrick J. Wong
Use a convenience variable instead of open-coding the inode fork. This isn't really needed for now, but will become important when we add the copy-on-write fork later. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04libxfs: make xfs_alloc_fix_freelist non-staticDarrick J. Wong
Since xfs_repair wants to use xfs_alloc_fix_freelist, remove the static designation. xfsprogs already has this; this simply brings the kernel up to date. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04xfs: get mp from bma->ip in xfs_bmap codeEric Sandeen
In my earlier commit c29aad4 xfs: pass mp to XFS_WANT_CORRUPTED_GOTO I added some local mp variables with code which indicates that mp might be NULL. Coverity doesn't like this now, because the updated per-fs XFS_STATS macros dereference mp. I don't think this is actually a problem; from what I can tell, we cannot get to these functions with a null bma->tp, so my NULL check was probably pointless. Still, it's not super obvious. So switch this code to get mp from the inode on the xfs_bmalloca structure, with no conditional, because the functions are already using bmap->ip directly. Addresses-Coverity-Id: 1339552 Addresses-Coverity-Id: 1339553 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04xfs: print name of verifier if it failsEric Sandeen
This adds a name to each buf_ops structure, so that if a verifier fails we can print the type of verifier that failed it. Should be a slight debugging aid, I hope. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04libxfs: Optimize the loop for xfs_bitmap_emptyJia He
If there is any non zero bit in a long bitmap, it can jump out of the loop and finish the function as soon as possible. Signed-off-by: Jia He <hejianet@gmail.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-04xfs: support a crc verification only log record passBrian Foster
Log recovery torn write detection uses CRC verification over a range of the active log to identify torn writes. Since the generic log recovery pass code implements a superset of the functionality required for CRC verification, it can be easily modified to support a CRC verification only pass. Create a new CRC pass type and update the log record processing helper to skip everything beyond CRC verification when in this mode. This pass will be invoked in subsequent patches to implement torn write detection. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03Merge branch 'xfs-dax-updates' into for-nextDave Chinner
2015-11-03Merge branch 'xfs-misc-fixes-for-4.4-2' into for-nextDave Chinner
2015-11-03xfs: Validate the length of on-disk ACLsAndreas Gruenbacher
In xfs_acl_from_disk, instead of trusting that xfs_acl.acl_cnt is correct, make sure that the length of the attributes is correct as well. Also, turn the aclp parameter into a const pointer. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-11-03xfs: introduce BMAPI_ZERO for allocating zeroed extentsDave Chinner
To enable DAX to do atomic allocation of zeroed extents, we need to drive the block zeroing deep into the allocator. Because xfs_bmapi_write() can return merged extents on allocation that were only partially allocated (i.e. requested range spans allocated and hole regions, allocation into the hole was contiguous), we cannot zero the extent returned from xfs_bmapi_write() as that can overwrite existing data with zeros. Hence we have to drive the extent zeroing into the allocation code, prior to where we merge the extents into the BMBT and return the resultant map. This means we need to propagate this need down to the xfs_alloc_vextent() and issue the block zeroing at this point. While this functionality is being introduced for DAX, there is no reason why it is specific to DAX - we can per-zero blocks during the allocation transaction on any type of device. It's just slow (and usually slower than unwritten allocation and conversion) on traditional block devices so doesn't tend to get used. We can, however, hook hardware zeroing optimisations via sb_issue_zeroout() to this operation, so it may be useful in future and hence the "allocate zeroed blocks" API needs to be implementation neutral. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12Merge branch 'xfs-misc-fixes-for-4.4-1' into for-nextDave Chinner
2015-10-12Merge branch 'xfs-logging-fixes' into for-nextDave Chinner
2015-10-12xfs: per-filesystem stats counter implementationBill O'Donnell
This patch modifies the stats counting macros and the callers to those macros to properly increment, decrement, and add-to the xfs stats counts. The counts for global and per-fs stats are correctly advanced, and cleared by writing a "1" to the corresponding clear file. global counts: /sys/fs/xfs/stats/stats per-fs counts: /sys/fs/xfs/sda*/stats/stats global clear: /sys/fs/xfs/stats/stats_clear per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear [dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around stats structures and macros. ] Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: avoid dependency on Linux XATTR_SIZE_MAXJan Tulak
Currently, we depends on Linux XATTR value for on disk definition. Which causes trouble on other platforms and maybe also if this value was to change. Fix it by creating a custom definition independent from those in Linux (although with the same values), so it is OK with the be16 fields used for holding these attributes. This patch reflects a change in xfsprogs. Signed-off-by: Jan Tulak <jtulak@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: prefix XATTR_LIST_MAX with XFS_Jan Tulak
Remove a hard dependency of Linux XATTR_LIST_MAX value by using a prefixed version. This patch reflects the same change in xfsprogs. Signed-off-by: Jan Tulak <jtulak@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12libxfs: fix two comment typosGeliang Tang
Just fix two typos in code comments. Signed-off-by: Geliang Tang <geliangtang@163.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: validate metadata LSNs against log on v5 superblocksBrian Foster
Since the onset of v5 superblocks, the LSN of the last modification has been included in a variety of on-disk data structures. This LSN is used to provide log recovery ordering guarantees (e.g., to ensure an older log recovery item is not replayed over a newer target data structure). While this works correctly from the point a filesystem is formatted and mounted, userspace tools have some problematic behaviors that defeat this mechanism. For example, xfs_repair historically zeroes out the log unconditionally (regardless of whether corruption is detected). If this occurs, the LSN of the filesystem is reset and the log is now in a problematic state with respect to on-disk metadata structures that might have a larger LSN. Until either the log catches up to the highest previously used metadata LSN or each affected data structure is modified and written out without incident (which resets the metadata LSN), log recovery is susceptible to filesystem corruption. This problem is ultimately addressed and repaired in the associated userspace tools. The kernel is still responsible to detect the problem and notify the user that something is wrong. Check the superblock LSN at mount time and fail the mount if it is invalid. From that point on, trigger verifier failure on any metadata I/O where an invalid LSN is detected. This results in a filesystem shutdown and guarantees that we do not log metadata changes with invalid LSNs on disk. Since this is a known issue with a known recovery path, present a warning to instruct the user how to recover. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-10-12xfs: log local to remote symlink conversions correctly on v5 supersBrian Foster
A local format symlink inode is converted to extent format when an extended attribute is set on an inode as part of the attribute fork creation. This means a block is allocated, the local symlink target name is copied to the block and the block is logged. Currently, xfs_bmap_local_to_extents() handles logging the remote block data based on the size of the data fork prior to the conversion. This is not correct on v5 superblock filesystems, which add an additional header to remote symlink blocks that is nonexistent in local format inodes. As a result, the full length of the remote symlink block content is not logged. This can lead to corruption should a crash occur and log recovery replay this transaction. Since a callout is already used to initialize the new remote symlink block, update the local-to-extents conversion mechanism to make the callout also responsible for logging the block. It is already required to set the log buffer type and format the block appropriately based on the superblock version. This ensures the remote symlink is always logged correctly. Note that xfs_bmap_local_to_extents() is only called for symlinks so there are no other callouts that require modification. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-09-01Merge branch 'xfs-misc-fixes-for-4.3-4' into for-nextxfs-for-linus-4.3Dave Chinner
2015-08-28libxfs: bad magic number should set da block buffer errorDarrick J. Wong
If xfs_da3_node_read_verify() doesn't recognize the magic number of a buffer it's just read, set the buffer error to -EFSCORRUPTED so that the error can be sent up to userspace. Without this patch we'll notice the bad magic eventually while trying to traverse or change the block, but we really ought to fail early in the verifier. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-25Merge branch 'xfs-misc-fixes-for-4.3-3' into for-nextDave Chinner
2015-08-25xfs: Fix file type directory corruption for btree directoriesJan Kara
Users have occasionally reported that file type for some directory entries is wrong. This mostly happened after updating libraries some libraries. After some debugging the problem was traced down to xfs_dir2_node_replace(). The function uses args->filetype as a file type to store in the replaced directory entry however it also calls xfs_da3_node_lookup_int() which will store file type of the current directory entry in args->filetype. Thus we fail to change file type of a directory entry to a proper type. Fix the problem by storing new file type in a local variable before calling xfs_da3_node_lookup_int(). cc: <stable@vger.kernel.org> # 3.16 - 4.x Reported-by: Giacomo Comes <comes@naic.edu> Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-25xfs: Fix uninitialized return value in xfs_alloc_fix_freelist()Jan Kara
xfs_alloc_fix_freelist() can sometimes jump to out_agbp_relse without ever setting value of 'error' variable which is then returned. This can happen e.g. when pag->pagf_init is set but AG is for metadata and we want to allocate user data. Fix the problem by initializing 'error' to 0, which is the desired return value when we decide to skip this group. CC: xfs@oss.sgi.com Coverity-id: 1309714 Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-20Merge branch 'xfs-misc-fixes-for-4.3-2' into for-nextDave Chinner
2015-08-19xfs: Fix xfs_attr_leafblock definitionJan Kara
struct xfs_attr_leafblock contains 'entries' array which is declared with size 1 altough it can in fact contain much more entries. Since this array is followed by further struct members, gcc (at least in version 4.8.3) thinks that the array has the fixed size of 1 element and thus may optimize away all accesses beyond the end of array resulting in non-working code. This problem was only observed with userspace code in xfsprogs, however it's better to be safe in kernel as well and have matching kernel and xfsprogs definitions. cc: <stable@vger.kernel.org> Signed-off-by: Jan Kara <jack@suse.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19libxfs: readahead of dir3 data blocks should use the read verifierDarrick J. Wong
In the dir3 data block readahead function, use the regular read verifier to check the block's CRC and spot-check the block contents instead of directly calling only the spot-checking routine. This prevents corrupted directory data blocks from being read into the kernel, which can lead to garbage ls output and directory loops (if say one of the entries contains slashes and other junk). cc: <stable@vger.kernel.org> # 3.12 - 4.2 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19xfs: stop holding ILOCK over filldir callbacksDave Chinner
The recent change to the readdir locking made in 40194ec ("xfs: reinstate the ilock in xfs_readdir") for CXFS directory sanity was probably the wrong thing to do. Deep in the readdir code we can take page faults in the filldir callback, and so taking a page fault while holding an inode ilock creates a new set of locking issues that lockdep warns all over the place about. The locking order for regular inodes w.r.t. page faults is io_lock -> pagefault -> mmap_sem -> ilock. The directory readdir code now triggers ilock -> page fault -> mmap_sem. While we cannot deadlock at this point, it inverts all the locking patterns that lockdep normally sees on XFS inodes, and so triggers lockdep. We worked around this with commit 93a8614 ("xfs: fix directory inode iolock lockdep false positive"), but that then just moved the lockdep warning to deeper in the page fault path and triggered on security inode locks. Fixing the shmem issue there just moved the lockdep reports somewhere else, and now we are getting false positives from filesystem freezing annotations getting confused. Further, if we enter memory reclaim in a readdir path, we now get lockdep warning about potential deadlocks because the ilock is held when we enter reclaim. This, again, is different to a regular file in that we never allow memory reclaim to run while holding the ilock for regular files. Hence lockdep now throws ilock->kmalloc->reclaim->ilock warnings. Basically, the problem is that the ilock is being used to protect the directory data and the inode metadata, whereas for a regular file the iolock protects the data and the ilock protects the metadata. From the VFS perspective, the i_mutex serialises all accesses to the directory data, and so not holding the ilock for readdir doesn't matter. The issue is that CXFS doesn't access directory data via the VFS, so it has no "data serialisaton" mechanism. Hence we need to hold the IOLOCK in the correct places to provide this low level directory data access serialisation. The ilock can then be used just when the extent list needs to be read, just like we do for regular files. The directory modification code can take the iolock exclusive when the ilock is also taken, and this then ensures that readdir is correct excluded while modifications are in progress. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19xfs: swap leaf buffer into path struct atomically during path shiftBrian Foster
The node directory lookup code uses a state structure that tracks the path of buffers used to search for the hash of a filename through the leaf blocks. When the lookup encounters a block that ends with the requested hash, but the entry has not yet been found, it must shift over to the next block and continue looking for the entry (i.e., duplicate hashes could continue over into the next block). This shift mechanism involves walking back up and down the state structure, replacing buffers at the appropriate btree levels as necessary. When a buffer is replaced, the old buffer is released and the new buffer read into the active slot in the path structure. Because the buffer is read directly into the path slot, a buffer read failure can result in setting a NULL buffer pointer in an active slot. This throws off the state cleanup code in xfs_dir2_node_lookup(), which expects to release a buffer from each active slot. Instead, a BUG occurs due to a NULL pointer dereference: BUG: unable to handle kernel NULL pointer dereference at 00000000000001e8 IP: [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs] ... RIP: 0010:[<ffffffffa0585063>] [<ffffffffa0585063>] xfs_trans_brelse+0x2a3/0x3c0 [xfs] ... Call Trace: [<ffffffffa05250c6>] xfs_dir2_node_lookup+0xa6/0x2c0 [xfs] [<ffffffffa0519f7c>] xfs_dir_lookup+0x1ac/0x1c0 [xfs] [<ffffffffa055d0e1>] xfs_lookup+0x91/0x290 [xfs] [<ffffffffa05580b3>] xfs_vn_lookup+0x73/0xb0 [xfs] [<ffffffff8122de8d>] lookup_real+0x1d/0x50 [<ffffffff8123330e>] path_openat+0x91e/0x1490 [<ffffffff81235079>] do_filp_open+0x89/0x100 ... This has been reproduced via a parallel fsstress and filesystem shutdown workload in a loop. The shutdown triggers the read error in the aforementioned codepath and causes the BUG in xfs_dir2_node_lookup(). Update xfs_da3_path_shift() to update the active path slot atomically with respect to the caller when a buffer is replaced. This ensures that the caller always sees the old or new buffer in the slot and prevents the NULL pointer dereference. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19xfs: relocate sparse inode mount warningBrian Foster
The sparse inodes feature is currently considered experimental. We warn at mount time from xfs_mount_validate_sb(). This function is part of the superblock verifier codepath, however, which means it could be invoked repeatedly on superblock reads or writes. This is currently only noticeable from userspace, where mkfs produces multiple warnings at format time. As mkfs warnings were not the intent of this change, relocate the mount time warning to xfs_fs_fill_super(), which is only invoked once and only in kernel space. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19xfs: set XFS_DA_OP_OKNOENT in xfs_attr_getEric Sandeen
It's entirely possible for userspace to ask for an xattr which does not exist. Normally, there is no problem whatsoever when we ask for such a thing, but when we look at an obfuscated metadump image on a debug kernel with selinux, we trip over this ASSERT in xfs_da3_path_shift(): *result = -ENOENT; /* we're out of our tree */ ASSERT(args->op_flags & XFS_DA_OP_OKNOENT); It (more or less) only shows up in the above scenario, because xfs_metadump obfuscates attr names, but chooses names which keep the same hash value - and xfs_da3_node_lookup_int does: if (((retval == -ENOENT) || (retval == -ENOATTR)) && (blk->hashval == args->hashval)) { error = xfs_da3_path_shift(state, &state->path, 1, 1, &retval); IOWS, we only get down to the xfs_da3_path_shift() ASSERT if we are looking for an xattr which doesn't exist, but we find xattrs on disk which have the same hash, and so might be a hash collision, so we try the path shift. When *that* fails to find what we're looking for, we hit the assert about XFS_DA_OP_OKNOENT. Simply setting XFS_DA_OP_OKNOENT in xfs_attr_get solves this rather corner-case problem with no ill side effects. It's fine for an attr name lookup to fail. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19Merge branch 'xfs-efi-rework' into for-nextDave Chinner
2015-08-19xfs: add missing bmap cancel calls in error pathsBrian Foster
If a failure occurs after the bmap free list is populated and before xfs_bmap_finish() completes successfully (which returns a partial list on failure), the bmap free list must be cancelled. Otherwise, the extent items on the list are never freed and a memory leak occurs. Several random error paths throughout the code suffer this problem. Fix these up such that xfs_bmap_cancel() is always called on error. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-08-19xfs: fix btree cursor error cleanupsBrian Foster
The btree cursor cleanup function takes an error parameter that affects how buffers are released from the cursor. All buffers are released in the event of error. Several callers do not specify the XFS_BTREE_ERROR flag in the event of error, however. This can cause buffers to hang around locked or with an elevated hold count and thus lead to umount hangs in the event of errors. Fix up the xfs_btree_del_cursor() callers to pass XFS_BTREE_ERROR if the cursor is being torn down due to error. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29Merge branch 'xfs-meta-uuid' into for-nextDave Chinner
2015-07-29Merge branch 'xfs-misc-fixes-for-4.3' into for-nextDave Chinner
2015-07-29xfs: create new metadata UUID field and incompat flagEric Sandeen
This adds a new superblock field, sb_meta_uuid. If set, along with a new incompat flag, the code will use that field on a V5 filesystem to compare to metadata UUIDs, which allows us to change the user- visible UUID at will. Userspace handles the setting and clearing of the incompat flag as appropriate, as the UUID gets changed; i.e. setting the user-visible UUID back to the original UUID (as stored in the new field) will remove the incompatible feature flag. If the incompat flag is not set, this copies the user-visible UUID into into the meta_uuid slot in memory when the superblock is read from disk; the meta_uuid field is not written back to disk in this case. The remainder of this patch simply switches verifiers, initializers, etc to use the new sb_meta_uuid field. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29libxfs: add xfs_bit.cDave Chinner
The header side of xfs_bit.c is already in libxfs, and the sparse inode code requires the xfs_next_bit() function so pull in the xfs_bit.c file so that a sparse inode enabled libxfs compiles cleanly in userspace. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-07-29xfs: Use consistent logging message prefixesJoe Perches
The second and subsequent lines of multi-line logging messages are not prefixed with the same information as the first line. Separate messages with newlines into multiple calls to ensure consistent prefixing and allow easier grep use. Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>