summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_io.c
AgeCommit message (Collapse)Author
2020-01-04btrfs: return error pointer from alloc_test_extent_bufferDan Carpenter
[ Upstream commit b6293c821ea8fa2a631a2112cd86cd435effeb8b ] Callers of alloc_test_extent_buffer have not correctly interpreted the return value as error pointer, as alloc_test_extent_buffer should behave as alloc_extent_buffer. The self-tests were unaffected but btrfs_find_create_tree_block could call both functions and that would cause problems up in the call chain. Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
2019-03-23Btrfs: fix corruption reading shared and compressed extents after hole punchingFilipe Manana
commit 8e928218780e2f1cf2f5891c7575e8f0b284fcce upstream. In the past we had data corruption when reading compressed extents that are shared within the same file and they are consecutive, this got fixed by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") and by commit 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents"). However there was a case that was missing in those fixes, which is when the shared and compressed extents are referenced with a non-zero offset. The following shell script creates a reproducer for this issue: #!/bin/bash mkfs.btrfs -f /dev/sdc &> /dev/null mount -o compress /dev/sdc /mnt/sdc # Create a file with 3 consecutive compressed extents, each has an # uncompressed size of 128Kb and a compressed size of 4Kb. for ((i = 1; i <= 3; i++)); do head -c 4096 /dev/zero for ((j = 1; j <= 31; j++)); do head -c 4096 /dev/zero | tr '\0' "\377" done done > /mnt/sdc/foobar sync echo "Digest after file creation: $(md5sum /mnt/sdc/foobar)" # Clone the first extent into offsets 128K and 256K. xfs_io -c "reflink /mnt/sdc/foobar 0 128K 128K" /mnt/sdc/foobar xfs_io -c "reflink /mnt/sdc/foobar 0 256K 128K" /mnt/sdc/foobar sync echo "Digest after cloning: $(md5sum /mnt/sdc/foobar)" # Punch holes into the regions that are already full of zeroes. xfs_io -c "fpunch 0 4K" /mnt/sdc/foobar xfs_io -c "fpunch 128K 4K" /mnt/sdc/foobar xfs_io -c "fpunch 256K 4K" /mnt/sdc/foobar sync echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" echo "Dropping page cache..." sysctl -q vm.drop_caches=1 echo "Digest after hole punching: $(md5sum /mnt/sdc/foobar)" umount /dev/sdc When running the script we get the following output: Digest after file creation: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar linked 131072/131072 bytes at offset 131072 128 KiB, 1 ops; 0.0033 sec (36.960 MiB/sec and 295.6830 ops/sec) linked 131072/131072 bytes at offset 262144 128 KiB, 1 ops; 0.0015 sec (78.567 MiB/sec and 628.5355 ops/sec) Digest after cloning: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Digest after hole punching: 5a0888d80d7ab1fd31c229f83a3bbcc8 /mnt/sdc/foobar Dropping page cache... Digest after hole punching: fba694ae8664ed0c2e9ff8937e7f1484 /mnt/sdc/foobar This happens because after reading all the pages of the extent in the range from 128K to 256K for example, we read the hole at offset 256K and then when reading the page at offset 260K we don't submit the existing bio, which is responsible for filling all the page in the range 128K to 256K only, therefore adding the pages from range 260K to 384K to the existing bio and submitting it after iterating over the entire range. Once the bio completes, the uncompressed data fills only the pages in the range 128K to 256K because there's no more data read from disk, leaving the pages in the range 260K to 384K unfilled. It is just a slightly different variant of what was solved by commit 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents"). Fix this by forcing a bio submit, during readpages(), whenever we find a compressed extent map for a page that is different from the extent map for the previous page or has a different starting offset (in case it's the same compressed extent), instead of the extent map's original start offset. A test case for fstests follows soon. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: 808f80b46790f ("Btrfs: update fix for read corruption of compressed and shared extents") Fixes: 005efedf2c7d0 ("Btrfs: fix read corruption of compressed and shared extents") Cc: stable@vger.kernel.org # 4.3+ Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16btrfs: struct-funcs, constify readersJeff Mahoney
commit 1cbb1f454e5321e47fc1e6b233066c7ccc979d15 upstream. We have reader helpers for most of the on-disk structures that use an extent_buffer and pointer as offset into the buffer that are read-only. We should mark them as const and, in turn, allow consumers of these interfaces to mark the buffers const as well. No impact on code, but serves as documentation that a buffer is intended not to be modified. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16Btrfs: memset to avoid stale content in btree leafLiu Bo
commit 851cd173f06045816528176001cf82948282029c upstream. This is an additional patch to "Btrfs: memset to avoid stale content in btree node block". This uses memset to initialize the unused space in a leaf to avoid potential stale content, which may be incurred by pushing items between sibling leaves. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-01-16Btrfs: memset to avoid stale content in btree node blockLiu Bo
commit 3eb548ee3a8042d95ad81be254e67a5222c24e03 upstream. During updating btree, we could push items between sibling nodes/leaves, for leaves data sections starts reversely from the end of the block while for nodes we only have key pairs which are stored one by one from the start of the block. So we could do try to push key pairs from one node to the next node right in the tree, and after that, we update the node's nritems to reflect the correct end while leaving the stale content in the node. One may intentionally corrupt the fs image and access the stale content by bumping the nritems and causes various crashes. This takes the in-memory @nritems as the correct one and gets to memset the unused part of a btree node. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-04-13btrfs: fix incorrect error return ret being passed to mapping_set_errorColin Ian King
[ Upstream commit bff5baf8aa37a97293725a16c03f49872249c07e ] The setting of return code ret should be based on the error code passed into function end_extent_writepage and not on ret. Thanks to Liu Bo for spotting this mistake in the original fix I submitted. Detected by CoverityScan, CID#1414312 ("Logically dead code") Fixes: 5dca6eea91653e ("Btrfs: mark mapping with error flag to report errors to userspace") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <alexander.levin@microsoft.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-01-06Btrfs: fix memory leak in reading btree blocksLiu Bo
commit 2571e739677f1e4c0c63f5ed49adcc0857923625 upstream. So we can read a btree block via readahead or intentional read, and we can end up with a memory leak when something happens as follows, 1) readahead starts to read block A but does not wait for read completion, 2) btree_readpage_end_io_hook finds that block A is corrupted, and it needs to clear all block A's pages' uptodate bit. 3) meanwhile an intentional read kicks in and checks block A's pages' uptodate to decide which page needs to be read. 4) when some pages have the uptodate bit during 3)'s check so 3) doesn't count them for eb->io_pages, but they are later cleared by 2) so we has to readpage on the page, we get the wrong eb->io_pages which results in a memory leak of this block. This fixes the problem by firstly getting all pages's locking and then checking pages' uptodate bit. t1(readahead) t2(readahead endio) t3(the following read) read_extent_buffer_pages end_bio_extent_readpage for pg in eb: for page 0,1,2 in eb: if pg is uptodate: btree_readpage_end_io_hook(pg) num_reads++ if uptodate: eb->io_pages = num_reads SetPageUptodate(pg) _______________ for pg in eb: for page 3 in eb: read_extent_buffer_pages if pg is NOT uptodate: btree_readpage_end_io_hook(pg) for pg in eb: __extent_read_full_page(pg) sanity check reports something wrong if pg is uptodate: clear_extent_buffer_uptodate(eb) num_reads++ for pg in eb: eb->io_pages = num_reads ClearPageUptodate(page) _______________ for pg in eb: if pg is NOT uptodate: __extent_read_full_page(pg) So t3's eb->io_pages is not consistent with the number of pages it's reading, and during endio(), atomic_dec_and_test(&eb->io_pages) will get a negative number so that we're not able to free the eb. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-08-20block: add missing group association in bio-cloning functionsPaolo Valente
commit 20bd723ec6a3261df5e02250cd3a1fbb09a343f2 upstream. When a bio is cloned, the newly created bio must be associated with the same blkcg as the original bio (if BLK_CGROUP is enabled). If this operation is not performed, then the new bio is not associated with any group, and the group of the current task is returned when the group of the bio is requested. Depending on the cloning frequency, this may cause a large percentage of the bios belonging to a given group to be treated as if belonging to other groups (in most cases as if belonging to the root group). The expected group isolation may thereby be broken. This commit adds the missing association in bio-cloning functions. Fixes: da2f0f74cf7d ("Btrfs: add support for blkio controllers") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Nikolay Borisov <kernel@kyup.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-11-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge second patch-bomb from Andrew Morton: - most of the rest of MM - procfs - lib/ updates - printk updates - bitops infrastructure tweaks - checkpatch updates - nilfs2 update - signals - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc, dma-debug, dma-mapping, ... * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (102 commits) ipc,msg: drop dst nil validation in copy_msg include/linux/zutil.h: fix usage example of zlib_adler32() panic: release stale console lock to always get the logbuf printed out dma-debug: check nents in dma_sync_sg* dma-mapping: tidy up dma_parms default handling pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode kexec: use file name as the output message prefix fs, seqfile: always allow oom killer seq_file: reuse string_escape_str() fs/seq_file: use seq_* helpers in seq_hex_dump() coredump: change zap_threads() and zap_process() to use for_each_thread() coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT) signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread() signal: turn dequeue_signal_lock() into kernel_dequeue_signal() signals: kill block_all_signals() and unblock_all_signals() nilfs2: fix gcc uninitialized-variable warnings in powerpc build nilfs2: fix gcc unused-but-set-variable warnings MAINTAINERS: nilfs2: add header file for tracing nilfs2: add tracepoints for analyzing reading and writing metadata files ...
2015-11-06mm, page_alloc: distinguish between being unable to sleep, unwilling to ↵Mel Gorman
sleep and avoiding waking kswapd __GFP_WAIT has been used to identify atomic context in callers that hold spinlocks or are in interrupts. They are expected to be high priority and have access one of two watermarks lower than "min" which can be referred to as the "atomic reserve". __GFP_HIGH users get access to the first lower watermark and can be called the "high priority reserve". Over time, callers had a requirement to not block when fallback options were available. Some have abused __GFP_WAIT leading to a situation where an optimisitic allocation with a fallback option can access atomic reserves. This patch uses __GFP_ATOMIC to identify callers that are truely atomic, cannot sleep and have no alternative. High priority users continue to use __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers that want to wake kswapd for background reclaim. __GFP_WAIT is redefined as a caller that is willing to enter direct reclaim and wake kswapd for background reclaim. This patch then converts a number of sites o __GFP_ATOMIC is used by callers that are high priority and have memory pools for those requests. GFP_ATOMIC uses this flag. o Callers that have a limited mempool to guarantee forward progress clear __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall into this category where kswapd will still be woken but atomic reserves are not used as there is a one-entry mempool to guarantee progress. o Callers that are checking if they are non-blocking should use the helper gfpflags_allow_blocking() where possible. This is because checking for __GFP_WAIT as was done historically now can trigger false positives. Some exceptions like dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to flag manipulations. o Callers that built their own GFP flags instead of starting with GFP_KERNEL and friends now also need to specify __GFP_KSWAPD_RECLAIM. The first key hazard to watch out for is callers that removed __GFP_WAIT and was depending on access to atomic reserves for inconspicuous reasons. In some cases it may be appropriate for them to use __GFP_HIGH. The second key hazard is callers that assembled their own combination of GFP flags instead of starting with something like GFP_KERNEL. They may now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless if it's missed in most cases as other activity will wake kswapd. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-10-21btrfs: extent_io: Introduce new function clear_record_extent_bits()Qu Wenruo
Introduce new function clear_record_extent_bits(), which will clear bits for given range and record the details about which ranges are cleared and how many bytes in total it changes. This provides the basis for later qgroup reserve codes. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21btrfs: extent_io: Introduce new function set_record_extent_bitsQu Wenruo
Introduce new function set_record_extent_bits(), which will not only set given bits, but also record how many bytes are changed, and detailed range info. This is quite important for later qgroup reserve framework. The number of bytes will be used to do qgroup reserve, and detailed range info will be used to cleanup for EQUOT case. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-14Btrfs: fix double range unlock of hole region when reading pageFilipe Manana
If when reading a page we find a hole and our caller had already locked the range (bio flags has the bit EXTENT_BIO_PARENT_LOCKED set), we end up unlocking the hole's range and then later our caller unlocks it again, which might have already been locked by some other task once the first unlock happened. Currently this can only happen during a call to the extent_same ioctl, as it's the only caller of __do_readpage() that sets the bit EXTENT_BIO_PARENT_LOCKED for bio flags. Fix this by leaving the unlock exclusively to the caller. Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-12Merge branch 'cleanup/messages' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-09Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are small and assorted. Neil's is the oldest, I dropped the ball thinking he was going to send it in" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: support NFSv2 export Btrfs: open_ctree: Fix possible memory leak Btrfs: fix deadlock when finalizing block group creation Btrfs: update fix for read corruption of compressed and shared extents Btrfs: send, fix corner case for reference overwrite detection
2015-10-08btrfs: switch more printks to our helpersDavid Sterba
Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08btrfs: switch message printers to ratelimited variantsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08btrfs: switch message printers to ratelimited _in_rcu variantsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-05Btrfs: update fix for read corruption of compressed and shared extentsFilipe Manana
My previous fix in commit 005efedf2c7d ("Btrfs: fix read corruption of compressed and shared extents") was effective only if the compressed extents cover a file range with a length that is not a multiple of 16 pages. That's because the detection of when we reached a different range of the file that shares the same compressed extent as the previously processed range was done at extent_io.c:__do_contiguous_readpages(), which covers subranges with a length up to 16 pages, because extent_readpages() groups the pages in clusters no larger than 16 pages. So fix this by tracking the start of the previously processed file range's extent map at extent_readpages(). The following test case for fstests reproduces the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create our test file with a single extent of 64Kb that is going to # be compressed no matter which compression algo is used (zlib/lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 64K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone the compressed extent into an adjacent file offset. $CLONER_PROG -s 0 -d $((64 * 1024)) -l $((64 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo echo "File digest before unmount:" md5sum $SCRATCH_MNT/foo | _filter_scratch # Remount the fs or clear the page cache to trigger the bug in # btrfs. Because the extent has an uncompressed length that is a # multiple of 16 pages, all the pages belonging to the second range # of the file (64K to 128K), which points to the same extent as the # first range (0K to 64K), had their contents full of zeroes instead # of the byte 0xaa. This was a bug exclusively in the read path of # compressed extents, the correct data was stored on disk, btrfs # just failed to fill in the pages correctly. _scratch_remount echo "File digest after remount:" # Must match the digest we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: Timofey Titovets <nefelim4ag@gmail.com>
2015-09-25Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is an assorted set I've been queuing up: Jeff Mahoney tracked down a tricky one where we ended up starting IO on the wrong mapping for special files in btrfs_evict_inode. A few people reported this one on the list. Filipe found (and provided a test for) a difficult bug in reading compressed extents, and Josef fixed up some quota record keeping with snapshot deletion. Chandan killed off an accounting bug during DIO that lead to WARN_ONs as we freed inodes" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: keep dropped roots in cache until transaction commit Btrfs: Direct I/O: Fix space accounting btrfs: skip waiting on ordered range for special files Btrfs: fix read corruption of compressed and shared extents Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock Btrfs: don't initialize a space info as full to prevent ENOSPC
2015-09-15Btrfs: fix read corruption of compressed and shared extentsFilipe Manana
If a file has a range pointing to a compressed extent, followed by another range that points to the same compressed extent and a read operation attempts to read both ranges (either completely or part of them), the pages that correspond to the second range are incorrectly filled with zeroes. Consider the following example: File layout [0 - 8K] [8K - 24K] | | | | points to extent X, points to extent X, offset 4K, length of 8K offset 0, length 16K [extent X, compressed length = 4K uncompressed length = 16K] If a readpages() call spans the 2 ranges, a single bio to read the extent is submitted - extent_io.c:submit_extent_page() would only create a new bio to cover the second range pointing to the extent if the extent it points to had a different logical address than the extent associated with the first range. This has a consequence of the compressed read end io handler (compression.c:end_compressed_bio_read()) finish once the extent is decompressed into the pages covering the first range, leaving the remaining pages (belonging to the second range) filled with zeroes (done by compression.c:btrfs_clear_biovec_end()). So fix this by submitting the current bio whenever we find a range pointing to a compressed extent that was preceded by a range with a different extent map. This is the simplest solution for this corner case. Making the end io callback populate both ranges (or more, if we have multiple pointing to the same extent) is a much more complex solution since each bio is tightly coupled with a single extent map and the extent maps associated to the ranges pointing to the shared extent can have different offsets and lengths. The following test case for fstests triggers the issue: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner rm -f $seqres.full test_clone_and_read_compressed_extent() { local mount_opts=$1 _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount $mount_opts # Create a test file with a single extent that is compressed (the # data we write into it is highly compressible no matter which # compression algorithm is used, zlib or lzo). $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 4K" \ -c "pwrite -S 0xbb 4K 8K" \ -c "pwrite -S 0xcc 12K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # Now clone our extent into an adjacent offset. $CLONER_PROG -s $((4 * 1024)) -d $((16 * 1024)) -l $((8 * 1024)) \ $SCRATCH_MNT/foo $SCRATCH_MNT/foo # Same as before but for this file we clone the extent into a lower # file offset. $XFS_IO_PROG -f -c "pwrite -S 0xaa 8K 4K" \ -c "pwrite -S 0xbb 12K 8K" \ -c "pwrite -S 0xcc 20K 4K" \ $SCRATCH_MNT/bar | _filter_xfs_io $CLONER_PROG -s $((12 * 1024)) -d 0 -l $((8 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/bar echo "File digests before unmounting filesystem:" md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch # Evicting the inode or clearing the page cache before reading # again the file would also trigger the bug - reads were returning # all bytes in the range corresponding to the second reference to # the extent with a value of 0, but the correct data was persisted # (it was a bug exclusively in the read path). The issue happened # only if the same readpages() call targeted pages belonging to the # first and second ranges that point to the same compressed extent. _scratch_remount echo "File digests after mounting filesystem again:" # Must match the same digests we got before. md5sum $SCRATCH_MNT/foo | _filter_scratch md5sum $SCRATCH_MNT/bar | _filter_scratch } echo -e "\nTesting with zlib compression..." test_clone_and_read_compressed_extent "-o compress=zlib" _scratch_unmount echo -e "\nTesting with lzo compression..." test_clone_and_read_compressed_extent "-o compress=lzo" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo<quwenruo@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
2015-09-05Merge branch 'for-linus-4.3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has Jeff Mahoney's long standing trim patch that fixes corners where trims were missing. Omar has some raid5/6 fixes, especially for using scrub and device replace when devices are missing. Zhao Lie continues cleaning and fixing things, this series fixes some really hard to hit corners in xfstests. I had to pull it last merge window due to some deadlocks, but those are now resolved. I added support for Tejun's new blkio controllers. It seems to work well for single devices, we'll expand to multi-device as well" * 'for-linus-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (47 commits) btrfs: fix compile when block cgroups are not enabled Btrfs: fix file read corruption after extent cloning and fsync Btrfs: check if previous transaction aborted to avoid fs corruption btrfs: use __GFP_NOFAIL in alloc_btrfs_bio btrfs: Prevent from early transaction abort btrfs: Remove unused arguments in tree-log.c btrfs: Remove useless condition in start_log_trans() Btrfs: add support for blkio controllers Btrfs: remove unused mutex from struct 'btrfs_fs_info' Btrfs: fix parity scrub of RAID 5/6 with missing device Btrfs: fix device replace of a missing RAID 5/6 device Btrfs: add RAID 5/6 BTRFS_RBIO_REBUILD_MISSING operation Btrfs: count devices correctly in readahead during RAID 5/6 replace Btrfs: remove misleading handling of missing device scrub btrfs: fix clone / extent-same deadlocks Btrfs: fix defrag to merge tail file extent Btrfs: fix warning in backref walking btrfs: Add WARN_ON() for double lock in btrfs_tree_lock() btrfs: Remove root argument in extent_data_ref_count() btrfs: Fix wrong comment of btrfs_alloc_tree_block() ...
2015-08-21btrfs: fix compile when block cgroups are not enabledChris Mason
bio->bi_css and bio->bi_ioc don't exist when block cgroups are not on. This adds an ifdef around them. It's not perfect, but our use of bi_ioc is being removed in the 4.3 merge window. The bi_css usage really should go into bio_clone, but I want to make sure that doesn't introduce problems for other bio_clone use cases. Signed-off-by: Chris Mason <clm@fb.com>
2015-08-19btrfs: Prevent from early transaction abortMichal Hocko
Btrfs relies on GFP_NOFS allocation when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) so this is not a problem currently but there is an attempt to move away from the default no-fail behavior and allow these allocation to fail more eagerly. And this would lead to a pre-mature transaction abort as follows: [ 55.328093] Call Trace: [ 55.328890] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b [ 55.330518] [<ffffffff8108fa28>] ? console_unlock+0x334/0x363 [ 55.332738] [<ffffffff8110873e>] __alloc_pages_nodemask+0x81d/0x8d4 [ 55.334910] [<ffffffff81100752>] pagecache_get_page+0x10e/0x20c [ 55.336844] [<ffffffffa007d916>] alloc_extent_buffer+0xd0/0x350 [btrfs] [ 55.338973] [<ffffffffa0059d8c>] btrfs_find_create_tree_block+0x15/0x17 [btrfs] [ 55.341329] [<ffffffffa004f728>] btrfs_alloc_tree_block+0x18c/0x405 [btrfs] [ 55.343566] [<ffffffffa003fa34>] split_leaf+0x1e4/0x6a6 [btrfs] [ 55.345577] [<ffffffffa0040567>] btrfs_search_slot+0x671/0x831 [btrfs] [ 55.347679] [<ffffffff810682d7>] ? get_parent_ip+0xe/0x3e [ 55.349434] [<ffffffffa0041cb2>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs] [ 55.351681] [<ffffffffa004ecfb>] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs] [ 55.353979] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs] [ 55.356212] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.358378] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.360626] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs] [ 55.362894] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.365221] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs] [ 55.367273] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e [ 55.369047] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e [ 55.370654] [<ffffffff81186869>] do_fsync+0x34/0x4e [ 55.372246] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14 [ 55.373851] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f [ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory [ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction. [ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure). [ 55.384280] ------------[ cut here ]------------ [ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]() [...] [ 55.384337] Call Trace: [ 55.384353] [<ffffffff8154e6f0>] dump_stack+0x4f/0x7b [ 55.384357] [<ffffffff8107f717>] ? down_trylock+0x2d/0x37 [ 55.384359] [<ffffffff81046977>] warn_slowpath_common+0xa1/0xbb [ 55.384398] [<ffffffffa00a1d6b>] ? btrfs_select_ref_head+0xd9/0xfe [btrfs] [ 55.384400] [<ffffffff81046a34>] warn_slowpath_null+0x1a/0x1c [ 55.384423] [<ffffffffa00a1d6b>] btrfs_select_ref_head+0xd9/0xfe [btrfs] [ 55.384446] [<ffffffffa004e5f7>] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs] [ 55.384455] [<ffffffffa004e600>] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs] [ 55.384476] [<ffffffffa00512ea>] btrfs_run_delayed_refs+0x6e/0x226 [btrfs] [ 55.384499] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384521] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384543] [<ffffffffa0060221>] btrfs_commit_transaction+0x4c/0xaba [btrfs] [ 55.384565] [<ffffffffa0060e21>] ? start_transaction+0x192/0x534 [btrfs] [ 55.384588] [<ffffffffa0073428>] btrfs_sync_file+0x29c/0x310 [btrfs] [ 55.384591] [<ffffffff81186808>] vfs_fsync_range+0x8f/0x9e [ 55.384592] [<ffffffff81186833>] vfs_fsync+0x1c/0x1e [ 55.384593] [<ffffffff81186869>] do_fsync+0x34/0x4e [ 55.384594] [<ffffffff81186ab3>] SyS_fsync+0x10/0x14 [ 55.384595] [<ffffffff81554f97>] system_call_fastpath+0x12/0x6f [...] [ 55.384608] ---[ end trace c29799da1d4dd621 ]--- [ 55.437323] BTRFS info (device hdb1): forced readonly [ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry Fix this by being explicit about the no-fail behavior of this allocation path and use __GFP_NOFAIL. Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-13block: remove bio_get_nr_vecs()Kent Overstreet
We can always fill up the bio now, no need to estimate the possible size based on queue parameters. Acked-by: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [hch: rebased and wrote a changelog] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-08-09Btrfs: add support for blkio controllersChris Mason
This attaches accounting information to bios as we submit them so the new blkio controllers can throttle on btrfs filesystems. Not much is required, we're just associating bios with blkcgs during clone, calling wbc_init_bio()/wbc_account_io() during writepages submission, and attaching the bios to the current context during direct IO. Finally if we are splitting bios during btrfs_map_bio, this attaches accounting information to the split. The end result is able to throttle nicely on single disk filesystems. A little more work is required for multi-device filesystems. Signed-off-by: Chris Mason <clm@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-30Merge branch 'for-linus-4.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "Outside of our usual batch of fixes, this integrates the subvolume quota updates that Qu Wenruo from Fujitsu has been working on for a few releases now. He gets an extra gold star for making btrfs smaller this time, and fixing a number of quota corners in the process. Dave Sterba tested and integrated Anand Jain's sysfs improvements. Outside of exporting a symbol (ack'd by Greg) these are all internal to btrfs and it's mostly cleanups and fixes. Anand also attached some of our sysfs objects to our internal device management structs instead of an object off the super block. It will make device management easier overall and it's a better fit for how the sysfs files are used. None of the existing sysfs files are moved around. Thanks for all the fixes everyone" * 'for-linus-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (87 commits) btrfs: delayed-ref: double free in btrfs_add_delayed_tree_ref() Btrfs: Check if kobject is initialized before put lib: export symbol kobject_move() Btrfs: sysfs: add support to show replacing target in the sysfs Btrfs: free the stale device Btrfs: use received_uuid of parent during send Btrfs: fix use-after-free in btrfs_replay_log btrfs: wait for delayed iputs on no space btrfs: qgroup: Make snapshot accounting work with new extent-oriented qgroup. btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. btrfs: ulist: Add ulist_del() function. btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. btrfs: qgroup: Switch rescan to new mechanism. btrfs: qgroup: Add new qgroup calculation function btrfs_qgroup_account_extents(). btrfs: backref: Add special time_seq == (u64)-1 case for btrfs_find_all_roots(). btrfs: qgroup: Add new function to record old_roots. btrfs: qgroup: Record possible quota-related extent for qgroup. btrfs: qgroup: Add function qgroup_update_counters(). ...
2015-06-25Merge branch 'for-4.2/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO update from Jens Axboe: "Nothing really major in here, mostly a collection of smaller optimizations and cleanups, mixed with various fixes. In more detail, this contains: - Addition of policy specific data to blkcg for block cgroups. From Arianna Avanzini. - Various cleanups around command types from Christoph. - Cleanup of the suspend block I/O path from Christoph. - Plugging updates from Shaohua and Jeff Moyer, for blk-mq. - Eliminating atomic inc/dec of both remaining IO count and reference count in a bio. From me. - Fixes for SG gap and chunk size support for data-less (discards) IO, so we can merge these better. From me. - Small restructuring of blk-mq shared tag support, freeing drivers from iterating hardware queues. From Keith Busch. - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the IOPS mode the default for non-rotational storage" * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits) cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL cfq-iosched: fix sysfs oops when attempting to read unconfigured weights cfq-iosched: move group scheduling functions under ifdef cfq-iosched: fix the setting of IOPS mode on SSDs blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file block, cgroup: implement policy-specific per-blkcg data block: Make CFQ default to IOPS mode on SSDs block: add blk_set_queue_dying() to blkdev.h blk-mq: Shared tag enhancements block: don't honor chunk sizes for data-less IO block: only honor SG gap prevention for merges that contain data block: fix returnvar.cocci warnings block, dm: don't copy bios for request clones block: remove management of bi_remaining when restoring original bi_end_io block: replace trylock with mutex_lock in blkdev_reread_part() block: export blkdev_reread_part() and __blkdev_reread_part() suspend: simplify block I/O handling block: collapse bio bit space block: remove unused BIO_RW_BLOCK and BIO_EOF flags block: remove BIO_EOPNOTSUPP ...
2015-06-03Btrfs: set UNWRITTEN for prealloc'ed extents in fiemapJosef Bacik
We should be doing this, it's weird we hadn't been doing this. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03Btrfs: wake up extent state waiters on unlock through clear_extent_bitsFilipe Manana
When we clear an extent state's EXTENT_LOCKED bit with clear_extent_bits() through free_io_failure(), we weren't waking up any tasks waiting for the extent's state EXTENT_LOCKED bit, leading to an hang. So make sure clear_extent_bits() ends up waking up any waiters if the bit EXTENT_LOCKED is supplied by its callers. Zygo Blaxell was experiencing such hangs at inode eviction time after file unlinks. Thanks to him for a set of scripts to reproduce the issue. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-05-19block: remove BIO_EOPNOTSUPPChristoph Hellwig
Since the big barrier rewrite/removal in 2007 we never fail FLUSH or FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag to help propagating those to the buffer_head layer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-11Btrfs: fix race when reusing stale extent buffers that leads to BUG_ONFilipe Manana
There's a race between releasing extent buffers that are flagged as stale and recycling them that makes us it the following BUG_ON at btrfs_release_extent_buffer_page: BUG_ON(extent_buffer_under_io(eb)) The BUG_ON is triggered because the extent buffer has the flag EXTENT_BUFFER_DIRTY set as a consequence of having been reused and made dirty by another concurrent task. Here follows a sequence of steps that leads to the BUG_ON. CPU 0 CPU 1 CPU 2 path->nodes[0] == eb X X->refs == 2 (1 for the tree, 1 for the path) btrfs_header_generation(X) == current trans id flag EXTENT_BUFFER_DIRTY set on X btrfs_release_path(path) unlocks X reads eb X X->refs incremented to 3 locks eb X btrfs_del_items(X) X becomes empty clean_tree_block(X) clear EXTENT_BUFFER_DIRTY from X btrfs_del_leaf(X) unlocks X extent_buffer_get(X) X->refs incremented to 4 btrfs_free_tree_block(X) X's range is not pinned X's range added to free space cache free_extent_buffer_stale(X) lock X->refs_lock set EXTENT_BUFFER_STALE on X release_extent_buffer(X) X->refs decremented to 3 unlocks X->refs_lock btrfs_release_path() unlocks X free_extent_buffer(X) X->refs becomes 2 __btrfs_cow_block(Y) btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() gets offset == X->start btrfs_init_new_buffer(X->start) btrfs_find_create_tree_block(X->start) alloc_extent_buffer(X->start) find_extent_buffer(X->start) finds eb X in radix tree free_extent_buffer(X) lock X->refs_lock test X->refs == 2 test bit EXTENT_BUFFER_STALE is set test !extent_buffer_under_io(eb) increments X->refs to 3 mark_extent_buffer_accessed(X) check_buffer_tree_ref(X) --> does nothing, X->refs >= 2 and EXTENT_BUFFER_TREE_REF is set in X clear EXTENT_BUFFER_STALE from X locks X btrfs_mark_buffer_dirty() set_extent_buffer_dirty(X) check_buffer_tree_ref(X) --> does nothing, X->refs >= 2 and EXTENT_BUFFER_TREE_REF is set sets EXTENT_BUFFER_DIRTY on X test and clear EXTENT_BUFFER_TREE_REF decrements X->refs to 2 release_extent_buffer(X) decrements X->refs to 1 unlock X->refs_lock unlock X free_extent_buffer(X) lock X->refs_lock release_extent_buffer(X) decrements X->refs to 0 btrfs_release_extent_buffer_page(X) BUG_ON(extent_buffer_under_io(X)) --> EXTENT_BUFFER_DIRTY set on X Fix this by making find_extent buffer wait for any ongoing task currently executing free_extent_buffer()/free_extent_buffer_stale() if the extent buffer has the stale flag set. A more clean alternative would be to always increment the extent buffer's reference count while holding its refs_lock spinlock but find_extent_buffer is a performance critical area and that would cause lock contention whenever multiple tasks search for the same extent buffer concurrently. A build server running a SLES 12 kernel (3.12 kernel + over 450 upstream btrfs patches backported from newer kernels) was hitting this often: [1212302.461948] kernel BUG at ../fs/btrfs/extent_io.c:4507! (...) [1212302.470219] CPU: 1 PID: 19259 Comm: bs_sched Not tainted 3.12.36-38-default #1 [1212302.540792] Hardware name: Supermicro PDSM4/PDSM4, BIOS 6.00 04/17/2006 [1212302.540792] task: ffff8800e07e0100 ti: ffff8800d6412000 task.ti: ffff8800d6412000 [1212302.540792] RIP: 0010:[<ffffffffa0507081>] [<ffffffffa0507081>] btrfs_release_extent_buffer_page.constprop.51+0x101/0x110 [btrfs] (...) [1212302.630008] Call Trace: [1212302.630008] [<ffffffffa05070cd>] release_extent_buffer+0x3d/0xa0 [btrfs] [1212302.630008] [<ffffffffa04c2d9d>] btrfs_release_path+0x1d/0xa0 [btrfs] [1212302.630008] [<ffffffffa04c5c7e>] read_block_for_search.isra.33+0x13e/0x3a0 [btrfs] [1212302.630008] [<ffffffffa04c8094>] btrfs_search_slot+0x3f4/0xa80 [btrfs] [1212302.630008] [<ffffffffa04cf5d8>] lookup_inline_extent_backref+0xf8/0x630 [btrfs] [1212302.630008] [<ffffffffa04d13dd>] __btrfs_free_extent+0x11d/0xc40 [btrfs] [1212302.630008] [<ffffffffa04d64a4>] __btrfs_run_delayed_refs+0x394/0x11d0 [btrfs] [1212302.630008] [<ffffffffa04db379>] btrfs_run_delayed_refs.part.66+0x69/0x280 [btrfs] [1212302.630008] [<ffffffffa04ed2ad>] __btrfs_end_transaction+0x2ad/0x3d0 [btrfs] [1212302.630008] [<ffffffffa04f7505>] btrfs_evict_inode+0x4a5/0x500 [btrfs] [1212302.630008] [<ffffffff811b9e28>] evict+0xa8/0x190 [1212302.630008] [<ffffffff811b0330>] do_unlinkat+0x1a0/0x2b0 I was also able to reproduce this on a 3.19 kernel, corresponding to Chris' integration branch from about a month ago, running the following stress test on a qemu/kvm guest (with 4 virtual cpus and 16Gb of ram): while true; do mkfs.btrfs -l 4096 -f -b `expr 20 \* 1024 \* 1024 \* 1024` /dev/sdd mount /dev/sdd /mnt snapshot_cmd="btrfs subvolume snapshot -r /mnt" snapshot_cmd="$snapshot_cmd /mnt/snap_\`date +'%H_%M_%S_%N'\`" fsstress -d /mnt -n 25000 -p 8 -x "$snapshot_cmd" -X 100 umount /mnt done Which usually triggers the BUG_ON within less than 24 hours: [49558.618097] ------------[ cut here ]------------ [49558.619732] kernel BUG at fs/btrfs/extent_io.c:4551! (...) [49558.620031] CPU: 3 PID: 23908 Comm: fsstress Tainted: G W 3.19.0-btrfs-next-7+ #3 [49558.620031] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [49558.620031] task: ffff8800319fc0d0 ti: ffff880220da8000 task.ti: ffff880220da8000 [49558.620031] RIP: 0010:[<ffffffffa0476b1a>] [<ffffffffa0476b1a>] btrfs_release_extent_buffer_page+0x20/0xe9 [btrfs] (...) [49558.620031] Call Trace: [49558.620031] [<ffffffffa0476c73>] release_extent_buffer+0x90/0xd3 [btrfs] [49558.620031] [<ffffffff8142b10c>] ? _raw_spin_lock+0x3b/0x43 [49558.620031] [<ffffffffa0477052>] ? free_extent_buffer+0x37/0x94 [btrfs] [49558.620031] [<ffffffffa04770ab>] free_extent_buffer+0x90/0x94 [btrfs] [49558.620031] [<ffffffffa04396d5>] btrfs_release_path+0x4a/0x69 [btrfs] [49558.620031] [<ffffffffa0444907>] __btrfs_free_extent+0x778/0x80c [btrfs] [49558.620031] [<ffffffffa044a485>] __btrfs_run_delayed_refs+0xad2/0xc62 [btrfs] [49558.728054] [<ffffffff811420d5>] ? kmemleak_alloc_recursive.constprop.52+0x16/0x18 [49558.728054] [<ffffffffa044c1e8>] btrfs_run_delayed_refs+0x6d/0x1ba [btrfs] [49558.728054] [<ffffffffa045917f>] ? join_transaction.isra.9+0xb9/0x36b [btrfs] [49558.728054] [<ffffffffa045a75c>] btrfs_commit_transaction+0x4c/0x981 [btrfs] [49558.728054] [<ffffffffa0434f86>] btrfs_sync_fs+0xd5/0x10d [btrfs] [49558.728054] [<ffffffff81155923>] ? iterate_supers+0x60/0xc4 [49558.728054] [<ffffffff8117966a>] ? do_sync_work+0x91/0x91 [49558.728054] [<ffffffff8117968a>] sync_fs_one_sb+0x20/0x22 [49558.728054] [<ffffffff81155939>] iterate_supers+0x76/0xc4 [49558.728054] [<ffffffff811798e8>] sys_sync+0x55/0x83 [49558.728054] [<ffffffff8142bbd2>] system_call_fastpath+0x12/0x17 Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-29Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extentForrest Liu
btrfs_release_extent_buffer_page() can't handle dummy extent that allocated by btrfs_clone_extent_buffer() properly. That is because reference count of pages that allocated by btrfs_clone_extent_buffer() was 2, 1 by alloc_page(), and another by attach_extent_buffer_page(). Running following command repeatly can check this memory leak problem btrfs inspect-internal inode-resolve 256 /mnt/btrfs Signed-off-by: Chien-Kuan Yeh <ckya@synology.com> Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: fix race on ENOMEM in alloc_extent_bufferOmar Sandoval
Consider the following interleaving of overlapping calls to alloc_extent_buffer: Call 1: - Successfully allocates a few pages with find_or_create_page - find_or_create_page fails, goto free_eb - Unlocks the allocated pages Call 2: - Calls find_or_create_page and gets a page in call 1's extent_buffer - Finds that the page is already associated with an extent_buffer - Grabs a reference to the half-written extent_buffer and calls mark_extent_buffer_accessed on it mark_extent_buffer_accessed will then try to call mark_page_accessed on a null page and panic. The fix is to decrement the reference count on the half-written extent_buffer before unlocking the pages so call 2 won't use it. We should also set exists = NULL in the case that we don't use exists to avoid accidentally returning a freed extent_buffer in an error case. Signed-off-by: Omar Sandoval <osandov@osandov.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-26btrfs: incorrect handling for fiemap_fill_next_extent returnChengyu Song
fiemap_fill_next_extent returns 0 on success, -errno on error, 1 if this was the last extent that will fit in user array. If 1 is returned, the return value may eventually returned to user space, which should not happen, according to manpage of ioctl. Signed-off-by: Chengyu Song <csong84@gatech.edu> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-03-21Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Most of these are fixing extent reservation accounting, or corners with tree writeback during commit. Josef's set does add a test, which isn't strictly a fix, but it'll keep us from making this same mistake again" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix outstanding_extents accounting in DIO Btrfs: add sanity test for outstanding_extents accounting Btrfs: just free dummy extent buffers Btrfs: account merges/splits properly Btrfs: prepare block group cache before writing Btrfs: fix ASSERT(list_empty(&cur_trans->dirty_bgs_list) Btrfs: account for the correct number of extents for delalloc reservations Btrfs: fix merge delalloc logic Btrfs: fix comp_oper to get right order Btrfs: catch transaction abortion after waiting for it btrfs: fix sizeof format specifier in btrfs_check_super_valid()
2015-03-17Btrfs: just free dummy extent buffersJosef Bacik
If we fail during our sanity tests we could get NULL deref's because we unload the module before the dummy extent buffers are free'd via RCU. So check for this case and just free the things directly. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com>
2015-02-19Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is mostly cleanups and fixes: - The raid5/6 cleanups from Zhao Lei fixup some long standing warts in the code and add improvements on top of the scrubbing support from 3.19. - Josef has round one of our ENOSPC fixes coming from large btrfs clusters here at FB. - Dave Sterba continues a long series of cleanups (thanks Dave), and Filipe continues hammering on corner cases in fsync and others This all was held up a little trying to track down a use-after-free in btrfs raid5/6. It's not clear yet if this is just made easier to trigger with this pull or if its a new bug from the raid5/6 cleanups. Dave Sterba is the only one to trigger it so far, but he has a consistent way to reproduce, so we'll get it nailed shortly" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (68 commits) Btrfs: don't remove extents and xattrs when logging new names Btrfs: fix fsync data loss after adding hard link to inode Btrfs: fix BUG_ON in btrfs_orphan_add() when delete unused block group Btrfs: account for large extents with enospc Btrfs: don't set and clear delalloc for O_DIRECT writes Btrfs: only adjust outstanding_extents when we do a short write btrfs: Fix out-of-space bug Btrfs: scrub, fix sleep in atomic context Btrfs: fix scheduler warning when syncing log Btrfs: Remove unnecessary placeholder in btrfs_err_code btrfs: cleanup init for list in free-space-cache btrfs: delete chunk allocation attemp when setting block group ro btrfs: clear bio reference after submit_one_bio() Btrfs: fix scrub race leading to use-after-free Btrfs: add missing cleanup on sysfs init failure Btrfs: fix race between transaction commit and empty block group removal btrfs: add more checks to btrfs_read_sys_array btrfs: cleanup, rename a few variables in btrfs_read_sys_array btrfs: add checks for sys_chunk_array sizes btrfs: more superblock checks, lower bounds on devices and sectorsize/nodesize ...
2015-02-14Btrfs: account for large extents with enospcJosef Bacik
On our gluster boxes we stream large tar balls of backups onto our fses. With 160gb of ram this means we get really large contiguous ranges of dirty data, but the way our ENOSPC stuff works is that as long as it's contiguous we only hold metadata reservation for one extent. The problem is we limit our extents to 128mb, so we'll end up with at least 800 extents so our enospc accounting is quite a bit lower than what we need. To keep track of this make sure we increase outstanding_extents for every multiple of the max extent size so we can be sure to have enough reserved metadata space. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-02-11page_writeback: put account_page_redirty() after set_page_dirty()Konstantin Khebnikov
Helper account_page_redirty() fixes dirty pages counter for redirtied pages. This patch puts it after dirtying and prevents temporary underflows of dirtied pages counters on zone/bdi and current->nr_dirtied. Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-02btrfs: clear bio reference after submit_one_bio()Naohiro Aota
After submit_one_bio(), `bio' can go away. However submit_extent_page() leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM). It will cause invalid paging request when submit_extent_page() is called next time. I reproduced ENOMEM case with the following script (need CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS). #!/bin/bash dmesgout=dmesg.txt start=100000 end=300000 step=1000 # btrfs options device=/dev/vdb1 directory=/mnt/btrfs # fault-injection options percent=100 times=3 mkdir -p $directory || exit 1 mount -o compress $device $directory || exit 1 rm -f $directory/file || exit 1 dd if=/dev/zero of=$directory/file bs=1M count=512 || exit 1 for interval in `seq $start $step $end`; do dmesg -C echo 1 > /proc/sys/vm/drop_caches sync export FAILCMD_TYPE=fail_page_alloc ./failcmd.sh -p $percent -t $times -i $interval \ --ignore-gfp-highmem=N --ignore-gfp-wait=N --min-order=0 \ -- \ cat $directory/file > /dev/null dmesg > ${dmesgout} if grep -q BUG: ${dmesgout}; then cat ${dmesgout} exit 1 fi done umount $directory exit 0 Signed-off-by: Naohiro Aota <naota@elisp.net> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21Btrfs: add ref_count and free function for btrfs_bioZhao Lei
1: ref_count is simple than current RBIO_HOLD_BBIO_MAP_BIT flag to keep btrfs_bio's memory in raid56 recovery implement. 2: free function for bbio will make code clean and flexible, plus forced data type checking in compile. Changelog v1->v2: Rename following by David Sterba's suggestion: put_btrfs_bio() -> btrfs_put_bio() get_btrfs_bio() -> btrfs_get_bio() bbio->ref_count -> bbio->refs Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-21btrfs: switch extent_state state to unsignedDavid Sterba
Currently there's a 4B hole in the structure between refs and state and there are only 16 bits used so we can make it unsigned. This will get a better packing and may save some stack space for local variables. The size of extent_state gets reduced by 8B and there are usually a lot of slab objects. struct extent_state { u64 start; /* 0 8 */ u64 end; /* 8 8 */ struct rb_node rb_node; /* 16 24 */ wait_queue_head_t wq; /* 40 24 */ /* --- cacheline 1 boundary (64 bytes) --- */ atomic_t refs; /* 64 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int state; /* 72 8 */ u64 private; /* 80 8 */ /* size: 88, cachelines: 2, members: 7 */ /* sum members: 84, holes: 1, sum holes: 4 */ /* last cacheline: 24 bytes */ }; Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-01-19btrfs: fix state->private cast on 32 bit machinesSatoru Takeuchi
Suppress the following warning displayed on building 32bit (i686) kernel. =============================================================================== ... CC [M] fs/btrfs/extent_io.o fs/btrfs/extent_io.c: In function ‘btrfs_free_io_failure_record’: fs/btrfs/extent_io.c:2193:13: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast] failrec = (struct io_failure_record *)state->private; ... =============================================================================== Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Reported-by: Chris Murphy <chris@colorremedies.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-12-12btrfs: sink parameter len to alloc_extent_bufferDavid Sterba
Because we're using globally known nodesize. Do the same for the sanity test function variant. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: unify extent buffer allocation apiDavid Sterba
Make the extent buffer allocation interface consistent. Cloned eb will set a valid fs_info. For dummy eb, we can drop the length parameter and set it from fs_info. The built-in sanity checks may pass a NULL fs_info that's queried for nodesize, but we know it's 4096. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-12-12btrfs: use GFP_NOFS in __alloc_extent_buffer directlyDavid Sterba
Same mask from all callers. Signed-off-by: David Sterba <dsterba@suse.cz>
2014-11-20Btrfs: avoid premature -ENOMEM in clear_extent_bit()Filipe Manana
We try to allocate an extent state structure before acquiring the extent state tree's spinlock as we might need a new one later and therefore avoid doing later an atomic allocation while holding the tree's spinlock. However we returned -ENOMEM if that initial non-atomic allocation failed, which is a bit excessive since we might end up not needing the pre-allocated extent state at all - for the case where the tree doesn't have any extent states that cover the input range and cover too any other range. Therefore don't return -ENOMEM if that pre-allocation fails. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2014-11-20Btrfs: avoid returning -ENOMEM in convert_extent_bit() too earlyFilipe Manana
We try to allocate an extent state before acquiring the tree's spinlock just in case we end up needing to split an existing extent state into two. If that allocation failed, we would return -ENOMEM. However, our only single caller (transaction/log commit code), passes in an extent state that was cached from a call to find_first_extent_bit() and that has a very high chance to match exactly the input range (always true for a transaction commit and very often, but not always, true for a log commit) - in this case we end up not needing at all that initial extent state used for an eventual split. Therefore just don't return -ENOMEM if we can't allocate the temporary extent state, since we might not need it at all, and if we end up needing one, we'll do it later anyway. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>