summaryrefslogtreecommitdiff
path: root/fs/btrfs
AgeCommit message (Collapse)Author
2014-09-26mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman
possible commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream. aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26callers of iov_copy_from_user_atomic() don't need pagecache_disable()Al Viro
commit 9e8c2af96e0d2d5fe298dd796fb6bc16e888a48d upstream. ... it does that itself (via kmap_atomic()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-26mm + fs: prepare for non-page entries in page cache radix treesJohannes Weiner
commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream. shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-17Btrfs: fix crash on endio of reading corrupted blockLiu Bo
commit 38c1c2e44bacb37efd68b90b3f70386a8ee370ee upstream. The crash is ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:2124! [...] Workqueue: btrfs-endio normal_work_helper [btrfs] RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs] This is in fact a regression. It is because we forgot to increase @offset properly in reading corrupted block, so that the @offset remains, and this leads to checksum errors while reading left blocks queued up in the same bio, and then ends up with hiting the above BUG_ON. Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-17Btrfs: fix compressed write corruption on enospcLiu Bo
commit ce62003f690dff38d3164a632ec69efa15c32cbf upstream. When failing to allocate space for the whole compressed extent, we'll fallback to uncompressed IO, but we've forgotten to redirty the pages which belong to this compressed extent, and these 'clean' pages will simply skip 'submit' part and go to endio directly, at last we got data corruption as we write nothing. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-17Btrfs: read lock extent buffer while walking backrefsFilipe Manana
commit 6f7ff6d7832c6be13e8c95598884dbc40ad69fb7 upstream. Before processing the extent buffer, acquire a read lock on it, so that we're safe against concurrent updates on the extent buffer. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-17Btrfs: fix csum tree corruption, duplicate and outdated checksumsFilipe Manana
commit 27b9a8122ff71a8cadfbffb9c4f0694300464f3b upstream. Under rare circumstances we can end up leaving 2 versions of a checksum for the same file extent range. The reason for this is that after calling btrfs_next_leaf we process slot 0 of the leaf it returns, instead of processing the slot set in path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after btrfs_next_leaf() releases the path and before it searches for the next leaf, another task might cause a split of the next leaf, which migrates some of its keys to the leaf we were processing before calling btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the same leaf but with path->slots[0] having a slot number corresponding to the first new key it got, that is, a slot number that didn't exist before calling btrfs_next_leaf(), as the leaf now has more keys than it had before. So we must really process the returned leaf starting at path->slots[0] always, as it isn't always 0, and the key at slot 0 can have an offset much lower than our search offset/bytenr. For example, consider the following scenario, where we have: sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568 four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472 Leaf N: slot = 0 slot = btrfs_header_nritems() - 1 |-------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] | |-------------------------------------------------------------------| Leaf N + 1: slot = 0 slot = btrfs_header_nritems() - 1 |--------------------------------------------------------------------| | [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 | |--------------------------------------------------------------------| Because we are at the last slot of leaf N, we call btrfs_next_leaf() to find the next highest key, which releases the current path and then searches for that next key. However after releasing the path and before finding that next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore btrfs_next_leaf() will returns us a path again with leaf N but with the slot pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N is then: slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1 |----------------------------------------------------------------------------------------------------| | [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM 40116224), size 4] [(CSUM CSUM 40161280), size 32] | |----------------------------------------------------------------------------------------------------| And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump into the "insert:" label, which will set tmp to: tmp = min((sums->len - total_bytes) >> blocksize_bits, (next_offset - file_key.offset) >> blocksize_bits) = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) = min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4 and ins_size = csum_size * tmp = 4 * 4 = 16 bytes. In other words, we insert a new csum item in the tree with key (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong, because the item with key (CSUM CSUM 40161280) (the one that was moved from leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288 bytes of our data and won't get those old checksums removed. So this leaves us 2 different checksums for 3 4kb blocks of data in the tree, and breaks the logical rule: Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover An obvious bad effect of this is that a subsequent csum tree lookup to get the checksum of any of the blocks with logical offset of 40161280, 40165376 or 40169472 (the last 3 4kb blocks of file data), will get the old checksums. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-09-17Btrfs: Fix memory corruption by ulist_add_merge() on 32bit archTakashi Iwai
commit 4eb1f66dce6c4dc28dd90a7ffbe6b2b1cb08aa4e upstream. We've got bug reports that btrfs crashes when quota is enabled on 32bit kernel, typically with the Oops like below: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs] *pde = 00000000 Oops: 0000 [#1] SMP CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs] task: f1478130 ti: f147c000 task.ti: f147c000 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0 EIP is at find_parent_nodes+0x360/0x1380 [btrfs] EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690 Stack: 00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050 00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000 00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000 Call Trace: [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs] [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs] [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs] [<c025e38b>] process_one_work+0x11b/0x390 [<c025eea1>] worker_thread+0x101/0x340 [<c026432b>] kthread+0x9b/0xb0 [<c0712a71>] ret_from_kernel_thread+0x21/0x30 [<c0264290>] kthread_create_on_node+0x110/0x110 This indicates a NULL corruption in prefs_delayed list. The further investigation and bisection pointed that the call of ulist_add_merge() results in the corruption. ulist_add_merge() takes u64 as aux and writes a 64bit value into old_aux. The callers of this function in backref.c, however, pass a pointer of a pointer to old_aux. That is, the function overwrites 64bit value on 32bit pointer. This caused a NULL in the adjacent variable, in this case, prefs_delayed. Here is a quick attempt to band-aid over this: a new function, ulist_add_merge_ptr() is introduced to pass/store properly a pointer value instead of u64. There are still ugly void ** cast remaining in the callers because void ** cannot be taken implicitly. But, it's safer than explicit cast to u64, anyway. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046 Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02btrfs: fix use of uninit "ret" in end_extent_writepage()Eric Sandeen
commit 3e2426bd0eb980648449e7a2f5a23e3cd3c7725c upstream. If this condition in end_extent_writepage() is false: if (tree->ops && tree->ops->writepage_end_io_hook) we will then test an uninitialized "ret" at: ret = ret < 0 ? ret : -EIO; The test for ret is for the case where ->writepage_end_io_hook failed, and we'd choose that ret as the error; but if there is no ->writepage_end_io_hook, nothing sets ret. Initializing ret to 0 should be sufficient; if writepage_end_io_hook wasn't set, (!uptodate) means non-zero err was passed in, so we choose -EIO in that case. Signed-of-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: fix scrub_print_warning to handle skinny metadata extentsLiu Bo
commit 6eda71d0c030af0fc2f68aaa676e6d445600855b upstream. The skinny extents are intepreted incorrectly in scrub_print_warning(), and end up hitting the BUG() in btrfs_extent_inline_ref_size. Reported-by: Konstantinos Skarlatos <k.skarlatos@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: use right type to get real comparisonLiu Bo
commit cd857dd6bc2ae9ecea14e75a34e8a8fdc158e307 upstream. We want to make sure the point is still within the extent item, not to verify the memory it's pointing to. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02fs: btrfs: volumes.c: Fix for possible null pointer dereferenceRickard Strandqvist
commit 8321cf2596d283821acc466377c2b85bcd3422b7 upstream. There is otherwise a risk of a possible null pointer dereference. Was largely found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: send, don't error in the presence of subvols/snapshotsFilipe Manana
commit 1af56070e3ef9477dbc7eba3b9ad7446979c7974 upstream. If we are doing an incremental send and the base snapshot has a directory with name X that doesn't exist anymore in the second snapshot and a new subvolume/snapshot exists in the second snapshot that has the same name as the directory (name X), the incremental send would fail with -ENOENT error. This is because it attempts to lookup for an inode with a number matching the objectid of a root, which doesn't exist. Steps to reproduce: mkfs.btrfs -f /dev/sdd mount /dev/sdd /mnt mkdir /mnt/testdir btrfs subvolume snapshot -r /mnt /mnt/mysnap1 rmdir /mnt/testdir btrfs subvolume create /mnt/testdir btrfs subvolume snapshot -r /mnt /mnt/mysnap2 btrfs send -p /mnt/mysnap1 /mnt/mysnap2 -f /tmp/send.data A test case for xfstests follows. Reported-by: Robert White <rwhite@pobox.com> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: set right total device count for seeding supportWang Shilong
commit 298658414a2f0bea1f05a81876a45c1cd96aa2e0 upstream. Seeding device support allows us to create a new filesystem based on existed filesystem. However newly created filesystem's @total_devices should include seed devices. This patch fix the following problem: # mkfs.btrfs -f /dev/sdb # btrfstune -S 1 /dev/sdb # mount /dev/sdb /mnt # btrfs device add -f /dev/sdc /mnt --->fs_devices->total_devices = 1 # umount /mnt # mount /dev/sdc /mnt --->fs_devices->total_devices = 2 This is because we record right @total_devices in superblock, but @fs_devices->total_devices is reset to be 0 in btrfs_prepare_sprout(). Fix this problem by not resetting @fs_devices->total_devices. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: mark mapping with error flag to report errors to userspaceLiu Bo
commit 5dca6eea91653e9949ce6eb9e9acab6277e2f2c4 upstream. According to commit 865ffef3797da2cac85b3354b5b6050dc9660978 (fs: fix fsync() error reporting), it's not stable to just check error pages because pages can be truncated or invalidated, we should also mark mapping with error flag so that a later fsync can catch the error. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: fix NULL pointer crash of deleting a seed deviceLiu Bo
commit 29cc83f69c8338ff8fd1383c9be263d4bdf52d73 upstream. Same as normal devices, seed devices should be initialized with fs_info->dev_root as well, otherwise we'll get a NULL pointer crash. Cc: Chris Murphy <lists@colorremedies.com> Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: make sure there are not any read requests before stopping workersWang Shilong
commit de348ee022175401e77d7662b7ca6e231a94e3fd upstream. In close_ctree(), after we have stopped all workers,there maybe still some read requests(for example readahead) to submit and this *maybe* trigger an oops that user reported before: kernel BUG at fs/btrfs/async-thread.c:619! By hacking codes, i can reproduce this problem with one cpu available. We fix this potential problem by invalidating all btree inode pages before stopping all workers. Thanks to Miao for pointing out this problem. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: output warning instead of error when loading free space cache failedMiao Xie
commit 32d6b47fe6fc1714d5f1bba1b9f38e0ab0ad58a8 upstream. If we fail to load a free space cache, we can rebuild it from the extent tree, so it is not a serious error, we should not output a error message that would make the users uncomfortable. This patch uses warning message instead of it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02btrfs: Add ctime/mtime update for btrfs device add/remove.Qu Wenruo
commit 5a1972bd9fd4b2fb1bac8b7a0b636d633d8717e3 upstream. Btrfs will send uevent to udev inform the device change, but ctime/mtime for the block device inode is not udpated, which cause libblkid used by btrfs-progs unable to detect device change and use old cache, causing 'btrfs dev scan; btrfs dev rmove; btrfs dev scan' give an error message. Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Cc: Karel Zak <kzak@redhat.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-07-02Btrfs: fix double free in find_lock_delalloc_rangeChris Mason
commit 7d78874273463a784759916fc3e0b4e2eb141c70 upstream. We need to NULL the cached_state after freeing it, otherwise we might free it again if find_delalloc_range doesn't find anything. Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: fix BUG_ON() casued by the reserved space migrationMiao Xie
commit 20dd2cbf01888a91fdd921403040a710b275a1ff upstream. When we did space balance and snapshot creation at the same time, we might meet the following oops: kernel BUG at fs/btrfs/inode.c:3038! [SNIP] Call Trace: [<ffffffffa0411ec7>] btrfs_orphan_cleanup+0x293/0x407 [btrfs] [<ffffffffa042dc45>] btrfs_mksubvol.isra.28+0x259/0x373 [btrfs] [<ffffffffa042de85>] btrfs_ioctl_snap_create_transid+0x126/0x156 [btrfs] [<ffffffffa042dff1>] btrfs_ioctl_snap_create_v2+0xd0/0x121 [btrfs] [<ffffffffa0430b2c>] btrfs_ioctl+0x414/0x1854 [btrfs] [<ffffffff813b60b7>] ? __do_page_fault+0x305/0x379 [<ffffffff811215a9>] vfs_ioctl+0x1d/0x39 [<ffffffff81121d7c>] do_vfs_ioctl+0x32d/0x3e2 [<ffffffff81057fe7>] ? finish_task_switch+0x80/0xb8 [<ffffffff81121e88>] SyS_ioctl+0x57/0x83 [<ffffffff813b39ff>] ? do_device_not_available+0x12/0x14 [<ffffffff813b99c2>] system_call_fastpath+0x16/0x1b [SNIP] RIP [<ffffffffa040da40>] btrfs_orphan_add+0xc3/0x126 [btrfs] The reason of the problem is that the relocation root creation stole the reserved space, which was reserved for orphan item deletion. There are several ways to fix this problem, one is to increasing the reserved space size of the space balace, and then we can use that space to create the relocation tree for each fs/file trees. But it is hard to calculate the suitable size because we doesn't know how many fs/file trees we need relocate. We fixed this problem by reserving the space for relocation root creation actively since the space it need is very small (one tree block, used for root node copy), then we use that reserved space to create the relocation tree. If we don't reserve space for relocation tree creation, we will use the reserved space of the balance. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: fix two use-after-free bugs with transaction cleanupJosef Bacik
commit 724e2315db3d59a8201d4a87c7c7a873e60e1ce0 upstream. I was noticing the slab redzone stuff going off every once and a while during transaction aborts. This was caused by two things 1) We would walk the pending snapshots and set their error to -ECANCELED. We don't need to do this, the snapshot stuff waits for a transaction commit and if there is a problem we just free our pending snapshot object and exit. Doing this was causing us to touch the pending snapshot object after the thing had already been freed. 2) We were freeing the transaction manually with wanton disregard for it's use_count reference counter. To fix this I cleaned up the transaction freeing loop to either wait for the transaction commit to finish if it was in the middle of that (since it will be cleaned and freed up there) or to do the cleanup oursevles. I also moved the global "kill all things dirty everywhere" stuff outside of the transaction cleanup loop since that only needs to be done once. With this patch I'm no longer seeing slab corruption because of use after frees. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: don't delete ordered roots from list during cleanupJosef Bacik
commit 1de2cfde93c20a0357ff1dffed901598470facf3 upstream. During transaction cleanup after an abort we are just removing roots from the ordered roots list which is incorrect. We have a BUG_ON() to make sure that the root is still part of the ordered roots list when we put our ordered extent which we were tripping in this case. So do like we do everywhere else and just move it to the tail of the ordered roots list and allow the normal cleanup to take care of stuff. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: cleanup transaction on abortJosef Bacik
commit 4e121c06adf53aae478ebce3035116595d063413 upstream. If we abort not during a transaction commit we won't clean up anything until we unmount. Unfortunately if we abort in the middle of writing out an ordered extent we won't clean it up and if somebody is waiting on that ordered extent they will wait forever. To fix this just make the transaction kthread call the cleanup transaction stuff if it notices theres an error, and make btrfs_end_transaction wake up the transaction kthread if there is an error. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: do not release metadata for space cache inodesJosef Bacik
commit b6d08f0630d51ec09d67f16f6d7839699bbc0402 upstream. I've been testing our error paths and I was tripping the BUG_ON() in drop_outstanding_extent because our outstanding_extents is 0 for space cache inodes. This is because we don't reserve metadata space for these inodes since we depend on the global block reserve for our space. To fix this we need to make sure the DO_ACCOUNTING stuff doesn't actually call release_metadata for space cache inodes. With this patch I'm no longer panicing. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: don't leak block group on errorFilipe David Borba Manana
commit e84cc14213e2c81ae5a2da341a9da0d58a1dbfad upstream. In extent-tree.c:btrfs_write_dirty_block_groups(), if the call to write_one_cache_group() failed, we would return without putting the block group first. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: fix sync fs to actually wait for all data to be persistedFilipe David Borba Manana
commit 9b1998598625fb5b798e8291cafda1a8ec17c1bd upstream. Currently the fs sync function (super.c:btrfs_sync_fs()) doesn't wait for delayed work to finish before returning success to the caller. This change fixes this, ensuring that there's no data loss if a power failure happens right after fs sync returns success to the caller and before the next commit happens. Steps to reproduce the data loss issue: $ mkfs.btrfs -f /dev/sdb3 $ mount /dev/sdb3 /mnt/btrfs $ perl -e '$d = ("\x41" x 6001); open($f,">","/mnt/btrfs/foobar"); print $f $d; close($f);' && btrfs fi sync /mnt/btrfs Right after the btrfs fi sync command (a second or 2 for example), power off the machine and reboot it. The file will be empty, as it can be verified after mounting the filesystem and through btrfs-debug-tree: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 7 transid 7 size 0 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar checksum tree key (CSUM_TREE ROOT_ITEM 0) leaf 29429760 items 0 free space 3995 generation 7 owner 7 fs uuid 6192815c-af2a-4b75-b3db-a959ffb6166e chunk uuid b529c44b-938c-4d3d-910a-013b4700bcae uuid tree key (UUID_TREE ROOT_ITEM 0) After this patch, the data loss no longer happens after a power failure and btrfs-debug-tree shows: $ btrfs-debug-tree /dev/sdb3 | egrep '\(257 INODE_ITEM 0\) itemoff' -B 3 -A 8 item 3 key (256 DIR_INDEX 2) itemoff 3751 itemsize 36 location key (257 INODE_ITEM 0) type FILE namelen 6 datalen 0 name: foobar item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160 inode generation 6 transid 6 size 6001 block group 0 mode 100644 links 1 item 5 key (257 INODE_REF 256) itemoff 3575 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 3522 itemsize 53 extent data disk byte 12845056 nr 8192 extent data offset 0 nr 8192 ram 8192 extent compression 0 checksum tree key (CSUM_TREE ROOT_ITEM 0) Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-27Btrfs: fix tracking of orphan inode countFilipe David Borba Manana
commit 703c88e035242202e3ab48fcbbbe0a7bc62fb7bb upstream. In inode.c:btrfs_orphan_add() if we failed to insert the orphan item, we would return without decrementing the orphan count that we just incremented before attempting the insertion, leaving the orphan inode count wrong. In inode.c:btrfs_orphan_del(), we were decrementing the inode orphan count if the bit BTRFS_INODE_ORPHAN_META_RESERVED was set, which is logically wrong because it should be decremented if the bit BTRFS_INODE_HAS_ORPHAN_ITEM was set - after all we increment the count when we set the bit BTRFS_INODE_HAS_ORPHAN_ITEM elsewhere. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-06-20btrfs: fix defrag 32-bit integer overflowJustin Maggard
commit c41570c9d29764f797fa35490d72b7395a0105c3 upstream. When defragging a very large file, the cluster variable can wrap its 32-bit signed int type and become negative, which eventually gets passed to btrfs_force_ra() as a very large unsigned long value. On 32-bit platforms, this eventually results in an Oops from the SLAB allocator. Change the cluster and max_cluster signed int variables to unsigned long to match the readahead functions. This also allows the min() comparison in btrfs_defrag_file() to work as intended. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-05-05Btrfs: fix deadlock with nested trans handlesJosef Bacik
commit 3bbb24b20a8800158c33eca8564f432dd14d0bf3 upstream. Zach found this deadlock that would happen like this btrfs_end_transaction <- reduce trans->use_count to 0 btrfs_run_delayed_refs btrfs_cow_block find_free_extent btrfs_start_transaction <- increase trans->use_count to 1 allocate chunk btrfs_end_transaction <- decrease trans->use_count to 0 btrfs_run_delayed_refs lock tree block we are cowing above ^^ We need to only decrease trans->use_count if it is above 1, otherwise leave it alone. This will make nested trans be the only ones who decrease their added ref, and will let us get rid of the trans->use_count++ hack if we have to commit the transaction. Thanks, Reported-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Tested-by: Zach Brown <zab@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-05-05Btrfs: skip submitting barrier for missing deviceHidetoshi Seto
commit f88ba6a2a44ee98e8d59654463dc157bb6d13c43 upstream. I got an error on v3.13: BTRFS error (device sdf1) in write_all_supers:3378: errno=-5 IO failure (errors while submitting device barriers.) how to reproduce: > mkfs.btrfs -f -d raid1 /dev/sdf1 /dev/sdf2 > wipefs -a /dev/sdf2 > mount -o degraded /dev/sdf1 /mnt > btrfs balance start -f -sconvert=single -mconvert=single -dconvert=single /mnt The reason of the error is that barrier_all_devices() failed to submit barrier to the missing device. However it is clear that we cannot do anything on missing device, and also it is not necessary to care chunks on the missing device. This patch stops sending/waiting barrier if device is missing. Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-05fs: fix iversion handlingChristoph Hellwig
commit dff6efc326a4d5f305797d4a6bba14f374fdd633 upstream. Currently notify_change directly updates i_version for size updates, which not only is counter to how all other fields are updated through struct iattr, but also breaks XFS, which need inode updates to happen under its own lock, and synchronized to the structure that gets written to the log. Remove the update in the common code, and it to btrfs and ext4, XFS already does a proper updaste internally and currently gets a double update with the existing code. IMHO this is 3.13 and -stable material and should go in through the XFS tree. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Acked-by: Jan Kara <jack@suse.cz> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Ben Myers <bpm@sgi.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-01Btrfs: fix data corruption when reading/updating compressed extentsFilipe David Borba Manana
commit a2aa75e18a21b21952dc6daa9bac7c9f4426f81f upstream. When using a mix of compressed file extents and prealloc extents, it is possible to fill a page of a file with random, garbage data from some unrelated previous use of the page, instead of a sequence of zeroes. A simple sequence of steps to get into such case, taken from the test case I made for xfstests, is: _scratch_mkfs _scratch_mount "-o compress-force=lzo" $XFS_IO_PROG -f -c "pwrite -S 0x06 -b 18670 266978 18670" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "falloc 26450 665194" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "truncate 542872" $SCRATCH_MNT/foobar $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/foobar This results in the following file items in the fs tree: item 4 key (257 INODE_ITEM 0) itemoff 15879 itemsize 160 inode generation 6 transid 6 size 542872 block group 0 mode 100600 item 5 key (257 INODE_REF 256) itemoff 15863 itemsize 16 inode ref index 2 namelen 6 name: foobar item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53 extent data disk byte 0 nr 0 gen 6 extent data offset 0 nr 24576 ram 266240 extent compression 0 item 7 key (257 EXTENT_DATA 24576) itemoff 15757 itemsize 53 prealloc data disk byte 12849152 nr 241664 gen 6 prealloc data offset 0 nr 241664 item 8 key (257 EXTENT_DATA 266240) itemoff 15704 itemsize 53 extent data disk byte 12845056 nr 4096 gen 6 extent data offset 0 nr 20480 ram 20480 extent compression 2 item 9 key (257 EXTENT_DATA 286720) itemoff 15651 itemsize 53 prealloc data disk byte 13090816 nr 405504 gen 6 prealloc data offset 0 nr 258048 The on disk extent at offset 266240 (which corresponds to 1 single disk block), contains 5 compressed chunks of file data. Each of the first 4 compress 4096 bytes of file data, while the last one only compresses 3024 bytes of file data. Therefore a read into the file region [285648 ; 286720[ (length = 4096 - 3024 = 1072 bytes) should always return zeroes (our next extent is a prealloc one). The solution here is the compression code path to zero the remaining (untouched) bytes of the last page it uncompressed data into, as the information about how much space the file data consumes in the last page is not known in the upper layer fs/btrfs/extent_io.c:__do_readpage(). In __do_readpage we were correctly zeroing the remainder of the page but only if it corresponds to the last page of the inode and if the inode's size is not a multiple of the page size. This would cause not only returning random data on reads, but also permanently storing random data when updating parts of the region that should be zeroed. For the example above, it means updating a single byte in the region [285648 ; 286720[ would store that byte correctly but also store random data on disk. A test case for xfstests follows soon. Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-01Btrfs: fix tree mod loggingFilipe David Borba Manana
commit 5de865eebb8330eee19c37b31fb6f315a09d4273 upstream. While running the test btrfs/004 from xfstests in a loop, it failed about 1 time out of 20 runs in my desktop. The failure happened in the backref walking part of the test, and the test's error message was like this: btrfs/004 93s ... [failed, exit status 1] - output mismatch (see /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad) --- tests/btrfs/004.out 2013-11-26 18:25:29.263333714 +0000 +++ /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad 2013-12-10 15:25:10.327518516 +0000 @@ -1,3 +1,8 @@ QA output created by 004 *** test backref walking -*** done +unexpected output from + /home/fdmanana/git/hub/btrfs-progs/btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 +expected inum: 405, expected address: 454656, file: /home/fdmanana/btrfs-tests/scratch_1/snap1/p0/d6/d3d/d156/fce, got: + ... (Run 'diff -u tests/btrfs/004.out /home/fdmanana/git/hub/xfstests_2/results//btrfs/004.out.bad' to see the entire diff) Ran: btrfs/004 Failures: btrfs/004 Failed 1 of 1 tests But immediately after the test finished, the btrfs inspect-internal command returned the expected output: $ btrfs inspect-internal logical-resolve -P 141512704 /home/fdmanana/btrfs-tests/scratch_1 inode 405 offset 454656 root 258 inode 405 offset 454656 root 5 It turned out this was because the btrfs_search_old_slot() calls performed during backref walking (backref.c:__resolve_indirect_ref) were not finding anything. The reason for this turned out to be that the tree mod logging code was not logging some node multi-step operations atomically, therefore btrfs_search_old_slot() callers iterated often over an incomplete tree that wasn't fully consistent with any tree state from the past. Besides missing items, this often (but not always) resulted in -EIO errors during old slot searches, reported in dmesg like this: [ 4299.933936] ------------[ cut here ]------------ [ 4299.933949] WARNING: CPU: 0 PID: 23190 at fs/btrfs/ctree.c:1343 btrfs_search_old_slot+0x57b/0xab0 [btrfs]() [ 4299.933950] Modules linked in: btrfs raid6_pq xor pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) bnep rfcomm bluetooth parport_pc ppdev binfmt_misc joydev snd_hda_codec_h [ 4299.933977] CPU: 0 PID: 23190 Comm: btrfs Tainted: G W O 3.12.0-fdm-btrfs-next-16+ #70 [ 4299.933978] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z77 Pro4, BIOS P1.50 09/04/2012 [ 4299.933979] 000000000000053f ffff8806f3fd98f8 ffffffff8176d284 0000000000000007 [ 4299.933982] 0000000000000000 ffff8806f3fd9938 ffffffff8104a81c ffff880659c64b70 [ 4299.933984] ffff880659c643d0 ffff8806599233d8 ffff880701e2e938 0000160000000000 [ 4299.933987] Call Trace: [ 4299.933991] [<ffffffff8176d284>] dump_stack+0x55/0x76 [ 4299.933994] [<ffffffff8104a81c>] warn_slowpath_common+0x8c/0xc0 [ 4299.933997] [<ffffffff8104a86a>] warn_slowpath_null+0x1a/0x20 [ 4299.934003] [<ffffffffa065d3bb>] btrfs_search_old_slot+0x57b/0xab0 [btrfs] [ 4299.934005] [<ffffffff81775f3b>] ? _raw_read_unlock+0x2b/0x50 [ 4299.934010] [<ffffffffa0655001>] ? __tree_mod_log_search+0x81/0xc0 [btrfs] [ 4299.934019] [<ffffffffa06dd9b0>] __resolve_indirect_refs+0x130/0x5f0 [btrfs] [ 4299.934027] [<ffffffffa06a21f1>] ? free_extent_buffer+0x61/0xc0 [btrfs] [ 4299.934034] [<ffffffffa06de39c>] find_parent_nodes+0x1fc/0xe40 [btrfs] [ 4299.934042] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934048] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934056] [<ffffffffa06df980>] iterate_extent_inodes+0xe0/0x250 [btrfs] [ 4299.934058] [<ffffffff817762db>] ? _raw_spin_unlock+0x2b/0x50 [ 4299.934065] [<ffffffffa06dfb82>] iterate_inodes_from_logical+0x92/0xb0 [btrfs] [ 4299.934071] [<ffffffffa06b13e0>] ? defrag_lookup_extent+0xe0/0xe0 [btrfs] [ 4299.934078] [<ffffffffa06b7015>] btrfs_ioctl+0xf65/0x1f60 [btrfs] [ 4299.934080] [<ffffffff811658b8>] ? handle_mm_fault+0x278/0xb00 [ 4299.934083] [<ffffffff81075563>] ? up_read+0x23/0x40 [ 4299.934085] [<ffffffff8177a41c>] ? __do_page_fault+0x20c/0x5a0 [ 4299.934088] [<ffffffff811b2946>] do_vfs_ioctl+0x96/0x570 [ 4299.934090] [<ffffffff81776e23>] ? error_sti+0x5/0x6 [ 4299.934093] [<ffffffff810b71e8>] ? trace_hardirqs_off_caller+0x28/0xd0 [ 4299.934096] [<ffffffff81776a09>] ? retint_swapgs+0xe/0x13 [ 4299.934098] [<ffffffff811b2eb1>] SyS_ioctl+0x91/0xb0 [ 4299.934100] [<ffffffff813eecde>] ? trace_hardirqs_on_thunk+0x3a/0x3f [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934102] [<ffffffff8177ef12>] system_call_fastpath+0x16/0x1b [ 4299.934104] ---[ end trace 48f0cfc902491414 ]--- [ 4299.934378] btrfs bad fsid on block 0 These tree mod log operations that must be performed atomically, tree_mod_log_free_eb, tree_mod_log_eb_copy, tree_mod_log_insert_root and tree_mod_log_insert_move, used to be performed atomically before the following commit: c8cc6341653721b54760480b0d0d9b5f09b46741 (Btrfs: stop using GFP_ATOMIC for the tree mod log allocations) That change removed the atomicity of such operations. This patch restores the atomicity while still not doing the GFP_ATOMIC allocations of tree_mod_elem structures, so it has to do the allocations using GFP_NOFS before acquiring the mod log lock. This issue has been experienced by several users recently, such as for example: http://www.spinics.net/lists/linux-btrfs/msg28574.html After running the btrfs/004 test for 679 consecutive iterations with this patch applied, I didn't ran into the issue anymore. Cc: stable@vger.kernel.org Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-03-01Btrfs: return immediately if tree log mod is not necessaryFilipe David Borba Manana
commit 783577663507411e36e459390ef056556e93ef29 upstream. In ctree.c:tree_mod_log_set_node_key() we were calling __tree_mod_log_insert_key() even when the modification doesn't need to be logged. This would allocate a tree_mod_elem structure, fill it and pass it to __tree_mod_log_insert(), which would just acquire the tree mod log write lock and then free the tree_mod_elem structure and return (that is, a no-op). Therefore call tree_mod_log_insert() instead of __tree_mod_log_insert() which just returns immediately if the modification doesn't need to be logged (without allocating the structure, fill it, acquire write lock, free structure). Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
2014-02-20Btrfs: disable snapshot aware defrag for nowJosef Bacik
commit 8101c8dbf6243ba517aab58d69bf1bc37d8b7b9c upstream. It's just broken and it's taking a lot of effort to fix it, so for now just disable it so people can defrag in peace. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06btrfs: restrict snapshotting to own subvolumesDavid Sterba
commit d024206133ce21936b3d5780359afc00247655b7 upstream. Currently, any user can snapshot any subvolume if the path is accessible and thus indirectly create and keep files he does not own under his direcotries. This is not possible with traditional directories. In security context, a user can snapshot root filesystem and pin any potentially buggy binaries, even if the updates are applied. All the snapshots are visible to the administrator, so it's possible to verify if there are suspicious snapshots. Another more practical problem is that any user can pin the space used by eg. root and cause ENOSPC. Original report: https://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/484786 Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-06Btrfs: handle EAGAIN case properly in btrfs_drop_snapshot()Wang Shilong
commit 90515e7f5d7d24cbb2a4038a3f1b5cfa2921aa17 upstream. We may return early in btrfs_drop_snapshot(), we shouldn't call btrfs_std_err() for this case, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: fix lockdep error in async commitLiu Bo
commit b1a06a4b574996692b72b742bf6e6aa0c711a948 upstream. Lockdep complains about btrfs's async commit: [ 2372.462171] [ BUG: bad unlock balance detected! ] [ 2372.462191] 3.12.0+ #32 Tainted: G W [ 2372.462209] ------------------------------------- [ 2372.462228] ceph-osd/14048 is trying to release lock (sb_internal) at: [ 2372.462275] [<ffffffffa022cb10>] btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462305] but there are no more locks to release! [ 2372.462324] [ 2372.462324] other info that might help us debug this: [ 2372.462349] no locks held by ceph-osd/14048. [ 2372.462367] [ 2372.462367] stack backtrace: [ 2372.462386] CPU: 2 PID: 14048 Comm: ceph-osd Tainted: G W 3.12.0+ #32 [ 2372.462414] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080015 11/09/2011 [ 2372.462455] ffffffffa022cb10 ffff88007490fd28 ffffffff816f094a ffff8800378aa320 [ 2372.462491] ffff88007490fd50 ffffffff810adf4c ffff8800378aa320 ffff88009af97650 [ 2372.462526] ffffffffa022cb10 ffff88007490fd88 ffffffff810b01ee ffff8800898c0000 [ 2372.462562] Call Trace: [ 2372.462584] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462619] [<ffffffff816f094a>] dump_stack+0x45/0x56 [ 2372.462642] [<ffffffff810adf4c>] print_unlock_imbalance_bug+0xec/0x100 [ 2372.462677] [<ffffffffa022cb10>] ? btrfs_commit_transaction_async+0x1b0/0x2a0 [btrfs] [ 2372.462710] [<ffffffff810b01ee>] lock_release+0x18e/0x210 [ 2372.462742] [<ffffffffa022cb36>] btrfs_commit_transaction_async+0x1d6/0x2a0 [btrfs] [ 2372.462783] [<ffffffffa025a7ce>] btrfs_ioctl_start_sync+0x3e/0xc0 [btrfs] [ 2372.462822] [<ffffffffa025f1d3>] btrfs_ioctl+0x4c3/0x1f70 [btrfs] [ 2372.462849] [<ffffffff812c0321>] ? avc_has_perm+0x121/0x1b0 [ 2372.462873] [<ffffffff812c0224>] ? avc_has_perm+0x24/0x1b0 [ 2372.462897] [<ffffffff8107ecc8>] ? sched_clock_cpu+0xa8/0x100 [ 2372.462922] [<ffffffff8117b145>] do_vfs_ioctl+0x2e5/0x4e0 [ 2372.462946] [<ffffffff812c19e6>] ? file_has_perm+0x86/0xa0 [ 2372.462969] [<ffffffff8117b3c1>] SyS_ioctl+0x81/0xa0 [ 2372.462991] [<ffffffff817045a4>] tracesys+0xdd/0xe2 ==================================================== It's because that we don't do the right thing when checking if it's ok to tell lockdep that we're trying to release the rwsem. If the trans handle's type is TRANS_ATTACH, we won't acquire the freeze rwsem, but as TRANS_ATTACH fits the check (trans < TRANS_JOIN_NOLOCK), we'll release the freeze rwsem, which makes lockdep complains a lot. Reported-by: Ma Jianpeng <majianpeng@gmail.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: fix a crash when running balance and defrag concurrentlyLiu Bo
commit 48ec47364b6d493f0a9cdc116977bf3f34e5c3ec upstream. Running balance and defrag concurrently can end up with a crash: kernel BUG at fs/btrfs/relocation.c:4528! RIP: 0010:[<ffffffffa01ac33b>] [<ffffffffa01ac33b>] btrfs_reloc_cow_block+ 0x1eb/0x230 [btrfs] Call Trace: [<ffffffffa01398c1>] ? update_ref_for_cow+0x241/0x380 [btrfs] [<ffffffffa0180bad>] ? copy_extent_buffer+0xad/0x110 [btrfs] [<ffffffffa0139da1>] __btrfs_cow_block+0x3a1/0x520 [btrfs] [<ffffffffa013a0b6>] btrfs_cow_block+0x116/0x1b0 [btrfs] [<ffffffffa013ddad>] btrfs_search_slot+0x43d/0x970 [btrfs] [<ffffffffa0153c57>] btrfs_lookup_file_extent+0x37/0x40 [btrfs] [<ffffffffa0172a5e>] __btrfs_drop_extents+0x11e/0xae0 [btrfs] [<ffffffffa013b3fd>] ? generic_bin_search.constprop.39+0x8d/0x1a0 [btrfs] [<ffffffff8117d14a>] ? kmem_cache_alloc+0x1da/0x200 [<ffffffffa0138e7a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs] [<ffffffffa0173ef0>] btrfs_drop_extents+0x60/0x90 [btrfs] [<ffffffffa016b24d>] relink_extent_backref+0x2ed/0x780 [btrfs] [<ffffffffa0162fe0>] ? btrfs_submit_bio_hook+0x1e0/0x1e0 [btrfs] [<ffffffffa01b8ed7>] ? iterate_inodes_from_logical+0x87/0xa0 [btrfs] [<ffffffffa016b909>] btrfs_finish_ordered_io+0x229/0xac0 [btrfs] [<ffffffffa016c3b5>] finish_ordered_fn+0x15/0x20 [btrfs] [<ffffffffa018cbe5>] worker_loop+0x125/0x4e0 [btrfs] [<ffffffffa018cac0>] ? btrfs_queue_worker+0x300/0x300 [btrfs] [<ffffffff81075ea0>] kthread+0xc0/0xd0 [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40 [<ffffffff8164796c>] ret_from_fork+0x7c/0xb0 [<ffffffff81075de0>] ? insert_kthread_work+0x40/0x40 ---------------------------------------------------------------------- It turns out to be that balance operation will bump root's @last_snapshot, which enables snapshot-aware defrag path, and backref walking stuff will find data reloc tree as refs' parent, and hit the BUG_ON() during COW. As data reloc tree's data is just for relocation purpose, and will be deleted right after relocation is done, it's unnecessary to walk those refs belonged to data reloc tree, it'd be better to skip them. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: do not run snapshot-aware defragment on errorLiu Bo
commit 6f519564d7d978c00351d9ab6abac3deeac31621 upstream. If something wrong happens in write endio, running snapshot-aware defragment can end up with undefined results, maybe a crash, so we should avoid it. In order to share similar code, this also adds a helper to free the struct for snapshot-aware defrag. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: take ordered root lock when removing ordered operations inodeJosef Bacik
commit 93858769172c4e3678917810e9d5de360eb991cc upstream. A user reported a list corruption warning from btrfs_remove_ordered_extent, it is because we aren't taking the ordered_root_lock when we remove the inode from the ordered operations list. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: stop using vfs_read in sendJosef Bacik
commit ed2590953bd06b892f0411fc94e19175d32f197a upstream. Apparently we don't actually close the files until we return to userspace, so stop using vfs_read in send. This is actually better for us since we can avoid all the extra logic of holding the file we're sending open and making sure to clean it up. This will fix people who have been hitting too many files open errors when trying to send. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: fix incorrect inode acl resetFilipe David Borba Manana
commit 8185554d3eb09d23a805456b6fa98dcbb34aa518 upstream. When a directory has a default ACL and a subdirectory is created under that directory, btrfs_init_acl() is called when the subdirectory's inode is created to initialize the inode's ACL (inherited from the parent directory) but it was clearing the ACL from the inode after setting it if posix_acl_create() returned success, instead of clearing it only if it returned an error. To reproduce this issue: $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt $ mkdir /mnt/acl $ setfacl -d --set u::rwx,g::rwx,o::- /mnt/acl $ getfacl /mnt/acl user::rwx group::rwx other::r-x default:user::rwx default:group::rwx default:other::--- $ mkdir /mnt/acl/dir1 $ getfacl /mnt/acl/dir1 user::rwx group::rwx other::--- After unmounting and mounting again the filesystem, fgetacl returned the expected ACL: $ umount /mnt/acl $ mount /dev/loop0 /mnt $ getfacl /mnt/acl/dir1 user::rwx group::rwx other::--- default:user::rwx default:group::rwx default:other::--- Meaning that the underlying xattr was persisted. Reported-by: Giuseppe Fierro <giuseppe@fierro.org> Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: fix hole check in log_one_extentJosef Bacik
commit ed9e8af88e2551aaa6bf51d8063a2493e2d71597 upstream. I added an assert to make sure we were looking up aligned offsets for csums and I tripped it when running xfstests. This is because log_one_extent was checking if block_start == 0 for a hole instead of EXTENT_MAP_HOLE. This worked out fine in practice it seems, but it adds a lot of extra work that is uneeded. With this fix I'm no longer tripping my assert. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: fix memory leak of chunks' extent mapLiu Bo
commit 7d3d1744f8a7d62e4875bd69cc2192a939813880 upstream. As we're hold a ref on looking up the extent map, we need to drop the ref before returning to callers. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: reset intwrite on transaction abortJosef Bacik
commit e0228285a8cad70e4b7b4833cc650e36ecd8de89 upstream. If we abort a transaction in the middle of a commit we weren't undoing the intwrite locking. This patch fixes that problem. Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: do a full search everytime in btrfs_search_old_slotJosef Bacik
commit d4b4087c43cc00a196c5be57fac41f41309f1d56 upstream. While running some snashot aware defrag tests I noticed I was panicing every once and a while in key_search. This is because of the optimization that says if we find a key at slot 0 it will be at slot 0 all the way down the rest of the tree. This isn't the case for btrfs_search_old_slot since it will likely replay changes to a buffer if something has changed since we took our sequence number. So short circuit this optimization by setting prev_cmp to -1 every time we call key_search so we will do our normal binary search. With this patch I am no longer seeing the panics I was seeing before. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20btrfs: call mnt_drop_write after interrupted subvol deletionDavid Sterba
commit e43f998e47bae27e37e159915625e8d4b130153b upstream. If btrfs_ioctl_snap_destroy blocks on the mutex and the process is killed, mnt_write count is unbalanced and leads to unmountable filesystem. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-20Btrfs: fix access_ok() check in btrfs_ioctl_send()Dan Carpenter
commit 700ff4f095d78af0998953e922e041d75254518b upstream. The closing parenthesis is in the wrong place. We want to check "sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of "sizeof(*arg->clone_sources * arg->clone_sources_count)". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>