summaryrefslogtreecommitdiff
path: root/fs/btrfs/extent_map.c
AgeCommit message (Collapse)Author
2023-03-06btrfs: fix extent map logging bit not cleared for split maps after dropping ↵Filipe Manana
range At btrfs_drop_extent_map_range() we are clearing the EXTENT_FLAG_LOGGING bit on a 'flags' variable that was not initialized. This makes static checkers complain about it, so initialize the 'flags' variable before clearing the bit. In practice this has no consequences, because EXTENT_FLAG_LOGGING should not be set when btrfs_drop_extent_map_range() is called, as an fsync locks the inode in exclusive mode, locks the inode's mmap semaphore in exclusive mode too and it always flushes all delalloc. Also add a comment about why we clear EXTENT_FLAG_LOGGING on a copy of the flags of the split extent map. Reported-by: Dan Carpenter <error27@gmail.com> Link: https://lore.kernel.org/linux-btrfs/Y%2FyipSVozUDEZKow@kili/ Fixes: db21370bffbc ("btrfs: drop extent map range more efficiently") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: remove no longer used btrfs_next_extent_map()Filipe Manana
There are no more users of btrfs_next_extent_map(), the previous patch in the series ("btrfs: search for delalloc more efficiently during lseek/fiemap") removed the last usage of the function, so delete it. This change is part of a patchset that has the goal to make performance better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to iterate over the extents of a file. Two examples are the cp program from coreutils 9.0+ and the tar program (when using its --sparse / -S option). A sample test and results are listed in the changelog of the last patch in the series: 1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree 2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap 3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap 4/9 btrfs: search for delalloc more efficiently during lseek/fiemap 5/9 btrfs: remove no longer used btrfs_next_extent_map() 6/9 btrfs: allow passing a cached state record to count_range_bits() 7/9 btrfs: update stale comment for count_range_bits() 8/9 btrfs: use cached state when looking for delalloc ranges with fiemap 9/9 btrfs: use cached state when looking for delalloc ranges with lseek Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/ Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: selftests: remove impossible inline extent at non-zero file offsetQu Wenruo
In our inode-tests.c, we create an inline offset at file offset 5, which is no longer possible since the introduction of tree-checker. Thus I don't think we should spend time maintaining some corner cases which are already ruled out by tree-checker. So this patch will: - Change the inline extent to start at file offset 0 Also change its length to 6 to cover the original length - Add an extra ASSERT() for btrfs_add_extent_mapping() This is to make sure tree-checker is working correctly. - Update the inode selftest Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: update function commentsDavid Sterba
Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: move the printk helpers out of ctree.hJosef Bacik
We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-12-05btrfs: get the next extent map during fiemap/lseek more efficientlyFilipe Manana
At find_delalloc_subrange(), when we need to get the next extent map, we do a full search on the extent map tree (a red black tree). This is fine but it's a lot more efficient to simply use rb_next(), which typically requires iterating over less nodes of the tree and never needs to compare the ranges of nodes with the one we are looking for. So add a public helper to extent_map.{h,c} to get the extent map that immediately follows another extent map, using rb_next(), and use that helper at find_delalloc_subrange(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: drop extent map range more efficientlyFilipe Manana
Currently when dropping extent maps for a file range, through btrfs_drop_extent_map_range(), we do the following non-optimal things: 1) We lookup for extent maps one by one, always starting the search from the root of the extent map tree. This is not efficient if we have multiple extent maps in the range; 2) We check on every iteration if we have the 'split' and 'split2' spare extent maps in case we need to split an extent map that intersects our range but also crosses its boundaries (to the left, to the right or both cases). If our target range is for example: [2M, 8M) And we have 3 extents maps in the range: [1M, 3M) [3M, 6M) [6M, 10M[ The on the first iteration we allocate two extent maps for 'split' and 'split2', and use the 'split' to split the first extent map, so after the split we set 'split' to 'split2' and then set 'split2' to NULL. On the second iteration, we don't need to split the second extent map, but because 'split2' is now NULL, we allocate a new extent map for 'split2'. On the third iteration we need to split the third extent map, so we use the extent map pointed by 'split'. So we ended up allocating 3 extent maps for splitting, but all we needed was 2 extent maps. We never need to allocate more than 2, because extent maps that need to be split are always the first one and the last one in the target range. Improve on this by: 1) Using rb_next() to move on to the next extent map. This results in iterating over less nodes of the tree and it does not require comparing the ranges of nodes to our start/end offset; 2) Allocate the 2 extent maps for splitting before entering the loop and never allocate more than 2. In practice it's very rare to have the combination of both extent map allocations fail, since we have a dedicated slab for extent maps, and also have the need to split two extent maps. This patch is part of a patchset comprised of the following patches: btrfs: fix missed extent on fsync after dropping extent maps btrfs: move btrfs_drop_extent_cache() to extent_map.c btrfs: use extent_map_end() at btrfs_drop_extent_map_range() btrfs: use cond_resched_rwlock_write() during inode eviction btrfs: move open coded extent map tree deletion out of inode eviction btrfs: add helper to replace extent map range with a new extent map btrfs: remove the refcount warning/check at free_extent_map() btrfs: remove unnecessary extent map initializations btrfs: assert tree is locked when clearing extent map from logging btrfs: remove unnecessary NULL pointer checks when searching extent maps btrfs: remove unnecessary next extent map search btrfs: avoid pointless extent map tree search when flushing delalloc btrfs: drop extent map range more efficiently And the following fio test was done before and after applying the whole patchset, on a non-debug kernel (Debian's default kernel config) on a 12 cores Intel box with 64G of ram: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=8 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=psync filesize=2G runtime=300 time_based directory=$MNT numjobs=8 thread EOF echo performance | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT Result before applying the patchset: WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec Result after applying the patchset: WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: remove unnecessary next extent map searchFilipe Manana
At __tree_search(), and its single caller __lookup_extent_mapping(), there is no point in finding the next extent map that starts after the search offset if we were able to find the previous extent map that ends before our search offset, because __lookup_extent_mapping() ignores the next acceptable extent map if we were able to find the previous one. So just return immediately if we were able to find the previous extent map, therefore avoiding wasting time iterating the tree looking for the next extent map which will not be used by __lookup_extent_mapping(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: remove unnecessary NULL pointer checks when searching extent mapsFilipe Manana
The previous and next pointer arguments passed to __tree_search() are never NULL as the only caller of this function, __lookup_extent_mapping(), always passes the address of two on stack pointers. So remove the NULL checks and add assertions to verify the pointers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: assert tree is locked when clearing extent map from loggingFilipe Manana
When calling clear_em_logging() we should have a write lock on the extent map tree, as we will try to merge the extent map with the previous and next ones in the tree. So assert that we have a write lock. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: remove unnecessary extent map initializationsFilipe Manana
When allocating an extent map, we use kmem_cache_zalloc() which guarantees the returned memory is initialized to zeroes, therefore it's pointless to initialize the generation and flags of the extent map to zero again. Remove those initializations, as they are pointless and slightly increase the object text size. Before removing them: $ size fs/btrfs/extent_map.o text data bss dec hex filename 9241 274 24 9539 2543 fs/btrfs/extent_map.o After removing them: $ size fs/btrfs/extent_map.o text data bss dec hex filename 9209 274 24 9507 2523 fs/btrfs/extent_map.o Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: remove the refcount warning/check at free_extent_map()Filipe Manana
At free_extent_map(), it's pointless to have a WARN_ON() to check if the refcount of the extent map is zero. Such check is already done by the refcount_t module and refcount_dec_and_test(), which loudly complains if we try to decrement a reference count that is currently 0. The WARN_ON() dates back to the time when used a regular atomic_t type for the reference counter, before we switched to the refcount_t type. The main goal of the refcount_t type/module is precisely to catch such types of bugs and loudly complain if they happen. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: add helper to replace extent map range with a new extent mapFilipe Manana
We have several places that need to drop all the extent maps in a given file range and then add a new extent map for that range. Currently they call btrfs_drop_extent_map_range() to delete all extent maps in the range and then keep trying to add the new extent map in a loop that keeps retrying while the insertion of the new extent map fails with -EEXIST. So instead of repeating this logic, add a helper to extent_map.c that does these steps and name it btrfs_replace_extent_map_range(). Also add a comment about why the retry loop is necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: move open coded extent map tree deletion out of inode evictionFilipe Manana
Move the loop that removes all the extent maps from the inode's extent map tree during inode eviction out of inode.c and into extent_map.c, to btrfs_drop_extent_map_range(). Anything manipulating extent maps or the extent map tree should be in extent_map.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: use extent_map_end() at btrfs_drop_extent_map_range()Filipe Manana
Instead of open coding the end offset calculation of an extent map, use the helper extent_map_end() and cache its result in a local variable, since it's used several times. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-29btrfs: move btrfs_drop_extent_cache() to extent_map.cFilipe Manana
The function btrfs_drop_extent_cache() doesn't really belong at file.c because what it does is drop a range of extent maps for a file range. It directly allocates and manipulates extent maps, by dropping, splitting and replacing them in an extent map tree, so it should be located at extent_map.c, where all manipulations of an extent map tree and its extent maps are supposed to be done. So move it out of file.c and into extent_map.c. Additionally do the following changes: 1) Rename it into btrfs_drop_extent_map_range(), as this makes it more clear about what it does. The term "cache" is a bit confusing as it's not widely used, "extent maps" or "extent mapping" is much more common; 2) Change its 'skip_pinned' argument from int to bool; 3) Turn several of its local variables from int to bool, since they are used as booleans; 4) Move the declaration of some variables out of the function's main scope and into the scopes where they are used; 5) Remove pointless assignment of false to 'modified' early in the while loop, as later that variable is set and it's not used before that second assignment; 6) Remove checks for NULL before calling free_extent_map(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: replace delete argument with EXTENT_CLEAR_ALL_BITSJosef Bacik
Instead of taking up a whole argument to indicate we're clearing everything in a range, simply add another EXTENT bit to control this, and then update all the callers to drop this argument from the clear_extent_bit variants. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26btrfs: remove the wake argument from clear_extent_bitsJosef Bacik
This is only used in the case that we are clearing EXTENT_LOCKED, so infer this value from the bits passed in instead of taking it as an argument. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14btrfs: assert we have a write lock when removing and replacing extent mapsFilipe Manana
Removing or replacing an extent map requires holding a write lock on the extent map's tree. We currently do that everywhere, except in one of the self tests, where it's harmless since there's no concurrency. In order to catch possible races in the future, assert that we are holding a write lock on the extent map tree before removing or replacing an extent map in the tree, and update the self test to obtain a write lock before removing extent maps. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2022-02-23btrfs: defrag: don't use merged extent map for their generation checkQu Wenruo
For extent maps, if they are not compressed extents and are adjacent by logical addresses and file offsets, they can be merged into one larger extent map. Such merged extent map will have the higher generation of all the original ones. But this brings a problem for autodefrag, as it relies on accurate extent_map::generation to determine if one extent should be defragged. For merged extent maps, their higher generation can mark some older extents to be defragged while the original extent map doesn't meet the minimal generation threshold. Thus this will cause extra IO. So solve the problem, here we introduce a new flag, EXTENT_FLAG_MERGED, to indicate if the extent map is merged from one or more ems. And for autodefrag, if we find a merged extent map, and its generation meets the generation requirement, we just don't use this one, and go back to defrag_get_extent() to read extent maps from subvolume trees. This could cause more read IO, but should result less defrag data write, so in the long run it should be a win for autodefrag. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-10-26btrfs: rename btrfs_bio to btrfs_io_contextQu Wenruo
The structure btrfs_bio is used by two different sites: - bio->bi_private for mirror based profiles For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records how many mirrors are still pending, and save the original endio function of the bio. - RAID56 code In that case, RAID56 only utilize the stripes info, and no long uses that to trace the pending mirrors. So btrfs_bio is not always bind to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: fix parameter description of btrfs_add_extent_mappingNikolay Borisov
This fixes the following compiler warnings: fs/btrfs/extent_map.c:601: warning: Function parameter or member 'fs_info' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_tree' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_in' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'start' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'len' not described in 'btrfs_add_extent_mapping' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2021-02-08btrfs: document modified parameter of add_extent_mappingNikolay Borisov
Fixes fs/btrfs/extent_map.c:399: warning: Function parameter or member 'modified' not described in 'add_extent_mapping' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-12Btrfs: fix race between using extent maps and merging themFilipe Manana
We have a few cases where we allow an extent map that is in an extent map tree to be merged with other extents in the tree. Such cases include the unpinning of an extent after the respective ordered extent completed or after logging an extent during a fast fsync. This can lead to subtle and dangerous problems because when doing the merge some other task might be using the same extent map and as consequence see an inconsistent state of the extent map - for example sees the new length but has seen the old start offset. With luck this triggers a BUG_ON(), and not some silent bug, such as the following one in __do_readpage(): $ cat -n fs/btrfs/extent_io.c 3061 static int __do_readpage(struct extent_io_tree *tree, 3062 struct page *page, (...) 3127 em = __get_extent_map(inode, page, pg_offset, cur, 3128 end - cur + 1, get_extent, em_cached); 3129 if (IS_ERR_OR_NULL(em)) { 3130 SetPageError(page); 3131 unlock_extent(tree, cur, end); 3132 break; 3133 } 3134 extent_offset = cur - em->start; 3135 BUG_ON(extent_map_end(em) <= cur); (...) Consider the following example scenario, where we end up hitting the BUG_ON() in __do_readpage(). We have an inode with a size of 8KiB and 2 extent maps: extent A: file offset 0, length 4KiB, disk_bytenr = X, persisted on disk by a previous transaction extent B: file offset 4KiB, length 4KiB, disk_bytenr = X + 4KiB, not yet persisted but writeback started for it already. The extent map is pinned since there's writeback and an ordered extent in progress, so it can not be merged with extent map A yet The following sequence of steps leads to the BUG_ON(): 1) The ordered extent for extent B completes, the respective page gets its writeback bit cleared and the extent map is unpinned, at that point it is not yet merged with extent map A because it's in the list of modified extents; 2) Due to memory pressure, or some other reason, the MM subsystem releases the page corresponding to extent B - btrfs_releasepage() is called and returns 1, meaning the page can be released as it's not dirty, not under writeback anymore and the extent range is not locked in the inode's iotree. However the extent map is not released, either because we are not in a context that allows memory allocations to block or because the inode's size is smaller than 16MiB - in this case our inode has a size of 8KiB; 3) Task B needs to read extent B and ends up __do_readpage() through the btrfs_readpage() callback. At __do_readpage() it gets a reference to extent map B; 4) Task A, doing a fast fsync, calls clear_em_loggin() against extent map B while holding the write lock on the inode's extent map tree - this results in try_merge_map() being called and since it's possible to merge extent map B with extent map A now (the extent map B was removed from the list of modified extents), the merging begins - it sets extent map B's start offset to 0 (was 4KiB), but before it increments the map's length to 8KiB (4kb + 4KiB), task A is at: BUG_ON(extent_map_end(em) <= cur); The call to extent_map_end() sees the extent map has a start of 0 and a length still at 4KiB, so it returns 4KiB and 'cur' is 4KiB, so the BUG_ON() is triggered. So it's dangerous to modify an extent map that is in the tree, because some other task might have got a reference to it before and still using it, and needs to see a consistent map while using it. Generally this is very rare since most paths that lookup and use extent maps also have the file range locked in the inode's iotree. The fsync path is pretty much the only exception where we don't do it to avoid serialization with concurrent reads. Fix this by not allowing an extent map do be merged if if it's being used by tasks other then the one attempting to merge the extent map (when the reference count of the extent map is greater than 2). Reported-by: ryusuke1925 <st13s20@gm.ibaraki-ct.ac.jp> Reported-by: Koki Mitani <koki.mitani.xg@hco.ntt.co.jp> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206211 CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: remove extent_map::bdevDavid Sterba
We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map. Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: assert extent_map bdevs and lookup_map and splitDavid Sterba
This is a preparatory patch for removing extent_map::bdev. There's some history behind the code so this is only precaution to catch if things break before the actual removal happens. Logically, comparing a raw low-level block device (bdev) does not make sense for extent maps (high-level objects). This had no effect in practice but was quite confusing in the code. The lookup_map is set iff EXTENT_FLAG_FS_MAPPING is set. The two pointers were stored in the same bytes and used potentially in two meanings. Now they're split, so the asserts are in place to check that the condition will not change. The lookup map pointer misused bdev, this has been changed in commit 95617d69326c ("btrfs: cleanup, stop casting for extent_map->lookup everywhere") to the explicit type. But the semantics hasn't changed and bdev was not actually used to decide if maps are mergeable. Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: assert extent map tree lock in add_extent_mappingDavid Sterba
As add_extent_mapping is called from several functions, let's add the lock annotation. The tree is going to be modified so it must be the exclusive lock. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Optimize unallocated chunks discardNikolay Borisov
Currently unallocated chunks are always trimmed. For example 2 consecutive trims on large storage would trim freespace twice irrespective of whether the space was actually allocated or not between those trims. Optimise this behavior by exploiting the newly introduced alloc_state tree of btrfs_device. A new CHUNK_TRIMMED bit is used to mark those unallocated chunks which have been trimmed and have not been allocated afterwards. On chunk allocation the respective underlying devices' physical space will have its CHUNK_TRIMMED flag cleared. This avoids submitting discards for space which hasn't been changed since the last time discard was issued. This applies to the single mount period of the filesystem as the information is not stored permanently. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: replace pending/pinned chunks lists with io treeJeff Mahoney
The pending chunks list contains chunks that are allocated in the current transaction but haven't been created yet. The pinned chunks list contains chunks that are being released in the current transaction. Both describe chunks that are not reflected on disk as in use but are unavailable just the same. The pending chunks list is anchored by the transaction handle, which means that we need to hold a reference to a transaction when working with the list. The way we use them is by iterating over both lists to perform comparisons on the stripes they describe for each device. This is backwards and requires that we keep a transaction handle open while we're trimming. This patchset adds an extent_io_tree to btrfs_device that maintains the allocation state of the device. Extents are set dirty when chunks are first allocated -- when the extent maps are added to the mapping tree. They're cleared when last removed -- when the extent maps are removed from the mapping tree. This matches the lifespan of the pending and pinned chunks list and allows us to do trims on unallocated space safely without pinning the transaction for what may be a lengthy operation. We can also use this io tree to mark which chunks have already been trimmed so we don't repeat the operation. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-02-25btrfs: Remove impossible condition from mergable_mapsNikolay Borisov
We can never have extents marked as EXTENT_MAP_DELALLOC since this value is only ever used by btrfs_get_extent_fiemap. In this case the extent map is created by btrfs_get_extent_fiemap and is never really published, this flag is used to return the corresponding userspace one. Considering this, it's pointless having a check for EXTENT_MAP_DELALLOC in mergable_maps. Just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-12-17btrfs: Fix typos in comments and stringsAndrea Gelmini
The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15Btrfs: extent_map: use rb_first_cachedLiu Bo
rb_first_cached() trades an extra pointer "leftmost" for doing the same job as rb_first() but in O(1). As evict_inode_truncate_pages() removes all extent mapping by always looking for the first rb entry, it's helpful to use rb_first_cached instead. For more details about the optimization see patch "Btrfs: delayed-refs: use rb_first_cached for href_root". Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-10-15btrfs: change remove_extent_mapping to return voidzhong jiang
remove_extent_mapping uses the variable "ret" for return value, but it is not modified after initialzation. Further, I find that any of the callers do not handle the return value and the callees are only simple functions so the return values does not need to be passed. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-05-28btrfs: use fs_info for btrfs_handle_em_exist tracepointDavid Sterba
We really want to know to which filesystem the extent map events belong, but as it cannot be reached from the extent_map pointers, we need to pass it down the callchain. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-04-12btrfs: replace GPL boilerplate by SPDX -- sourcesDavid Sterba
Remove GPL boilerplate text (long, short, one-line) and keep the rest, ie. personal, company or original source copyright statements. Add the SPDX header. Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: add more __cold annotationsDavid Sterba
The __cold functions are placed to a special section, as they're expected to be called rarely. This could help i-cache prefetches or help compiler to decide which branches are more/less likely to be taken without any other annotations needed. Though we can't add more __exit annotations, it's still possible to add __cold (that's also added with __exit). That way the following function categories are tagged: - printf wrappers, error messages - exit helpers Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26Btrfs: add tracepoint for em's EEXIST caseLiu Bo
This is adding a tracepoint 'btrfs_handle_em_exist' to help debug the subtle bugs around merge_extent_mapping. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-03-26btrfs: remove unused hardirq.hYang Shi
Preempt counter APIs have been split out, currently, hardirq.h just includes irq_enter/exit APIs which are not used by btrfs at all. So, remove the unused hardirq.h. Signed-off-by: Yang Shi <yang.s@alibaba-inc.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22Btrfs: noinline merge_extent_mappingLiu Bo
In order to debug subtle bugs around merge_extent_mapping(), perf probe can be used to check the arguments, but sometimes merge_extent_mapping() got inlined by compiler and couldn't be probed. This is adding noinline attribute to merge_extent_mapping(). Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22Btrfs: add WARN_ONCE to detect unexpected error from merge_extent_mappingLiu Bo
This is a subtle case, so in order to understand the problem, it'd be good to know the content of existing and em when any error occurs. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2018-01-22Btrfs: move extent map specific code to extent_map.cLiu Bo
These helpers are extent map specific, move them to extent_map.c. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
2017-11-02License cleanup: add SPDX GPL-2.0 license identifier to files with no licenseGreg Kroah-Hartman
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-04-18btrfs: convert extent_map.refs from atomic_t to refcount_tElena Reshetova
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-07-26btrfs: Fix slab accounting flagsNikolay Borisov
BTRFS is using a variety of slab caches to satisfy internal needs. Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT, meaning allocations from the caches are going to be accounted as SReclaimable. At the same time btrfs is not registering any shrinkers whatsoever, thus preventing memory from the slabs to be shrunk. This means those caches are not in fact reclaimable. To fix this remove the SLAB_RECLAIM_ACCOUNT on all caches apart from the inode cache, since this one is being freed by the generic VFS super_block shrinker. Also set the transaction related caches as SLAB_TEMPORARY, to better document the lifetime of the objects (it just translates to SLAB_RECLAIM_ACCOUNT). Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-05-25btrfs: fix string and comment grammatical issues and typosNicholas D Steeves
Signed-off-by: Nicholas D Steeves <nsteeves@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-14btrfs: Fix misspellings in comments.Adam Buchbinder
Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-03-11btrfs: move btrfs_compression_type to compression.hAnand Jain
So that its better organized. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-02-18btrfs: drop null testing before destroy functionsKinglong Mee
Cleanup. kmem_cache_destroy has support NULL argument checking, so drop the double null testing before calling it. Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-15btrfs: cleanup, stop casting for extent_map->lookup everywhereJeff Mahoney
Overloading extent_map->bdev to struct map_lookup * might have started out as a means to an end, but it's a pattern that's used all over the place now. Let's get rid of the casting and just add a union instead. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2014-11-21Btrfs: do not move em to modified list when unpinningJosef Bacik
We use the modified list to keep track of which extents have been modified so we know which ones are candidates for logging at fsync() time. Newly modified extents are added to the list at modification time, around the same time the ordered extent is created. We do this so that we don't have to wait for ordered extents to complete before we know what we need to log. The problem is when something like this happens log extent 0-4k on inode 1 copy csum for 0-4k from ordered extent into log sync log commit transaction log some other extent on inode 1 ordered extent for 0-4k completes and adds itself onto modified list again log changed extents see ordered extent for 0-4k has already been logged at this point we assume the csum has been copied sync log crash On replay we will see the extent 0-4k in the log, drop the original 0-4k extent which is the same one that we are replaying which also drops the csum, and then we won't find the csum in the log for that bytenr. This of course causes us to have errors about not having csums for certain ranges of our inode. So remove the modified list manipulation in unpin_extent_cache, any modified extents should have been added well before now, and we don't want them re-logged. This fixes my test that I could reliably reproduce this problem with. Thanks, cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>