From dced3a18e426284548b6658d033ce15a78c13adc Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Wed, 7 Apr 2021 01:15:45 -0400
Subject: fsck & allocator design docs

---
 Allocator.mdwn |  55 ++++++++++
 Fsck.mdwn      | 160 +++++++++++++++++
 sidebar.mdwn   |   2 +
 3 files changed, 217 insertions(+)
 create mode 100644 Allocator.mdwn
 create mode 100644 Fsck.mdwn

diff --git a/Allocator.mdwn b/Allocator.mdwn
new file mode 100644
index 0000000..9bad84a

It's past time for the allocator to be redesigned, or at least heavily
reconsidered. Nothing should be off the table at this point - we could
conceivably even move away from bucket based allocation.

Here's a list of considerations:

* The bucket based allocator does have advantages. It's fast - most of the work
  in the allocator is per bucket, and buckets are typically 512k to 2M, even up
  to 8M. Metadata is one key per bucket, and the alloc btree uses the btree key
  cache, which is very good for performance - it means extents btree updates
  don't have to do a second (relatively expensive) btree update to update alloc
  info, they only update the key cache.

* The bucket based allocator also lets us store and cheaply discard cached data,
  by incrementing the generation number for that bucket.

* Disadvantage: the bucket allocator requires us to have copygc, with everything
  that entails.

* It would be really nice to have backpointers at some point - copygc would
  greatly benefit from having backpointers (currently we have to scan the entire
  extents and reflink btrees to find data that needs to be moved), and erasure
  coding could also make good use of backpointers.

  If we added another btree for backpointers, it wouldn't make sense for it to
  use the btree key cache, and thus extents btree updates would get more
  expensive. Perhaps alloc keys could include a list of backpointers? They'd be
  32 bytes each, and bkeys can be a little short of 2k - that's only ~50
  backpointers per bkey (see the sketch below), which we'd easily overflow with
  small writes and large buckets. If a secondary backpointers btree was just
  used as a fallback, it would be getting used right when the performance impact
  matters most - small writes.

  We should investigate the actual performance overhead of adding another btree
  update for backpointers - it might not be all that bad. Right now we're
  looking at ~450k random updates per second on a single core, and that includes
  the full `bch2_trans_commit()` path; the most expensive parts of that path are
  per transaction commit, not per btree update (e.g. getting the journal
  reservation). Also, we'll hopefully be getting gap buffers soon, which will
  improve btree update performance.
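To put a number on the "~50 backpointers per alloc key" estimate above, here's a
back-of-the-envelope sketch. Both the struct layout and the space reserved for
the bkey header and the existing alloc fields are assumptions for illustration -
this is not the actual bcachefs on-disk format:

```
/*
 * Back-of-the-envelope check of the "~50 backpointers per alloc key"
 * estimate.  The struct layout and the overhead figure are assumptions,
 * not the real bcachefs on-disk format.
 */
#include <stdint.h>
#include <stdio.h>

struct hypothetical_backpointer {
	uint64_t	btree_and_level;	/* which btree/level the pointing key is in */
	uint64_t	inum;			/* position of the pointing key: inode */
	uint64_t	offset;			/* position of the pointing key: offset */
	uint64_t	bucket_offset_len;	/* where in the bucket, and how much */
};						/* 4 * 8 = 32 bytes, as assumed above */

int main(void)
{
	size_t max_bkey_bytes	= 2048;	/* "bkeys can be a little short of 2k" */
	size_t overhead		= 256;	/* guess: bkey header + existing alloc fields */
	size_t per_bp		= sizeof(struct hypothetical_backpointer);

	printf("~%zu backpointers per alloc key\n",
	       (max_bkey_bytes - overhead) / per_bp);	/* prints ~56 */
	return 0;
}
```

With large buckets and small writes, a few dozen slots per bucket would be
exhausted quickly, which is why a fallback backpointers btree would end up in
the hot path for exactly the workload we care about.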
The most pressing consideration though is that we need to eliminate, as much as
possible, the periodic scanning the allocator thread(s) have to do to find
available buckets. It's too easy to end up with bugs where this scanning happens
over and over again (we have one right now...), it's a scalability issue, and
ideally we wouldn't be doing it at all.

We would also really like to get rid of the in memory array of buckets - another
scalability issue. The in memory `struct bucket` now mostly just mirrors what's
in the btree, in `KEY_TYPE_alloc` keys. The one exception is `journal_seq`, which
is the journal sequence number of the last btree update that touched that
bucket. It's needed for knowing whether we need to flush the journal before
allocating and writing to that bucket - perhaps it should just be added to
`struct bch_alloc` and stored in the keys in the btree.

diff --git a/Fsck.mdwn b/Fsck.mdwn
new file mode 100644
index 0000000..a6fc1d1

Fsck in bcachefs:
=================

This document is an overview of fsck in bcachefs, as well as what will need to
change for snapshots.

In bcachefs, the fsck code is only responsible for checking higher level
filesystem structure. Btree topology and allocation information are checked by
`btree_gc.c`, which doesn't need to be aware of snapshots. This reduces the
scope of fsck considerably.

What it does have to check is:

* For each extent, dirent and xattr, that a corresponding inode exists and is of
  the appropriate type.

* Dirents and xattrs are indexed by hash, with linear probing. The fsck code is
  responsible for checking that they're stored at the correct index (in case
  there was a hash collision, every index between the hash value and the actual
  index must be occupied by keys or hash whiteouts) and that there are no
  duplicates. (See the sketch after this list.)

* Dirents: check that the target inode exists and matches the dirent's `d_type`,
  and check the target inode's backpointer fields.

* Extents:
    * Check that extents do not overlap
    * Check that extents do not exist past the inode's `i_size`
    * Count extents for each inode and check the inode's `i_sectors`.

* Check that the root directory exists

* Check that lost+found exists

* Check directory structure: currently, this is done with a depth-first
  traversal from the root directory. We check for directories with multiple
  links, and then scan the inodes btree for directories that weren't found by
  the DFS.

* Check each inode's link count: at this point we know that every dirent exists
  in a directory (has a corresponding inode of the correct type), has a valid
  target, and is reachable (because we've checked that all directories are
  reachable). All that's left is to scan the dirents btree, counting the number
  of dirents that point to each inode number, and verifying that against each
  inode's link count.
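The hash-index invariant mentioned in the dirents/xattrs item above is the
"no gaps between an entry's hash value and the slot it actually occupies, and no
duplicate names" rule. Here's a minimal model of that check - the slot table and
types are stand-ins for illustration, not the real bcachefs fsck code:

```
/*
 * Minimal model of the dirent/xattr hash-index check: entries live at
 * hash(name), with linear probing on collision.  For every entry, each
 * slot from its home slot up to the slot it occupies must be filled (by
 * another key or a hash whiteout), and no name may appear twice.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NR_SLOTS 8

enum slot_type { EMPTY, WHITEOUT, OCCUPIED };

struct slot {
	enum slot_type	type;
	unsigned	hash;		/* hash of the name, i.e. home slot */
	const char	*name;
};

static bool check_hash_table(const struct slot *t, unsigned nr)
{
	for (unsigned i = 0; i < nr; i++) {
		if (t[i].type != OCCUPIED)
			continue;

		/* every slot from the home slot up to here must be in use */
		for (unsigned j = t[i].hash; j != i; j = (j + 1) % nr)
			if (t[j].type == EMPTY) {
				printf("gap at %u for \"%s\"\n", j, t[i].name);
				return false;
			}

		/* no duplicates of the same name */
		for (unsigned j = 0; j < nr; j++)
			if (j != i && t[j].type == OCCUPIED &&
			    !strcmp(t[j].name, t[i].name)) {
				printf("duplicate \"%s\"\n", t[i].name);
				return false;
			}
	}
	return true;
}

int main(void)
{
	struct slot table[NR_SLOTS] = {
		[2] = { OCCUPIED, 2, "foo" },
		[3] = { OCCUPIED, 2, "bar" },	/* collided with "foo", probed to 3 */
		[5] = { OCCUPIED, 4, "baz" },	/* slot 4 is empty: fsck should complain */
	};

	printf("consistent: %d\n", check_hash_table(table, NR_SLOTS));
	return 0;
}
```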
Changes for snapshots:
======================

Subvolumes vs. snapshots:

Filesystem code has to know what subvolume it is in, in order to fetch the
current snapshot ID at the start of every btree transaction. Fsck code does not
know or care what subvolume a given key belongs to - a key may belong to
multiple subvolumes, if it's in a snapshot that's an interior node in the
snapshots tree.

Instead, when one key references another (e.g. a dirent that points to an
inode), it does the lookup for the inode by searching for an inode with a
snapshot ID equal to or an ancestor of the dirent key's snapshot ID.

IOW, snapshot IDs represent (branchable) "points in time", and fsck wants to
check that every view of the filesystem, at every point in time, is consistent.

Checking the filesystem for consistency in this fashion is conceptually
straightforward - every key has a snapshot ID, and we use that when looking up
anything that it references. However, we'll have to be careful about how we
repair inconsistencies when the snapshot ID is an interior node:

If we repair an inconsistency by deleting a key, we need to ensure that key is
not referenced by child snapshots. For example, if we find an extent that does
not have an `S_ISREG` inode, currently we delete it. But we might find an extent
that doesn't have a corresponding inode with a snapshot ID equal to or an
ancestor of the extent's snapshot ID, but does have an inode in a child
snapshot. If that were to occur, we should probably copy that inode to the
extent's snapshot ID.

We would like it to be as hard a rule as possible that fsck never deletes or
modifies keys that belong to interior node snapshot IDs, because when people are
using snapshots they'll generally want the data in those snapshots to always
stay intact, and it would be rather bad form if data was torched due to a bug in
fsck.

It may not be possible to adhere to this rule while fixing all filesystem
inconsistencies - e.g. if extents exist above an inode's `i_size`, then
generally speaking the right thing to do is to delete those extents. But if we
find inconsistencies at interior node snapshot IDs, then perhaps the best thing
to do would be to just leave them alone, or only apply changes in leaf node
snapshot IDs.

Algorithmic considerations:
===========================

We want bcachefs to scale well into the millions of snapshots. This means that
there can't generally be operations that have to run for each subvolume or
snapshot - the fsck algorithms have to work one key at a time, processing them
in natural btree order.

This means that the extents, dirents and xattrs passes need to use
`BTREE_ITER_ALL_SNAPSHOTS`, meaning that as they walk keys for a given inode,
they'll be seeing keys from every snapshot version mixed together - and there
will be multiple versions of the inode, in different snapshots, that those keys
need to be checked against.

There's a helper in the fsck code that these passes use, `walk_inode()`, that
fetches the inode for the current inode number when it changes. This will
probably change to load every copy of that inode, in each snapshot ID. Also,
where we have to maintain state, we'll have to duplicate that state for every
snapshot ID that exists at that inode number.

We'll also need to know, for every key that we process, which snapshot IDs it is
visible in: so that e.g. the extents pass can correctly count `i_sectors`.

Note that "millions of snapshots" will not _typically_ mean "millions of
snapshot IDs at the same time" - when we allocate new inodes we pick inode
numbers that aren't in use in any other snapshot or subvolume, and most files in
a filesystem are created, written to once and then later atomically replaced
with a rename call. When we're dealing with a file that has keys with many
different snapshot IDs, we need to make sure we don't have any hidden O(n^2)
algorithms, but if fsck requires hundreds of megabytes of state to check these
files with worst case numbers of overwrites, that's not the end of the world.

## Checking directory structure:

The current approach of doing a DFS from the root directory still works with
snapshots, but algorithmically it's too slow. If we have many sparse snapshots,
the DFS would recurse into each of those snapshots, walking directory structure
it has already seen in the original subvolume - but those snapshots could have
directory structure changes in one or two places, so it would have to walk them
anyway. We need an algorithm that iterates over keys, not filesystem structure.

Fortunately, since backpointers from inodes to dirents were recently added, we
can take a different approach. We walk the inodes btree, and for every directory
inode we find we chase backpointers until we get to the root directory. Note
that since previous passes have already checked backpointers, we're really just
checking for loops here. This will still check the same structure repeatedly,
like the DFS approach, but here we can solve that by adding every inum:snapshot
tuple to a hash table after we've verified it has a path to the root.
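A sketch of the shape of that pass, under stated assumptions: the flat table
below stands in for the inodes btree and its dirent backpointers, the `verified`
flag stands in for the inum:snapshot hash table, and lookups stay within one
snapshot ID for simplicity. It is not the real fsck code, just the loop check
described above:

```
/*
 * Sketch of the directory-structure pass: for every directory inode,
 * chase backpointers towards the root, remembering already-verified
 * inum:snapshot tuples so shared structure isn't rechecked, and
 * detecting loops along the way.
 */
#include <stdbool.h>
#include <stdio.h>

#define ROOT_INO	1
#define MAX_PATH	64

struct dir_inode {
	unsigned long	inum;
	unsigned	snapshot;
	unsigned long	parent_inum;	/* from the inode's dirent backpointer */
	bool		verified;	/* stands in for the inum:snapshot hash table */
};

/* toy "inodes btree": 2 -> 4 -> 3 -> 2 is a loop in snapshot 1 */
static struct dir_inode inodes[] = {
	{ ROOT_INO, 1, ROOT_INO },
	{ 2, 1, 4 },
	{ 3, 1, 2 },
	{ 4, 1, 3 },
	{ 5, 1, ROOT_INO },
};
#define NR_INODES (sizeof(inodes) / sizeof(inodes[0]))

static struct dir_inode *lookup(unsigned long inum, unsigned snapshot)
{
	for (unsigned i = 0; i < NR_INODES; i++)
		if (inodes[i].inum == inum && inodes[i].snapshot == snapshot)
			return &inodes[i];
	return NULL;
}

static bool check_path_to_root(struct dir_inode *d)
{
	struct dir_inode *path[MAX_PATH];
	unsigned depth = 0;

	while (d->inum != ROOT_INO && !d->verified) {
		/* a loop comes back to an inode already on this path */
		for (unsigned i = 0; i < depth; i++)
			if (path[i] == d) {
				printf("loop at inum %lu:%u\n", d->inum, d->snapshot);
				return false;
			}

		if (depth == MAX_PATH)
			return false;
		path[depth++] = d;

		d = lookup(d->parent_inum, d->snapshot);
		if (!d)
			return false;	/* missing parent; repair is a separate problem */
	}

	/* everything on the path we just walked is now known-good */
	for (unsigned i = 0; i < depth; i++)
		path[i]->verified = true;
	return true;
}

int main(void)
{
	for (unsigned i = 0; i < NR_INODES; i++)
		printf("inum %lu:%u ok: %d\n", inodes[i].inum, inodes[i].snapshot,
		       check_path_to_root(&inodes[i]));
	return 0;
}
```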
## Checking inode link counts:

This one will be tedious, but algorithmically it doesn't change much. The
existing code walks the dirents btree in natural key order, and then for each
dirent increments a link count in a radix tree. After walking the dirents, we
walk the inodes btree and compare the link counts we calculated to what's in the
inode.

As before, we'll walk the dirents btree in natural key order. However, we can't
accumulate link counts without also looking at what actually exists in the
inodes btree: a dirent may point to multiple inodes with different snapshot IDs,
and which of those we increment also depends on which snapshots that dirent is
visible in (we'll already have a helper for that part, as discussed previously).

This means that instead of using a radix tree, we'll have to start by walking
the inodes btree, and for every inode we find create an entry in a new in memory
data structure (eytzinger layout search tree, perhaps) indexed by inode
number:snapshot ID and containing just the link count we'll be calculating.

Fortunately, hardlinks are rare in practice. We can make use of the fact that
we've already checked inode backpointers and only include inodes that actually
have hardlinks in this table.

diff --git a/sidebar.mdwn b/sidebar.mdwn
index ae7eee0..06cf820 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -16,5 +16,7 @@
     * [[Encryption]]
     * [[Transactions]]
     * [[Snapshots]]
+    * [[Allocator]]
+    * [[Fsck]]
 * Other technical docs
     * [[StablePages]]