Fsck in bcachefs:
=================

This document is an overview of fsck in bcachefs, as well as what will need to
change for snapshots.

In bcachefs, the fsck code is only responsible for checking higher-level
filesystem structure. Btree topology and allocation information are checked by
`btree_gc.c`, which doesn't need to be aware of snapshots. This reduces the
scope of fsck considerably.

What it does have to check is:

* For each extent, dirent and xattr, that a corresponding inode exists and is
  of the appropriate type.

* Dirents and xattrs are indexed by hash, with linear probing. The fsck code is
  responsible for checking that they're stored at the correct index (in case
  there was a hash collision, every index between the hash value and the actual
  index must be occupied by keys or hash whiteouts) and that there are no
  duplicates.

* Dirents: check that the target inode exists and matches the dirent's
  `d_type`, and check the target inode's backpointer fields.

* Extents:
  * Check that extents do not overlap
  * Check that extents do not exist past the inode's `i_size`
  * Count extents for each inode and check the inode's `i_sectors`.

* Check that the root directory exists

* Check that lost+found exists

* Check directory structure: currently, this is done with a depth-first
  traversal from the root directory. We check for directories with multiple
  links, and then scan the inodes btree for directories that weren't found by
  the DFS.

* Check each inode's link count: at this point we know that every dirent exists
  in a directory (has a corresponding inode of the correct type), has a valid
  target, and is reachable (because we've checked that all directories are
  reachable).
  All that's left is to scan the dirents btree, counting the number of dirents
  that point to each inode number, and verifying that against each inode's
  link count.


Changes for snapshots:
======================

Subvolumes vs. snapshots:

Filesystem code has to know what subvolume it is in, in order to fetch the
current snapshot ID at the start of every btree transaction. Fsck code does
not know or care what subvolume a given key belongs to - a key may belong to
multiple subvolumes, if it's in a snapshot that's an interior node in the
snapshots tree.

Instead, when one key references another (e.g. a dirent that points to an
inode), it does the lookup for the inode by searching for an inode with a
snapshot ID equal to or an ancestor of the dirent key's snapshot ID.

IOW, snapshot IDs represent (branchable) "points in time", and fsck wants to
check that every view of the filesystem, at every point in time, is consistent.

Checking the filesystem for consistency in this fashion is conceptually
straightforward - every key has a snapshot ID, and we use that when looking up
anything that it references. However, we'll have to be careful about how we
repair inconsistencies when the snapshot ID is an interior node:

If we repair an inconsistency by deleting a key, we need to ensure that key is
not referenced by child snapshots. For example, if we find an extent that does
not have an `S_ISREG` inode, currently we delete it. But we might find an
extent that doesn't have a corresponding inode with a snapshot ID equal to or
an ancestor of the extent's snapshot ID, but does have an inode in a child
snapshot. If that were to occur, we should probably copy that inode to the
extent's snapshot ID.
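To illustrate the ancestor-based lookup described above, here's a toy sketch.
The names (`parent[]`, `snapshot_is_ancestor()`, `lookup_inode_for_snapshot()`)
and the flat-array representation are made up for illustration - in bcachefs
the snapshot tree lives in its own btree, and a real lookup would iterate the
inodes btree and prefer the closest ancestor:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model of the snapshot tree: parent[id] is the parent snapshot ID,
 * 0 meaning "no parent".  Illustrative only.
 */
static bool snapshot_is_ancestor(const uint32_t *parent,
				 uint32_t anc, uint32_t id)
{
	while (id && id != anc)
		id = parent[id];
	return id == anc;
}

struct toy_inode {
	uint64_t	inum;
	uint32_t	snapshot;
};

/*
 * The lookup described above: given the snapshot ID of a dirent, find a
 * version of the target inode whose snapshot ID is equal to or an
 * ancestor of the dirent's snapshot ID.
 */
static struct toy_inode *
lookup_inode_for_snapshot(const uint32_t *parent,
			  struct toy_inode *inodes, size_t nr,
			  uint64_t inum, uint32_t snapshot)
{
	for (size_t i = 0; i < nr; i++)
		if (inodes[i].inum == inum &&
		    snapshot_is_ancestor(parent, inodes[i].snapshot, snapshot))
			return &inodes[i];
	return NULL;
}
```

Note how a single inode version in an ancestor snapshot satisfies lookups from
all of its descendant snapshots - which is exactly why a key can "belong" to
multiple subvolumes at once.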
We would like it to be as hard a rule as possible that fsck never deletes or
modifies keys that belong to interior node snapshot IDs, because when people
are using snapshots they'll generally want the data in those snapshots to
always stay intact, and it would be rather bad form if data was torched due to
a bug in fsck.

It may not be possible to adhere to this rule while fixing all filesystem
inconsistencies - e.g. if extents exist above an inode's `i_size`, then
generally speaking the right thing to do is to delete those extents. But if we
find inconsistencies at interior node snapshot IDs, then perhaps the best
thing to do would be to just leave them, or only apply changes in leaf node
snapshot IDs.

Algorithmic considerations:
===========================

We want bcachefs to scale well into the millions of snapshots. This means that
there can't generally be operations that have to run for each subvolume or
snapshot - the fsck algorithms have to work one key at a time, processing them
in natural btree order.

This means that the extents, dirents and xattrs passes need to use
`BTREE_ITER_ALL_SNAPSHOTS`, meaning that as they walk the keys for a given
inode, they'll be seeing keys from every snapshot version mixed together - and
there will be multiple versions of the inode, in different snapshots, that the
keys need to be checked against.

There's a helper in the fsck code that these passes use, `walk_inode()`, that
fetches the inode for the current inode number when it changes. This will
probably change to load every copy of that inode, in each snapshot ID. Also,
where we have to maintain state we'll have to duplicate that state for every
snapshot ID that exists at that inode number.

We'll also need to know, for every key that we process, which snapshot IDs it
is visible in - so that e.g. the extents pass can correctly count `i_sectors`.
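As a rough sketch of the kind of visibility-aware accounting the extents pass
will need, consider computing `i_sectors` for one version of an inode. This toy
version treats an extent as visible whenever its snapshot ID is equal to or an
ancestor of the inode's - a real pass would also have to handle extents
overwritten in intermediate snapshots, which is omitted here, and all names
are illustrative rather than the kernel's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy snapshot tree, as before: parent[id] is the parent snapshot ID. */
static bool snapshot_is_ancestor(const uint32_t *parent,
				 uint32_t anc, uint32_t id)
{
	while (id && id != anc)
		id = parent[id];
	return id == anc;
}

struct toy_extent {
	uint64_t	inum;
	uint32_t	snapshot;
	uint32_t	sectors;
};

/*
 * Accumulate i_sectors for the version of @inum at @snapshot: an extent
 * counts if its snapshot ID is equal to or an ancestor of the inode's.
 * (Overwrites in intermediate snapshots are ignored in this sketch.)
 */
static uint64_t count_i_sectors(const uint32_t *parent,
				const struct toy_extent *extents, size_t nr,
				uint64_t inum, uint32_t snapshot)
{
	uint64_t sectors = 0;

	for (size_t i = 0; i < nr; i++)
		if (extents[i].inum == inum &&
		    snapshot_is_ancestor(parent, extents[i].snapshot, snapshot))
			sectors += extents[i].sectors;
	return sectors;
}
```

In the real pass the extents arrive one at a time in btree order, so rather
than re-scanning per inode version like this, we'd keep one running counter
per snapshot ID that exists at the current inode number.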
Note that "millions of snapshots" will not _typically_ mean "millions of
snapshot IDs at the same time" - when we allocate new inodes we pick inode
numbers that aren't in use in any other snapshot or subvolume, and most files
in a filesystem are created, written to once, and then later atomically
replaced with a rename call. When we're dealing with a file that has keys with
many different snapshot IDs we need to make sure we don't have any hidden
O(n^2) algorithms, but if fsck requires hundreds of megabytes of state to
check these files with worst case numbers of overwrites, that's not the end of
the world.

## Checking directory structure:

The current approach of doing a DFS from the root directory still works with
snapshots, but algorithmically it's too slow. If we have many sparse
snapshots, the DFS would recurse into each of those snapshots, walking
directory structure it has already seen in the original subvolume - but those
snapshots could have directory structure changes in one or two places, so it
would have to revisit all of it. We need an algorithm that iterates over keys,
not filesystem structure.

Fortunately, since backpointers from inodes to dirents were recently added we
can take a different approach. We walk the inodes btree, and for every
directory inode we find we chase backpointers until we get to the root
directory. Note that since previous passes have already checked backpointers,
we're really just checking for loops here. This would still check the same
structure repeatedly, like the DFS approach, but here we can solve that by
adding every inum:snapshot tuple to a hash table after we've verified it has a
path to the root.

## Checking inode link counts:

This one will be tedious, but algorithmically it doesn't change much. The
existing code walks the dirents btree in natural key order, and for each
dirent increments a link count in a radix tree.
After walking the dirents, we walk the inodes btree and compare the link
counts we calculated to what's in each inode.

As before, we'll walk the dirents btree in natural key order. However, we
can't accumulate link counts without also looking at what actually exists in
the inodes btree: a dirent may point to multiple inodes with different
snapshot IDs, and which of those we increment also depends on which snapshots
that dirent is visible in (we'll already have a helper for that part, as
discussed previously).

This means that instead of using a radix tree, we'll have to start by walking
the inodes btree, and for every inode we find create a new in-memory data
structure (an eytzinger-layout search tree, perhaps) indexed by inode
number:snapshot ID and containing just the link count we'll be calculating.

Fortunately, hardlinks are usually rare in practice. We can make use of the
fact that we've already checked inode backpointers and only include inodes
that actually have hardlinks in this table.
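A minimal sketch of that link-count table, keyed by inode number:snapshot ID -
a plain sorted array with binary search stands in for the eytzinger-layout
tree (eytzinger layout is just a cache-friendlier arrangement of the same
sorted order), and all names here are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One entry per inode version we're tracking a link count for. */
struct nlink {
	uint64_t	inum;
	uint32_t	snapshot;
	uint32_t	count;
};

static int nlink_cmp(const void *_l, const void *_r)
{
	const struct nlink *l = _l, *r = _r;

	if (l->inum != r->inum)
		return l->inum < r->inum ? -1 : 1;
	if (l->snapshot != r->snapshot)
		return l->snapshot < r->snapshot ? -1 : 1;
	return 0;
}

/* Built once by walking the inodes btree, then sorted for lookups: */
static void nlink_table_sort(struct nlink *t, size_t nr)
{
	qsort(t, nr, sizeof(*t), nlink_cmp);
}

/*
 * Called by the dirents pass for each (inode number, snapshot ID) a
 * dirent resolves to; inodes not in the table aren't being tracked.
 */
static void nlink_inc(struct nlink *t, size_t nr,
		      uint64_t inum, uint32_t snapshot)
{
	struct nlink key = { .inum = inum, .snapshot = snapshot };
	struct nlink *n = bsearch(&key, t, nr, sizeof(*t), nlink_cmp);

	if (n)
		n->count++;
}
```

Since only inodes with hardlinks need to be in the table, it should stay small
even on filesystems with many snapshots.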