From dced3a18e426284548b6658d033ce15a78c13adc Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Wed, 7 Apr 2021 01:15:45 -0400
Subject: fsck & allocator design docs

---
 Allocator.mdwn |  55 ++++++++++
 Fsck.mdwn      | 160 +++++++++++++++++
 sidebar.mdwn   |   2 +
 3 files changed, 217 insertions(+)
 create mode 100644 Allocator.mdwn
 create mode 100644 Fsck.mdwn

diff --git a/Allocator.mdwn b/Allocator.mdwn
new file mode 100644
index 0000000..9bad84a

It's past time for the allocator to be redesigned, or at least heavily
reconsidered. Nothing should be off the table at this point - we could
conceivably even move away from bucket based allocation.

Here's a list of considerations:

* The bucket based allocator does have advantages. It's fast - most of the work
  in the allocator is per bucket, and buckets are typically 512k to 2M, even up
  to 8M. Metadata is one key per bucket, and the alloc btree uses the btree key
  cache, which is very good for performance - it means extents btree updates
  don't have to do a second (relatively expensive) btree update to update alloc
  info, they only update the key cache.

* The bucket based allocator also lets us store and cheaply discard cached data,
  by incrementing the generation number for that bucket.

* Disadvantage: the bucket allocator requires us to have copygc, with everything
  that entails.

* It would be really nice to have backpointers at some point - copygc would
  greatly benefit from having backpointers (currently we have to scan the entire
  extents and reflink btrees to find data that needs to be moved), and erasure
  coding could also make good use of backpointers.

  If we added another btree for backpointers, it wouldn't make sense for it to
  use the btree key cache, and thus extents btree updates would get more
  expensive. Perhaps alloc keys could include a list of backpointers? They'd be
  32 bytes each, and bkeys can be a little short of 2k - that's only ~50
  backpointers per bkey (see the sketch below), which we'd easily overflow with
  small writes and large buckets. If a secondary backpointers btree was just
  used as a fallback, it would be getting used right when the performance impact
  matters most - small writes.

  We should investigate the actual performance overhead of adding another btree
  update for backpointers - it might not be all that bad. Right now we're
  looking at ~450k random updates per second on a single core, and that includes
  the full `bch2_trans_commit()` path; the most expensive parts of that path are
  per transaction commit, not per btree update (e.g. getting the journal
  reservation). Also, we'll hopefully be getting gap buffers soon, which will
  improve btree update performance.
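To put a number on the "~50 backpointers per alloc key" estimate above, here's a
back-of-the-envelope sketch. Both the struct layout and the space reserved for
the bkey header and the existing alloc fields are assumptions for illustration -
this is not the actual bcachefs on-disk format:

```
/*
 * Back-of-the-envelope check of the "~50 backpointers per alloc key"
 * estimate.  The struct layout and the overhead figure are assumptions,
 * not the real bcachefs on-disk format.
 */
#include <stdint.h>
#include <stdio.h>

struct hypothetical_backpointer {
	uint64_t	btree_and_level;	/* which btree/level the pointing key is in */
	uint64_t	inum;			/* position of the pointing key: inode */
	uint64_t	offset;			/* position of the pointing key: offset */
	uint64_t	bucket_offset_len;	/* where in the bucket, and how much */
};						/* 4 * 8 = 32 bytes, as assumed above */

int main(void)
{
	size_t max_bkey_bytes	= 2048;	/* "bkeys can be a little short of 2k" */
	size_t overhead		= 256;	/* guess: bkey header + existing alloc fields */
	size_t per_bp		= sizeof(struct hypothetical_backpointer);

	printf("~%zu backpointers per alloc key\n",
	       (max_bkey_bytes - overhead) / per_bp);	/* prints ~56 */
	return 0;
}
```

With large buckets and small writes, a few dozen slots per bucket would be
exhausted quickly, which is why a fallback backpointers btree would end up in
the hot path for exactly the workload we care about.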
The most pressing consideration though is that we need to eliminate, as much as
possible, the periodic scanning the allocator thread(s) have to do to find
available buckets. It's too easy to end up with bugs where this scanning happens
over and over again (we have one right now...), it's a scalability issue, and
ideally we wouldn't be doing it at all.

We would also really like to get rid of the in memory array of buckets - another
scalability issue. The in memory `struct bucket` now mostly just mirrors what's
in the btree, in `KEY_TYPE_alloc` keys. The one exception is `journal_seq`, which
is the journal sequence number of the last btree update that touched that
bucket. It's needed for knowing whether we need to flush the journal before
allocating and writing to that bucket - perhaps it should just be added to
`struct bch_alloc` and stored in the keys in the btree.

diff --git a/Fsck.mdwn b/Fsck.mdwn
new file mode 100644
index 0000000..a6fc1d1

Fsck in bcachefs:
=================

This document is an overview of fsck in bcachefs, as well as what will need to
change for snapshots.

In bcachefs, the fsck code is only responsible for checking higher level
filesystem structure. Btree topology and allocation information are checked by
`btree_gc.c`, which doesn't need to be aware of snapshots. This reduces the
scope of fsck considerably.

What it does have to check is:

* For each extent, dirent and xattr, that a corresponding inode exists and is of
  the appropriate type.

* Dirents and xattrs are indexed by hash, with linear probing. The fsck code is
  responsible for checking that they're stored at the correct index (in case
  there was a hash collision, every index between the hash value and the actual
  index must be occupied by keys or hash whiteouts) and that there are no
  duplicates. (See the sketch after this list.)

* Dirents: check that the target inode exists and matches the dirent's `d_type`,
  and check the target inode's backpointer fields.

* Extents:
    * Check that extents do not overlap
    * Check that extents do not exist past the inode's `i_size`
    * Count extents for each inode and check the inode's `i_sectors`.

* Check that the root directory exists

* Check that lost+found exists

* Check directory structure: currently, this is done with a depth-first
  traversal from the root directory. We check for directories with multiple
  links, and then scan the inodes btree for directories that weren't found by
  the DFS.

* Check each inode's link count: at this point we know that every dirent exists
  in a directory (has a corresponding inode of the correct type), has a valid
  target, and is reachable (because we've checked that all directories are
  reachable). All that's left is to scan the dirents btree, counting the number
  of dirents that point to each inode number, and verifying that against each
  inode's link count.
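The hash-index invariant mentioned in the dirents/xattrs item above is the
"no gaps between an entry's hash value and the slot it actually occupies, and no
duplicate names" rule. Here's a minimal model of that check - the slot table and
types are stand-ins for illustration, not the real bcachefs fsck code:

```
/*
 * Minimal model of the dirent/xattr hash-index check: entries live at
 * hash(name), with linear probing on collision.  For every entry, each
 * slot from its home slot up to the slot it occupies must be filled (by
 * another key or a hash whiteout), and no name may appear twice.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NR_SLOTS 8

enum slot_type { EMPTY, WHITEOUT, OCCUPIED };

struct slot {
	enum slot_type	type;
	unsigned	hash;		/* hash of the name, i.e. home slot */
	const char	*name;
};

static bool check_hash_table(const struct slot *t, unsigned nr)
{
	for (unsigned i = 0; i < nr; i++) {
		if (t[i].type != OCCUPIED)
			continue;

		/* every slot from the home slot up to here must be in use */
		for (unsigned j = t[i].hash; j != i; j = (j + 1) % nr)
			if (t[j].type == EMPTY) {
				printf("gap at %u for \"%s\"\n", j, t[i].name);
				return false;
			}

		/* no duplicates of the same name */
		for (unsigned j = 0; j < nr; j++)
			if (j != i && t[j].type == OCCUPIED &&
			    !strcmp(t[j].name, t[i].name)) {
				printf("duplicate \"%s\"\n", t[i].name);
				return false;
			}
	}
	return true;
}

int main(void)
{
	struct slot table[NR_SLOTS] = {
		[2] = { OCCUPIED, 2, "foo" },
		[3] = { OCCUPIED, 2, "bar" },	/* collided with "foo", probed to 3 */
		[5] = { OCCUPIED, 4, "baz" },	/* slot 4 is empty: fsck should complain */
	};

	printf("consistent: %d\n", check_hash_table(table, NR_SLOTS));
	return 0;
}
```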
Changes for snapshots:
======================

Subvolumes vs. snapshots:

Filesystem code has to know what subvolume it is in, in order to fetch the
current snapshot ID at the start of every btree transaction. Fsck code does not
know or care what subvolume a given key belongs to - a key may belong to
multiple subvolumes, if it's in a snapshot that's an interior node in the
snapshots tree.

Instead, when one key references another (e.g. a dirent that points to an
inode), it does the lookup for the inode by searching for an inode with a
snapshot ID equal to or an ancestor of the dirent key's snapshot ID.

IOW, snapshot IDs represent (branchable) "points in time", and fsck wants to
check that every view of the filesystem, at every point in time, is consistent.

Checking the filesystem for consistency in this fashion is conceptually
straightforward - every key has a snapshot ID, and we use that when looking up
anything that it references. However, we'll have to be careful about how we
repair inconsistencies when the snapshot ID is an interior node:

If we repair an inconsistency by deleting a key, we need to ensure that key is
not referenced by child snapshots. For example, if we find an extent that does
not have an `S_ISREG` inode, currently we delete it. But we might find an extent
that doesn't have a corresponding inode with a snapshot ID equal to or an
ancestor of the extent's snapshot ID, but does have an inode in a child
snapshot. If that were to occur, we should probably copy that inode to the
extent's snapshot ID.

We would like it to be as hard a rule as possible that fsck never deletes or
modifies keys that belong to interior node snapshot IDs, because when people are
using snapshots they'll generally want the data in those snapshots to always
stay intact, and it would be rather bad form if data was torched due to a bug in
fsck.

It may not be possible to adhere to this rule while fixing all filesystem
inconsistencies - e.g. if extents exist above an inode's `i_size`, then
generally speaking the right thing to do is to delete those extents. But if we
find inconsistencies at interior node snapshot IDs, then perhaps the best thing
to do would be to just leave them alone, or only apply changes in leaf node
snapshot IDs.

Algorithmic considerations:
===========================

We want bcachefs to scale well into the millions of snapshots. This means that
there can't generally be operations that have to run for each subvolume or
snapshot - the fsck algorithms have to work one key at a time, processing them
in natural btree order.

This means that the extents, dirents and xattrs passes need to use
`BTREE_ITER_ALL_SNAPSHOTS`, meaning that as they walk keys for a given inode,
they'll be seeing keys from every snapshot version mixed together - and there
will be multiple versions of the inode, in different snapshots, that those keys
need to be checked against.

There's a helper in the fsck code that these passes use, `walk_inode()`, that
fetches the inode for the current inode number when it changes. This will
probably change to load every copy of that inode, in each snapshot ID. Also,
where we have to maintain state, we'll have to duplicate that state for every
snapshot ID that exists at that inode number.

We'll also need to know, for every key that we process, which snapshot IDs it is
visible in: so that e.g. the extents pass can correctly count `i_sectors`.

Note that "millions of snapshots" will not _typically_ mean "millions of
snapshot IDs at the same time" - when we allocate new inodes we pick inode
numbers that aren't in use in any other snapshot or subvolume, and most files in
a filesystem are created, written to once and then later atomically replaced
with a rename call. When we're dealing with a file that has keys with many
different snapshot IDs, we need to make sure we don't have any hidden O(n^2)
algorithms, but if fsck requires hundreds of megabytes of state to check these
files with worst case numbers of overwrites, that's not the end of the world.

## Checking directory structure:

The current approach of doing a DFS from the root directory still works with
snapshots, but algorithmically it's too slow. If we have many sparse snapshots,
the DFS would recurse into each of those snapshots, walking directory structure
it has already seen in the original subvolume - but those snapshots could have
directory structure changes in one or two places, so it would have to walk them
anyway. We need an algorithm that iterates over keys, not filesystem structure.

Fortunately, since backpointers from inodes to dirents were recently added, we
can take a different approach. We walk the inodes btree, and for every directory
inode we find we chase backpointers until we get to the root directory. Note
that since previous passes have already checked backpointers, we're really just
checking for loops here. This will still check the same structure repeatedly,
like the DFS approach, but here we can solve that by adding every inum:snapshot
tuple to a hash table after we've verified it has a path to the root.
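A sketch of the shape of that pass, under stated assumptions: the flat table
below stands in for the inodes btree and its dirent backpointers, the `verified`
flag stands in for the inum:snapshot hash table, and lookups stay within one
snapshot ID for simplicity. It is not the real fsck code, just the loop check
described above:

```
/*
 * Sketch of the directory-structure pass: for every directory inode,
 * chase backpointers towards the root, remembering already-verified
 * inum:snapshot tuples so shared structure isn't rechecked, and
 * detecting loops along the way.
 */
#include <stdbool.h>
#include <stdio.h>

#define ROOT_INO	1
#define MAX_PATH	64

struct dir_inode {
	unsigned long	inum;
	unsigned	snapshot;
	unsigned long	parent_inum;	/* from the inode's dirent backpointer */
	bool		verified;	/* stands in for the inum:snapshot hash table */
};

/* toy "inodes btree": 2 -> 4 -> 3 -> 2 is a loop in snapshot 1 */
static struct dir_inode inodes[] = {
	{ ROOT_INO, 1, ROOT_INO },
	{ 2, 1, 4 },
	{ 3, 1, 2 },
	{ 4, 1, 3 },
	{ 5, 1, ROOT_INO },
};
#define NR_INODES (sizeof(inodes) / sizeof(inodes[0]))

static struct dir_inode *lookup(unsigned long inum, unsigned snapshot)
{
	for (unsigned i = 0; i < NR_INODES; i++)
		if (inodes[i].inum == inum && inodes[i].snapshot == snapshot)
			return &inodes[i];
	return NULL;
}

static bool check_path_to_root(struct dir_inode *d)
{
	struct dir_inode *path[MAX_PATH];
	unsigned depth = 0;

	while (d->inum != ROOT_INO && !d->verified) {
		/* a loop comes back to an inode already on this path */
		for (unsigned i = 0; i < depth; i++)
			if (path[i] == d) {
				printf("loop at inum %lu:%u\n", d->inum, d->snapshot);
				return false;
			}

		if (depth == MAX_PATH)
			return false;
		path[depth++] = d;

		d = lookup(d->parent_inum, d->snapshot);
		if (!d)
			return false;	/* missing parent; repair is a separate problem */
	}

	/* everything on the path we just walked is now known-good */
	for (unsigned i = 0; i < depth; i++)
		path[i]->verified = true;
	return true;
}

int main(void)
{
	for (unsigned i = 0; i < NR_INODES; i++)
		printf("inum %lu:%u ok: %d\n", inodes[i].inum, inodes[i].snapshot,
		       check_path_to_root(&inodes[i]));
	return 0;
}
```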
## Checking inode link counts:

This one will be tedious, but algorithmically it doesn't change much. The
existing code walks the dirents btree in natural key order, and then for each
dirent increments a link count in a radix tree. After walking the dirents, we
walk the inodes btree and compare the link counts we calculated to what's in the
inode.

As before, we'll walk the dirents btree in natural key order. However, we can't
accumulate link counts without also looking at what actually exists in the
inodes btree: a dirent may point to multiple inodes with different snapshot IDs,
and which of those we increment also depends on which snapshots that dirent is
visible in (we'll already have a helper for that part, as discussed previously).

This means that instead of using a radix tree, we'll have to start by walking
the inodes btree, and for every inode we find create an entry in a new in memory
data structure (eytzinger layout search tree, perhaps) indexed by inode
number:snapshot ID and containing just the link count we'll be calculating.

Fortunately, hardlinks are rare in practice. We can make use of the fact that
we've already checked inode backpointers and only include inodes that actually
have hardlinks in this table.

diff --git a/sidebar.mdwn b/sidebar.mdwn
index ae7eee0..06cf820 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -16,5 +16,7 @@
     * [[Encryption]]
     * [[Transactions]]
     * [[Snapshots]]
+    * [[Allocator]]
+    * [[Fsck]]
 * Other technical docs
     * [[StablePages]]