From 25f80f663c9e49180ee740b06d7b72ee86899665 Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Tue, 9 Mar 2021 16:32:26 -0500
Subject: New snapshots design doc

---
 Snapshots.mdwn | 299 ++++++++++++++++++++++++++------------------------
 1 file changed, 134 insertions(+), 165 deletions(-)

diff --git a/Snapshots.mdwn b/Snapshots.mdwn
index c5d4527..bac5e54 100644
--- a/Snapshots.mdwn
+++ b/Snapshots.mdwn
@@ -1,166 +1,135 @@
-Snapshots in bcachefs:
-======================
-
-The plan for snapshots in bcachefs is radically different than anything that's
-been done previously in other filesystems, and the end result should be
-drastically more efficient than e.g. btrfs. Snapshots will be very cheap to
-create and keep around; in particular sparse snapshots (many snapshots with only
-a few changes each) will be very cheap.
-
-First thing to note - bcachefs the filesystem is built on top of bcachefs the
-key/value store; all filesystem objects are just key/value pairs indexed by
-inode:offset in a small number of btrees, and the btree code just treats the key
-as a large integer. If this was a relational database, we'd have an extents
-table, an inodes table, a dirents table, etc.
-
-To implement snapshots, the key will be extended so that the low 32 bits will
-now be a snapshot ID (this already exists in struct bpos, it's just unused by
-the existing code).
-
-Snapshots themselves will form a tree, where each entry in the tree has a 32 bit
-snapshot ID and - importantly - we enforce that child snapshots have a node ID
-less than their parent; thus, the root snapshot will have the id U32_MAX.
-
-Then, when looking up filesystem data, our search key will include the snapshot
-ID of the snapshot we're currently in.
-The btree lookup returns the first key
-greater than or equal to the search key, so a lookup will return either a key in
-the current snapshot (if it exists) or (in the common case) a key in some
-ancestor snapshot - the only additional work we have to do for snapshots is to
-check the snapshot ID of the key returned against our current snapshot and
-iterate forwards until we find a key in the current snapshot or an ancestor.
-
-That's pretty much the bcachefs snapshot design in a nutshell, for normal keys.
-Given typical filesystem snapshot usage, I expect the overhead from linear scan
-over unrelated snapshots to be minimal to nonexistent for most users. There are
-instances where this won't be the case - e.g. using snapshots for a virtual
-machine base image; many virtual machines would have their own overwrites to the
-same part of the keyspace. Reflink - which is already implemented in bcachefs -
-doesn't suffer from this problem, so either we would recommend users use reflink
-for those cases, or ideally we would be able to detect this condition and
-transparently do something like reflink when we detect a high density of
-overwrites in part of the keyspace.
-
-Extents:
-========
-
-Extents, however, are more complicated - a key in a child snapshot always
-overwrites/replaces a key in an ancestor snapshot exactly, but extents can
-partially overlap in arbitrary ways. Some examples below will illustrate the
-problems that snapshots introduce, and then a proposed solution.
-
-A note: in bcachefs, extents are indexed by their end position, not their start.
-This is more natural given typical usage: when doing a read we want the first
-extent that ends after the start of the range we're reading from.
-
-In the examples below, 1 is the ancestor snapshot/extent and 2 is the child, and
-extents are listed in their sort order in the btree.
-
-Suppose we have these two extents:
-
- 1: |--------|
- 2:      |---------|
-
-This is a problem: on lookup (for snapshot 2), we find the extent in 1 first, as
-we should - we do require data from it. But we cannot use all of that extent; 2
-overwrites it, so we have to keep scanning forwards to find out how much of that
-extent we can use. But that forward scan has no stopping condition - extents are
-indexed by the end and there can be an arbitrary number of extents in unrelated
-snapshots in between.
-
-Suppose we add the restriction: a child extent cannot cross the boundary at the
-end of the ancestor extent; instead we split the child extent:
-
- 2:      |----|
- 1: |--------|
- 2:           |----|
-
-This is still a problem for lookups, for example:
-
- |------|
-  |-------|
-   |--------|
-    |---------|
-     |----------|
-
-We must also add the restriction: start of child cannot land in middle of
-ancestor. Then we have these extents:
-
- 1: |---|
- 2:      |----|
- 1:           |----|
- 2:                |----|
-
-This appears workable - however, given that we can have an arbitrary number of
-extents in different snapshots at the same position, this means we'll have to
-either split an unbounded number of existing extents - or we could only split
-the extent that's visible in the child snapshot, but this seems to result in
-invariants that would be difficult or impossible to check:
-
- 5: |----|
- 4: |---||----|
- 3: |---------|
- 2: |---------|
- 1: |---------|
-
-A different approach:
----------------------
-
-The original problem arises because introducing snapshots broke the invariant
-that extents are (also) ordered by start of extent (which was a result of
-sorting by end of extent and not having overlapping extents). Suppose we added
-a method for iterating across keys in a btree node in order of extent start
-position.
-
-Back to the first example:
-
- 1: |--------|
- 2:      |---------|
-
-Suppose we're reading from snapshot 2 at the position where extent 1 starts.
-Lookup returns extent 1, as it should - but we need to determine when to stop
-reading from extent 1. If we can scan forwards from extent 1, ordered by extent
-start position, we can scan forwards until we find an extent that overwrites 1
-(is in a newer snapshot than 1 and a descendant of the snapshot we're reading),
-or until we see an extent that starts after the position where extent 1 ends.
-
-Example 2:
-
- 2:      |----|
- 1: |--------------|
-
-Suppose we're reading at the position where extent 1 starts. Lookup will return
-extent 2, but we actually want extent 1: What we can do is notice that the
-extent the lookup returned starts after the lookup position - so then iterate
-backwards, ordered by extent start position, until we find an extent that starts
-at or before our search position.
-
-Example 3:
-
- 1: |----|
- 2: |--------------|
-
-Reading from snapshot 2, at the position where extent 1 starts. Lookup returns
-extent 1, but we actually want extent 2.
-
-Example 4:
-
- 2:      |----|
- 1: |--------------|
- 3: |-----------------------|
-
-Reading from 3, at the start of extent 1: Then what?
-
-Lookup returns extent 2, then as in example 2 we scan backwards by start of
-extent pos to get extent 1.
-
-
-
- 2:    |-----|
- 1: |----------------|
-
- 2:    |-----|
- 1: |----------------|
- 2:           |------------|
+Snapshots & subvolumes:
+=======================
+
+The short version:
+
+bcachefs snapshots are a different approach - we're not using COW btrees.
+Instead, they're based on extending the keys of filesystem items (inodes,
+dirents, xattrs, extents) with a version number - a snapshot ID - in the low
+bits.
+
+Snapshot IDs form a tree, which will be recorded in the snapshots btree. The
+root snapshot ID is `U32_MAX`, and we allocate new snapshot IDs growing down so
+that the parent of a given snapshot ID is always larger than that ID.
+
+To create a new writeable snapshot, we allocate two new snapshot IDs.
+We update the existing subvolume with one of the new snapshot IDs, and assign
+the other snapshot ID to the new snapshot.
+
+When we do a lookup for a filesystem item, we have to check if the snapshot ID
+of the key we found is an ancestor of the snapshot ID we're searching for, and
+filter out items that aren't.
+
+Subvolumes:
+===========
+
+Subvolumes are needed for two reasons:
+ * They're the mechanism for accessing snapshots
+ * Also, they're a way of isolating different parts of the filesystem hierarchy
+   from snapshots, or taking snapshots that aren't global. E.g. you'd create a
+   subvolume for your database so that filesystem snapshots don't COW it, or
+   create subvolumes for each home directory so that users can snapshot their
+   own home directories.
+
+The functionality and userspace interface for snapshots and subvolumes are
+roughly modelled after btrfs, but simplified.
+
+Subvolumes in bcachefs are just fancy directories. We introduce internally a new
+dirent type that can point to subvolumes instead of inodes, and the subvolume
+has a pointer to the root inode. Subvolumes get their own table (btree), and
+subvolume keys have fields for root inode number and snapshot ID.
+
+Subvolumes have no name outside of the filesystem hierarchy. This means that, in
+order to enumerate and list subvolumes, we need to be able to reconstruct their
+path.
+
+To reconstruct paths, we're adding inode backpointers - two new inode fields for
+the inode number of the directory they're in, and the dirent offset. We're only
+adding fields for a single backpointer, i.e. we're not handling hardlinks yet -
+we set an inode flag indicating the backpointer fields are untrusted whenever we
+create a hardlink. If we do ever want to add multiple backpointers for files
+with hardlinks, we'll need to add another btree where each backpointer gets its
+own key. This should be good enough for now.
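The ancestor check that lookups depend on can be sketched in C. This is a toy model under assumed names (`snap_is_ancestor`, a flat parent table) - the real parent links live in the snapshots btree - but it shows why the "parents always have larger IDs" invariant lets the upward walk terminate early:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy snapshot tree: each entry maps a snapshot ID to its parent.
 * In bcachefs these links live in the snapshots btree; this flat
 * table is purely illustrative. */
struct snap_node { uint32_t id, parent; };

/* IDs are allocated growing down from the root, so parent > child: */
static const struct snap_node snap_tree[] = {
	{ 1000, 0    },	/* root of this snapshot tree: no parent */
	{  999, 1000 },
	{  998, 1000 },
	{  997,  999 },
};

static uint32_t snap_parent(uint32_t id)
{
	for (size_t i = 0; i < sizeof(snap_tree) / sizeof(snap_tree[0]); i++)
		if (snap_tree[i].id == id)
			return snap_tree[i].parent;
	return 0;
}

/* Walk up from @id; because parents always have larger IDs, we can
 * stop as soon as we reach an ID >= @ancestor: */
static bool snap_is_ancestor(uint32_t id, uint32_t ancestor)
{
	while (id && id < ancestor)
		id = snap_parent(id);
	return id == ancestor;
}
```

A lookup in snapshot 997 would accept keys tagged 997, 999 or 1000 here, and filter out keys tagged 998.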
+
+Subvolume root inodes have two more fields: one for the subvolume ID, and
+another for the parent subvolume ID. The subvolume ID field is only set for
+subvolume roots because otherwise taking a snapshot would require updating every
+inode in that subvolume. With these fields and inode backpointers, we'll be able
+to reconstruct a path to any directory, or any file that hasn't been hardlinked.
+
+Snapshots:
+==========
+
+We're also adding another table (btree) for snapshot keys. Snapshot keys form a
+tree where each node is just a u32. The btree iterator code that filters by
+snapshot ID assumes that parent IDs are always larger than child IDs, so the
+root starts at `U32_MAX`. And, there will be multiple trees - creating a new
+empty subvolume will allocate a new snapshot ID that has no parent node.
+
+Any filesystem operation that's within a subvolume starts by looking up the key
+for that subvolume to get the current snapshot ID, to be used for both lookups
+and updates. This means we have to remember what subvolume we're in, in the
+in-memory `bch_inode_info` - as mentioned previously, only subvolume root inodes
+have this field in the btree.
+
+The btree iterator code is getting two new flags - `BTREE_ITER_ALL_SNAPSHOTS`
+and `BTREE_ITER_FILTER_SNAPSHOTS` - that control how iteration handles the
+snapshot field of the key. One of these flags should be specified for iterators
+for a btree that uses the snapshot field. `BTREE_ITER_ALL_SNAPSHOTS` means
+don't handle the snapshot field specially: it returns every key, and
+advancing/rewinding the iterator position increments/decrements the snapshot
+field. `BTREE_ITER_FILTER_SNAPSHOTS` means incrementing/decrementing the
+iterator position does not include the snapshot field - there's only one
+iterator position for each inode number:offset - and we return the key that
+matches the specified snapshot ID, or the first ancestor if not found.
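The `BTREE_ITER_FILTER_SNAPSHOTS` behaviour at a single inode number:offset position can be modelled like so. Again a toy sketch with assumed names (`lookup_filtered`, a flat parent table) rather than the real iterator code, which reads parent links from the snapshots btree:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative parent links, as would be recorded in the snapshots
 * btree (parent IDs are always larger than child IDs): */
struct filt_snap { uint32_t id, parent; };

static const struct filt_snap filt_tree[] = {
	{ 1000, 0 }, { 999, 1000 }, { 998, 1000 }, { 997, 999 },
};

static uint32_t filt_parent(uint32_t id)
{
	for (size_t i = 0; i < sizeof(filt_tree) / sizeof(filt_tree[0]); i++)
		if (filt_tree[i].id == id)
			return filt_tree[i].parent;
	return 0;
}

/* The keys that exist at one inode:offset position, one per snapshot.
 * Here: "A" written in the root, then overwritten with "B" in 998. */
struct filt_key { uint32_t snapshot; const char *val; };

static const struct filt_key filt_keys[] = {
	{ 1000, "A" },
	{  998, "B" },
};

/* Walk up the parent chain from the snapshot we're reading and return
 * the first key found: either an exact match, or the version from the
 * nearest ancestor snapshot. */
static const char *lookup_filtered(const struct filt_key *keys, size_t nr,
				   uint32_t snapshot)
{
	for (uint32_t id = snapshot; id; id = filt_parent(id))
		for (size_t i = 0; i < nr; i++)
			if (keys[i].snapshot == id)
				return keys[i].val;
	return NULL;	/* position is empty in this snapshot */
}
```

Reading from snapshot 998 sees "B"; reading from 997 or 999 walks up to the root and sees "A".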
+
+The update path, `bch2_trans_commit()`, now has more work to do:
+
+ * When deleting, we have to check if there's another key at the same position
+   in an ancestor snapshot that would become visible - if so, we need to insert
+   a whiteout instead.
+
+ * When a key at the current position in an ancestor snapshot is being
+   overwritten (i.e. it will no longer be visible in the current snapshot), we
+   should also check if it's visible in any other snapshots, and if not,
+   delete it.
+
+Extents turn out not to require much special handling: extent handling has
+already been moved out of the core btree code, and there's now a pass at the
+top of `bch2_trans_commit()` for extents that walks the list of updates,
+compares them to what's being overwritten, and for extents being partially
+overwritten generates more updates for the existing extents. This all does
+basically what we want for snapshots and won't have to change much.
+
+There are implications for splitting and merging of existing extents. If we
+have to split an existing extent (i.e. when copygc is moving around existing
+data and the write path had to fragment it), then for each child snapshot ID of
+the extent's snapshot ID, if the extent is not visible in that snapshot we emit
+a whiteout for the fragment.
+
+Conversely, existing extents may not be merged if one of them is visible in a
+child snapshot and the other is not.
+
+Snapshot deletion:
+==================
+
+In the current design, deleting a snapshot will require walking every btree
+that has snapshots (extents, inodes, dirents and xattrs) to find and delete
+keys with the given snapshot ID. It would be nice to improve this.
+
+Permissions:
+============
+
+Creating a new empty subvolume can be done by untrusted users anywhere they
+could call mkdir().
+
+Creating a snapshot will also be an untrusted operation - the only additional
+requirement being that untrusted users must own the root of the subvolume being
+snapshotted.
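The first update-path rule above - inserting a whiteout rather than deleting outright when an ancestor version would otherwise become visible - can be modelled as follows. All names (`need_whiteout`, the flat tables) are illustrative assumptions, not bcachefs's real data structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative snapshot tree: 999 and 998 are children of root 1000. */
struct del_snap { uint32_t id, parent; };

static const struct del_snap del_tree[] = {
	{ 1000, 0 }, { 999, 1000 }, { 998, 1000 },
};

static uint32_t del_parent(uint32_t id)
{
	for (size_t i = 0; i < sizeof(del_tree) / sizeof(del_tree[0]); i++)
		if (del_tree[i].id == id)
			return del_tree[i].parent;
	return 0;
}

/* Snapshot IDs that have a key at the position we're deleting from: */
static bool key_exists(const uint32_t *snaps, size_t nr, uint32_t id)
{
	for (size_t i = 0; i < nr; i++)
		if (snaps[i] == id)
			return true;
	return false;
}

/* true: emit a whiteout, because some ancestor snapshot has a key at
 * this position that a plain delete would make visible again.
 * false: a plain delete is safe. */
static bool need_whiteout(const uint32_t *snaps, size_t nr, uint32_t snapshot)
{
	for (uint32_t id = del_parent(snapshot); id; id = del_parent(id))
		if (key_exists(snaps, nr, id))
			return true;
	return false;
}

/* Key written in the root and overwritten in 998: */
static const uint32_t del_keys_a[] = { 1000, 998 };
/* Key that exists only in 999: */
static const uint32_t del_keys_b[] = { 999 };
```

Deleting in snapshot 998 or 999 with `del_keys_a` needs a whiteout (the root's version would resurface); deleting the only version, as with `del_keys_b`, does not.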
+
+Fsck:
+=====
+
+The fsck code is going to have to be reworked to use `BTREE_ITER_ALL_SNAPSHOTS`
+and check keys in all snapshots at once, in order to have acceptable
+performance. This has not been started yet.
-- 
cgit v1.2.3