snapshops WIP design doc

author: Kent Overstreet <kent.overstreet@gmail.com> 2019-11-19 16:05:26 -0500
committer: Kent Overstreet <kent.overstreet@gmail.com> 2019-11-19 16:05:26 -0500
commit: a5f1540e6ea2865b4f9538e36c9505829e5da0d4 (patch)
tree: b4784643d3c4e9512ad9e8121f90428649f94503
parent: ea3eb6d4894969cc940c8d9086212aa8bb124f4a (diff)
1 files changed, 166 insertions, 0 deletions
diff --git a/Snapshots.mdwn b/Snapshots.mdwn
new file mode 100644
index 0000000..324b6bf
--- /dev/null
+++ b/Snapshots.mdwn
@@ -0,0 +1,166 @@
+
+Snapshots in bcachefs:
+======================
+
+The plan for snapshots in bcachefs is radically different than anything that's
+been done previously in other filesystems, and the end result should be
+drastically more efficient than e.g. btrfs. Snapshots will be very cheap to
+create and keep around; in particular sparse snapshots (many snapshots with only
+a few changes each) will be very cheap.
+
+First thing to note - bcachefs the filesystem is built on top of bcachefs the
+key/value store; all filesystem objects are just key/value pairs indexed by
+inode:offset in a small number of btrees, and the btree code just treats the key
+as a large integer. If this was a relational database, we'd have an extents
+table, an inodes table, a dirents table, etc.
+
+To implement snapshots, the key will be extended so that the low 32 bits will
+now be a snapshot ID (this already exists in struct bpos, it's just unused by
+the existing code).
+
+Snapshots themselves will form a tree, where each entry in the tree has a 32 bit
+snapshot ID and - importantly - we enforce that child snapshots have a node ID
+less than their parent; thus, the root snapshot will have the id U32_MAX.
+
+Then, when looking up filesystem data, our search key will include the snapshot
+ID of the snapshot we're currently in. The btree lookup returns the first key
+greater than or equal to the search key, so a lookup will return either a key in
+the current snapshot (if it exists) or (in the common case) a key in some
+ancestor snapshot - the only additional work we have to do for snapshots is to
+check the snapshot ID of the key return against our current snapshot and iterate
+forwards until we find a key in the current snapshot or an ancestor.
+
+That's pretty much the bcachefs snapshot design in a nutshell, for normal keys.
+Given typical filesystem snapshot usage, I expect the overhead from linear scan
+over unrelated snapshots to be minimal to nonexistent for most users. There are
+instances where this won't be the case - i.e. using snapshots for a virtual
+machine base image; many virtual machines would have their own overwrites to the
+same part of the keyspace. Reflink - which is already implemented in bcachefs -
+doesn't suffer from this problem, so either we would recommend users use reflink
+for those cases, or ideally we would be able to detect this condition and
+transparently do something like reflink when we detect a high density of
+overwrites in part of the keyspace.
+
+Extents:
+========
+
+Extents, however, are more complicated - a key in a child snapshot always
+overwrites/replicas a key in an ancestor snapshot exactly, but extents can
+partially overlap in arbitrary ways. Some examples below will illustrate the
+problems that snapshots introduce, and then a proposed solution.
+
+A note: in bcachefs, extents are indexed by their end position, not their start.
+This is more natural given typical usage: when doing a read we want the first
+extent that ends after the start of the range we're reading from.
+
+In the examples below, 1 is the ancestor snapshot/extent and 2 is the child, and
+extents are listed in their sort order in the btree.
+
+Suppose we have these two extents:
+
+1:   |--------|
+2:       |---------|
+
+This is a problem: on lookup (for snapshot 2), we find the extent in 1 first, as
+we should - we do require data from it. But we cannot use all of that extent; 2
+overwrites it, so we have to keep scanning forwards to find out how much of that
+extent we can use. But that forward scan has no stopping condition - extents are
+indexed by the end and there can be an arbitrary number of extents in unrelated
+snapshots in between.
+
+Suppose we add the restriction: a child extent cannot cross the boundary at the
+end of the ancestor extent, instead we split the child extent:
+
+2:       |----|
+1:   |--------|
+2:            |----|
+
+This is still a problem for lookups, for example:
+
+    |------|
+   |-------|
+  |--------|
+ |---------|
+|----------|
+
+We must also add the restriction: start of child cannot land in middle of
+ancestor. Then we have these extents.
+
+1:   |---|
+2:       |----|
+1:       |----|
+2:            |----|
+
+This appears workable - however, given that we can have an arbitrary number of
+extents in different snapshots at the same position, this means we'll have to
+either split an unbounded number of existing extents - or we could only split
+the extent that's visible in the child snapshot, but this seems to result in
+invariants that would be difficult or impossible to check:
+
+5:      |----|
+4: |---||----|
+3: |---------|
+2: |---------|
+1: |---------|
+
+A different approach:
+---------------------
+
+The original problem arises because introducing snapshots broke the invariant
+that extents are (also) ordered by start of extent (which was a result of
+sorting by end of extend and not having overlapping extents). Suppose we added a
+method for iterating across keys in a btree node in order the extent start
+position.
+
+Back to the first example:
+
+1:   |--------|
+2:       |---------|
+
+Suppose we're reading from snapshot 2 at the position where extent 1 starts.
+Lookup returns extent 1, as it should - but we need to determine when to stop
+reading from extent 1. If we can scan forwards from extent 1, ordered by extent
+start position, we can scan forwards until we find an extent that overwrites 1
+(is in a newer snapshot than 1 and a descendant of the snapshot we're reading),
+or until we see an extent that starts after the position where extent 1 ends.
+
+* Example 2:
+
+2:      |----|
+1: |--------------|
+
+Suppose we're reading at the position where extent 1 starts. Lookup will return
+extent 2, but we actually want extent 1: What we can do is notice that the
+extent lookup return starts after the lookup position - so then iterate
+backwards, ordered by extent start position, until we find an extent that starts
+at or before our search position.
+
+* Example 3:
+
+1:      |----|
+2: |--------------|
+
+Reading from snapshot 2, at the position where extent 1 starts. Lookup returns
+extent 1, but we actually want extent 2.
+
+* Example 4:
+
+2:          |----|
+1:     |--------------|
+3: |-----------------------|
+
+Reading from 3, at the start of extent 1: Then what?
+
+Lookup returns extent 2, then as in example 2 we scan backwards by start of
+extent pos to get extent 1.
+
+
+
+
+
+2:      |-----|
+1: |----------------|
+
+2:      |-----|
+1: |----------------|
+2:            |------------|
author	Kent Overstreet <kent.overstreet@gmail.com>	2019-11-19 16:05:26 -0500
committer	Kent Overstreet <kent.overstreet@gmail.com>	2019-11-19 16:05:26 -0500
commit	a5f1540e6ea2865b4f9538e36c9505829e5da0d4 (patch)
tree	b4784643d3c4e9512ad9e8121f90428649f94503
parent	ea3eb6d4894969cc940c8d9086212aa8bb124f4a (diff)