From a5f1540e6ea2865b4f9538e36c9505829e5da0d4 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Tue, 19 Nov 2019 16:05:26 -0500 Subject: snapshops WIP design doc --- Snapshots.mdwn | 166 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 166 insertions(+) create mode 100644 Snapshots.mdwn diff --git a/Snapshots.mdwn b/Snapshots.mdwn new file mode 100644 index 0000000..324b6bf --- /dev/null +++ b/Snapshots.mdwn @@ -0,0 +1,166 @@ + +Snapshots in bcachefs: +====================== + +The plan for snapshots in bcachefs is radically different than anything that's +been done previously in other filesystems, and the end result should be +drastically more efficient than e.g. btrfs. Snapshots will be very cheap to +create and keep around; in particular sparse snapshots (many snapshots with only +a few changes each) will be very cheap. + +First thing to note - bcachefs the filesystem is built on top of bcachefs the +key/value store; all filesystem objects are just key/value pairs indexed by +inode:offset in a small number of btrees, and the btree code just treats the key +as a large integer. If this was a relational database, we'd have an extents +table, an inodes table, a dirents table, etc. + +To implement snapshots, the key will be extended so that the low 32 bits will +now be a snapshot ID (this already exists in struct bpos, it's just unused by +the existing code). + +Snapshots themselves will form a tree, where each entry in the tree has a 32 bit +snapshot ID and - importantly - we enforce that child snapshots have a node ID +less than their parent; thus, the root snapshot will have the id U32_MAX. + +Then, when looking up filesystem data, our search key will include the snapshot +ID of the snapshot we're currently in. The btree lookup returns the first key +greater than or equal to the search key, so a lookup will return either a key in +the current snapshot (if it exists) or (in the common case) a key in some +ancestor snapshot - the only additional work we have to do for snapshots is to +check the snapshot ID of the key return against our current snapshot and iterate +forwards until we find a key in the current snapshot or an ancestor. + +That's pretty much the bcachefs snapshot design in a nutshell, for normal keys. +Given typical filesystem snapshot usage, I expect the overhead from linear scan +over unrelated snapshots to be minimal to nonexistent for most users. There are +instances where this won't be the case - i.e. using snapshots for a virtual +machine base image; many virtual machines would have their own overwrites to the +same part of the keyspace. Reflink - which is already implemented in bcachefs - +doesn't suffer from this problem, so either we would recommend users use reflink +for those cases, or ideally we would be able to detect this condition and +transparently do something like reflink when we detect a high density of +overwrites in part of the keyspace. + +Extents: +======== + +Extents, however, are more complicated - a key in a child snapshot always +overwrites/replicas a key in an ancestor snapshot exactly, but extents can +partially overlap in arbitrary ways. Some examples below will illustrate the +problems that snapshots introduce, and then a proposed solution. + +A note: in bcachefs, extents are indexed by their end position, not their start. +This is more natural given typical usage: when doing a read we want the first +extent that ends after the start of the range we're reading from. + +In the examples below, 1 is the ancestor snapshot/extent and 2 is the child, and +extents are listed in their sort order in the btree. + +Suppose we have these two extents: + +1: |--------| +2: |---------| + +This is a problem: on lookup (for snapshot 2), we find the extent in 1 first, as +we should - we do require data from it. But we cannot use all of that extent; 2 +overwrites it, so we have to keep scanning forwards to find out how much of that +extent we can use. But that forward scan has no stopping condition - extents are +indexed by the end and there can be an arbitrary number of extents in unrelated +snapshots in between. + +Suppose we add the restriction: a child extent cannot cross the boundary at the +end of the ancestor extent, instead we split the child extent: + +2: |----| +1: |--------| +2: |----| + +This is still a problem for lookups, for example: + + |------| + |-------| + |--------| + |---------| +|----------| + +We must also add the restriction: start of child cannot land in middle of +ancestor. Then we have these extents. + +1: |---| +2: |----| +1: |----| +2: |----| + +This appears workable - however, given that we can have an arbitrary number of +extents in different snapshots at the same position, this means we'll have to +either split an unbounded number of existing extents - or we could only split +the extent that's visible in the child snapshot, but this seems to result in +invariants that would be difficult or impossible to check: + +5: |----| +4: |---||----| +3: |---------| +2: |---------| +1: |---------| + +A different approach: +--------------------- + +The original problem arises because introducing snapshots broke the invariant +that extents are (also) ordered by start of extent (which was a result of +sorting by end of extend and not having overlapping extents). Suppose we added a +method for iterating across keys in a btree node in order the extent start +position. + +Back to the first example: + +1: |--------| +2: |---------| + +Suppose we're reading from snapshot 2 at the position where extent 1 starts. +Lookup returns extent 1, as it should - but we need to determine when to stop +reading from extent 1. If we can scan forwards from extent 1, ordered by extent +start position, we can scan forwards until we find an extent that overwrites 1 +(is in a newer snapshot than 1 and a descendant of the snapshot we're reading), +or until we see an extent that starts after the position where extent 1 ends. + +* Example 2: + +2: |----| +1: |--------------| + +Suppose we're reading at the position where extent 1 starts. Lookup will return +extent 2, but we actually want extent 1: What we can do is notice that the +extent lookup return starts after the lookup position - so then iterate +backwards, ordered by extent start position, until we find an extent that starts +at or before our search position. + +* Example 3: + +1: |----| +2: |--------------| + +Reading from snapshot 2, at the position where extent 1 starts. Lookup returns +extent 1, but we actually want extent 2. + +* Example 4: + +2: |----| +1: |--------------| +3: |-----------------------| + +Reading from 3, at the start of extent 1: Then what? + +Lookup returns extent 2, then as in example 2 we scan backwards by start of +extent pos to get extent 1. + + + + + +2: |-----| +1: |----------------| + +2: |-----| +1: |----------------| +2: |------------| -- cgit v1.2.3