Snapshots in bcachefs: ====================== The plan for snapshots in bcachefs is radically different than anything that's been done previously in other filesystems, and the end result should be drastically more efficient than e.g. btrfs. Snapshots will be very cheap to create and keep around; in particular sparse snapshots (many snapshots with only a few changes each) will be very cheap. First thing to note - bcachefs the filesystem is built on top of bcachefs the key/value store; all filesystem objects are just key/value pairs indexed by inode:offset in a small number of btrees, and the btree code just treats the key as a large integer. If this was a relational database, we'd have an extents table, an inodes table, a dirents table, etc. To implement snapshots, the key will be extended so that the low 32 bits will now be a snapshot ID (this already exists in struct bpos, it's just unused by the existing code). Snapshots themselves will form a tree, where each entry in the tree has a 32 bit snapshot ID and - importantly - we enforce that child snapshots have a node ID less than their parent; thus, the root snapshot will have the id U32_MAX. Then, when looking up filesystem data, our search key will include the snapshot ID of the snapshot we're currently in. The btree lookup returns the first key greater than or equal to the search key, so a lookup will return either a key in the current snapshot (if it exists) or (in the common case) a key in some ancestor snapshot - the only additional work we have to do for snapshots is to check the snapshot ID of the key return against our current snapshot and iterate forwards until we find a key in the current snapshot or an ancestor. That's pretty much the bcachefs snapshot design in a nutshell, for normal keys. Given typical filesystem snapshot usage, I expect the overhead from linear scan over unrelated snapshots to be minimal to nonexistent for most users. There are instances where this won't be the case - i.e. using snapshots for a virtual machine base image; many virtual machines would have their own overwrites to the same part of the keyspace. Reflink - which is already implemented in bcachefs - doesn't suffer from this problem, so either we would recommend users use reflink for those cases, or ideally we would be able to detect this condition and transparently do something like reflink when we detect a high density of overwrites in part of the keyspace. Extents: ======== Extents, however, are more complicated - a key in a child snapshot always overwrites/replicas a key in an ancestor snapshot exactly, but extents can partially overlap in arbitrary ways. Some examples below will illustrate the problems that snapshots introduce, and then a proposed solution. A note: in bcachefs, extents are indexed by their end position, not their start. This is more natural given typical usage: when doing a read we want the first extent that ends after the start of the range we're reading from. In the examples below, 1 is the ancestor snapshot/extent and 2 is the child, and extents are listed in their sort order in the btree. Suppose we have these two extents: 1: |--------| 2: |---------| This is a problem: on lookup (for snapshot 2), we find the extent in 1 first, as we should - we do require data from it. But we cannot use all of that extent; 2 overwrites it, so we have to keep scanning forwards to find out how much of that extent we can use. But that forward scan has no stopping condition - extents are indexed by the end and there can be an arbitrary number of extents in unrelated snapshots in between. Suppose we add the restriction: a child extent cannot cross the boundary at the end of the ancestor extent, instead we split the child extent: 2: |----| 1: |--------| 2: |----| This is still a problem for lookups, for example: |------| |-------| |--------| |---------| |----------| We must also add the restriction: start of child cannot land in middle of ancestor. Then we have these extents. 1: |---| 2: |----| 1: |----| 2: |----| This appears workable - however, given that we can have an arbitrary number of extents in different snapshots at the same position, this means we'll have to either split an unbounded number of existing extents - or we could only split the extent that's visible in the child snapshot, but this seems to result in invariants that would be difficult or impossible to check: 5: |----| 4: |---||----| 3: |---------| 2: |---------| 1: |---------| A different approach: --------------------- The original problem arises because introducing snapshots broke the invariant that extents are (also) ordered by start of extent (which was a result of sorting by end of extend and not having overlapping extents). Suppose we added a method for iterating across keys in a btree node in order the extent start position. Back to the first example: 1: |--------| 2: |---------| Suppose we're reading from snapshot 2 at the position where extent 1 starts. Lookup returns extent 1, as it should - but we need to determine when to stop reading from extent 1. If we can scan forwards from extent 1, ordered by extent start position, we can scan forwards until we find an extent that overwrites 1 (is in a newer snapshot than 1 and a descendant of the snapshot we're reading), or until we see an extent that starts after the position where extent 1 ends. Example 2: 2: |----| 1: |--------------| Suppose we're reading at the position where extent 1 starts. Lookup will return extent 2, but we actually want extent 1: What we can do is notice that the extent lookup return starts after the lookup position - so then iterate backwards, ordered by extent start position, until we find an extent that starts at or before our search position. Example 3: 1: |----| 2: |--------------| Reading from snapshot 2, at the position where extent 1 starts. Lookup returns extent 1, but we actually want extent 2. Example 4: 2: |----| 1: |--------------| 3: |-----------------------| Reading from 3, at the start of extent 1: Then what? Lookup returns extent 2, then as in example 2 we scan backwards by start of extent pos to get extent 1. 2: |-----| 1: |----------------| 2: |-----| 1: |----------------| 2: |------------|