Snapshots.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166


Snapshots in bcachefs:
======================

The plan for snapshots in bcachefs is radically different than anything that's
been done previously in other filesystems, and the end result should be
drastically more efficient than e.g. btrfs. Snapshots will be very cheap to
create and keep around; in particular sparse snapshots (many snapshots with only
a few changes each) will be very cheap.

First thing to note - bcachefs the filesystem is built on top of bcachefs the
key/value store; all filesystem objects are just key/value pairs indexed by
inode:offset in a small number of btrees, and the btree code just treats the key
as a large integer. If this was a relational database, we'd have an extents
table, an inodes table, a dirents table, etc.

To implement snapshots, the key will be extended so that the low 32 bits will
now be a snapshot ID (this already exists in struct bpos, it's just unused by
the existing code).

Snapshots themselves will form a tree, where each entry in the tree has a 32 bit
snapshot ID and - importantly - we enforce that child snapshots have a node ID
less than their parent; thus, the root snapshot will have the id U32_MAX.

Then, when looking up filesystem data, our search key will include the snapshot
ID of the snapshot we're currently in. The btree lookup returns the first key
greater than or equal to the search key, so a lookup will return either a key in
the current snapshot (if it exists) or (in the common case) a key in some
ancestor snapshot - the only additional work we have to do for snapshots is to
check the snapshot ID of the key return against our current snapshot and iterate
forwards until we find a key in the current snapshot or an ancestor.

That's pretty much the bcachefs snapshot design in a nutshell, for normal keys.
Given typical filesystem snapshot usage, I expect the overhead from linear scan
over unrelated snapshots to be minimal to nonexistent for most users. There are
instances where this won't be the case - i.e. using snapshots for a virtual
machine base image; many virtual machines would have their own overwrites to the
same part of the keyspace. Reflink - which is already implemented in bcachefs -
doesn't suffer from this problem, so either we would recommend users use reflink
for those cases, or ideally we would be able to detect this condition and
transparently do something like reflink when we detect a high density of
overwrites in part of the keyspace.

Extents:
========

Extents, however, are more complicated - a key in a child snapshot always
overwrites/replicas a key in an ancestor snapshot exactly, but extents can
partially overlap in arbitrary ways. Some examples below will illustrate the
problems that snapshots introduce, and then a proposed solution.

A note: in bcachefs, extents are indexed by their end position, not their start.
This is more natural given typical usage: when doing a read we want the first
extent that ends after the start of the range we're reading from.

In the examples below, 1 is the ancestor snapshot/extent and 2 is the child, and
extents are listed in their sort order in the btree.

Suppose we have these two extents:

    1:   |--------|
    2:       |---------|

This is a problem: on lookup (for snapshot 2), we find the extent in 1 first, as
we should - we do require data from it. But we cannot use all of that extent; 2
overwrites it, so we have to keep scanning forwards to find out how much of that
extent we can use. But that forward scan has no stopping condition - extents are
indexed by the end and there can be an arbitrary number of extents in unrelated
snapshots in between.

Suppose we add the restriction: a child extent cannot cross the boundary at the
end of the ancestor extent, instead we split the child extent:

    2:       |----|
    1:   |--------|
    2:            |----|

This is still a problem for lookups, for example:

        |------|
       |-------|
      |--------|
     |---------|
    |----------|

We must also add the restriction: start of child cannot land in middle of
ancestor. Then we have these extents.

    1:   |---|
    2:       |----|
    1:       |----|
    2:            |----|

This appears workable - however, given that we can have an arbitrary number of
extents in different snapshots at the same position, this means we'll have to
either split an unbounded number of existing extents - or we could only split
the extent that's visible in the child snapshot, but this seems to result in
invariants that would be difficult or impossible to check:

    5:      |----|
    4: |---||----|
    3: |---------|
    2: |---------|
    1: |---------|

A different approach:
---------------------

The original problem arises because introducing snapshots broke the invariant
that extents are (also) ordered by start of extent (which was a result of
sorting by end of extend and not having overlapping extents). Suppose we added a
method for iterating across keys in a btree node in order the extent start
position.

Back to the first example:

    1:   |--------|
    2:       |---------|

Suppose we're reading from snapshot 2 at the position where extent 1 starts.
Lookup returns extent 1, as it should - but we need to determine when to stop
reading from extent 1. If we can scan forwards from extent 1, ordered by extent
start position, we can scan forwards until we find an extent that overwrites 1
(is in a newer snapshot than 1 and a descendant of the snapshot we're reading),
or until we see an extent that starts after the position where extent 1 ends.

Example 2:

    2:      |----|
    1: |--------------|

Suppose we're reading at the position where extent 1 starts. Lookup will return
extent 2, but we actually want extent 1: What we can do is notice that the
extent lookup return starts after the lookup position - so then iterate
backwards, ordered by extent start position, until we find an extent that starts
at or before our search position.

Example 3:

    1:      |----|
    2: |--------------|

Reading from snapshot 2, at the position where extent 1 starts. Lookup returns
extent 1, but we actually want extent 2.

Example 4:

    2:          |----|
    1:     |--------------|
    3: |-----------------------|

Reading from 3, at the start of extent 1: Then what?

Lookup returns extent 2, then as in example 2 we scan backwards by start of
extent pos to get extent 1.


2:      |-----|
1: |----------------|

2:      |-----|
1: |----------------|
2:            |------------|