Snapshots.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135


Snapshots & subvolumes:
=======================

The short version:

bcachefs snapshots are a different approach - we're not using COW btrees.
Instead, they're based on extending the key filesystem items (inodes, dirents,
xattrs, extents) with a version number - a snapshot id - for the low bits.

Snapshot IDs form a tree, which will be recorded in the snapshots btree. The
root snapshot ID is `U32_MAX`, and we allocate new snapshot IDs growing down so
that the parent of a given snapshot ID is always larger than that ID.

To create a new writeable snapshot, we allocate two new snapshot IDs. We update
the existing subvolume with one of the new snapshot IDs, and assign the other
snapshot ID to the new snapshot.

When we do a lookup for a filesystem item, we have to check if the snapshot ID
of the key we found is an ancestor of the snapshot ID we're searching for, and
filter out items that aren't.

Subvolumes:
===========

Subvolumes are needed for two reasons:
 * They're the mechanism for accessing snapshots
 * Also, they're a way of isolating different parts of the filesystem heirarchy
   from snapshots, or taking snapshots that aren't global. I.e. you'd create a
   subvolume for your database so that filesystem snapshots don't COW it, or
   create subvolumes for each home directory so that users can snapshot their
   own home directories.

The functionality and userspace interface for snapshots and subvolumes are
roughly modelled after btrfs, but simplified.

Subvolumes in bcachefs are just fancy directories. We introduce internally a new
dirent type that can point to subvolumes, instead of inodes, and the subvolume
has a pointer to the root inode. Subvolumes get their own table (btree), and
subvolume keys have fields for root inode number and snapshot ID.

Subvolumes have no name outside of the filesystem heirarchy. This means that, in
order to enumerate and list subvolumes, we need to be able to reconstruct their
path.

To reconstruct paths, we're adding inode backpointers - two new inode fields for
the inode number of the directory they're in, and the dirent offset. We're only
adding fields for a single backpointer, i.e. we're not handling hardlinks yet -
we set an inode flag indicating the backpointer fields are untrusted whenever we
create a hardlink. If we do ever want to add multiple backpointers for files
with hardlinks, we'll need to add another btree where each backpointer gets its
own key. This should be good enough for now.

Subvolume root inodes have two more fields: one for the subvolume ID, and
another for the parent subvolume ID. The subvolume ID field is only set for
subvolume roots because otherwise taking a snapshot would require updating every
inode in that subvolume. With these fields and inode backpointers, we'll be able
to reconstruct a path to any directory, or any file that hasn't been hardlinked.

Snapshots:
==========

We're also adding another table (btree) for snapshot keys. Snapshot keys form a
tree where each node is just a u32. The btree iterator code that filters by
snapshot ID assumes that a parent IDs are always larger than child IDs, so the
root starts at `U32_MAX`. And, there will be multiple trees - creating a new
empty subvolume will allocate a new snapshot ID that has no parent node.

Any filesystem operation that's within a subvolume starts by looking up the key
for that subvolume to get the current snapshot ID, to be used for both lookups
and updates. This means we have to remember what subvolume we're in, in the in
memory `bch_inode_info` - as mentioned previously only subvolume root inodes
have this field in the btree.

The btree iterator code is getting two new flags - `BTREE_ITER_ALL_SNAPSHOTS`
and `BTREE_ITER_FILTER_SNAPSHOTS`, that controls how iteration handles the
snapshot field of the key. One of these flags should be specified for iterators
for a btree that uses the snapshots field. `BTREE_ITER_ALL_SNAPSHOTS` means
don't handle the snapshot field specially: it returns every key, and
advancing/rewinding the iterator position increments/decrements the snapshot
field. `BTREE_ITER_FILTER_SNAPSHOTS` means incrementing/decrementing the
iterator position does not include the snapshot field - there's only one
iterator position for each inode number:offset - and we return the key that
matches the specified snapshot ID, or the first ancestor if not found.

The update path, `bch2_trans_commit()` now has more work to do:

 * When deleting, we have to check if there's another key at the same position
   in an ancestor snapshot that would become visible - if so, we need to insert
   a whiteout instead.

 * When there was a key at the current position in an ancestor snapshot ID
   that's being overwritten (i.e. it will no longer be visible in the current
   snapshot) - we should also check if it's visible in any other snapshots, and
   if not, delete it.

Extents turn out to not require much special handling: extent handling has
already been moved out of the core btree code, and there's now a pass at the top
of `bch2_trans_commit()` for extents that walks the list of updates and compares
them to what's being overwritten, and for extents being partially overwritten
generates more updates for the existing extents. This all does basically what we
want for snapshots and won't have to change much.

There are implications for splitting and merging of existing extents. If we have
to split an existing extent (i.e. when copygc is moving around existing data and
the write path had to fragment it) - for each child snapshot ID of the extent's
snapshot ID, if the extent is not visible in that snapshot emit a whiteout for
the fragment.

Conversely, existing extents may not be merged if one of them is visible in a
child snapshot and the other is not.

Snapshot deletion:
==================

In the current design, deleting a snapshot will require walking every btree that
has snapshots (extents, inodes, dirents and xattrs) to find and delete keys with
the given snapshot ID. It would be nice to improve this.

Permissions:
============

Creating a new empty subvolume can be done by untrusted users anywhere they
could call mkdir().

Creating a snapshot will also be an untrusted operation - the only additional
requirement being that untrusted users must own the root of the subvolume being
snapshotted.

Fsck:
=====

The fsck code is going to have to be reworked to use `BTREE_ITER_ALL_SNAPSHOTS`
and check keys in all snapshots all at once, in order to have acceptable
performance. This has not been started on yet.