Fsck in bcachefs:
=================

This document is an overview of fsck in bcachefs, as well as what will need to
change for snapshots.

In bcachefs, the fsck code is only responsible for checking higher level
filesystem structure. Btree topology and allocation information are checked by
`btree_gc.c`, which doesn't need to be aware of snapshots. This reduces the
scope of fsck considerably.

What it does have to check is:

* For each extent, dirent and xattr, that a corresponding inode exists and is of
  the appropriate type.

* Dirents and xattrs are indexed by hash, with linear probing. The fsck code is
  responsible for checking that they're stored at the correct index (in case of
  hash collisions, every slot between the hash value and the actual index must
  be occupied by keys or hash whiteouts) and that there are no duplicates. (The
  probe-chain invariant is sketched in code just after this list.)

* Dirents: check that target inode exists and matches dirent's `d_type`, and
  check the target inode's backpointer fields.

* Extents:
  * Check that extents do not overlap
  * Check that extents do not exist past inode's `i_size`
  * Count extents for each inode and check inode's `i_sectors`.

* Check that root directory exists

* Check that lost+found exists

* Check directory structure: currently, this is done with a depth-first traversal
  from the root directory. We check for directories with multiple links, and
  then scan the inodes btree for directories that weren't found by the DFS.

* Check each inode's link count: At this point we know that every dirent exists
  in a directory (has a corresponding inode of the correct type), has a valid
  target, and is reachable (because we've checked that all directories are
  reachable). All that's left is to scan the dirents btree, counting the number
  of dirents that point to each inode number, and verifying that against each
  inode's link count.
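
To make the hash-index check concrete, here's a minimal standalone sketch of
the probe-chain invariant. The flat `slots[]` model and the helper are
illustrative stand-ins, not the actual fsck code, which works on btree keys:

    #include <stdbool.h>
    #include <stdio.h>

    /* Flat model of one hash-indexed btree (dirents or xattrs): every
     * slot is empty, holds a key, or holds a hash whiteout. */
    enum slot_state { SLOT_EMPTY, SLOT_KEY, SLOT_WHITEOUT };

    struct slot {
        enum slot_state state;
        unsigned        hash;   /* ideal index: hash of the key's name */
    };

    /*
     * Linear probing invariant: a key stored at @idx must have every
     * slot from its hash position up to @idx occupied by keys or hash
     * whiteouts - an empty slot in between would have terminated a
     * lookup probe before it reached the key.
     */
    static bool check_probe_chain(const struct slot *slots, unsigned nr,
                                  unsigned idx)
    {
        for (unsigned i = slots[idx].hash; i != idx; i = (i + 1) % nr)
            if (slots[i].state == SLOT_EMPTY) {
                fprintf(stderr,
                        "key at %u (hash %u): probe chain broken at %u\n",
                        idx, slots[idx].hash, i);
                return false;
            }
        return true;
    }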


Changes for snapshots:
======================

Subvolumes vs. snapshots:

Filesystem code has to know what subvolume it is in, in order to fetch the
current snapshot ID at the start of every btree transaction. Fsck code does not
know or care what subvolume a given key belongs to - a key may belong to multiple
subvolumes, if it's in a snapshot that's an interior node in the snapshots
tree.

Instead, when one key references another (e.g. a dirent that points to an
inode), it does the lookup for the inode by searching for an inode with a
snapshot ID equal to or an ancestor of the dirent key's snapshot ID.
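
A sketch of what that lookup might look like, assuming an in-memory table
mapping each snapshot ID to its parent (the names here are illustrative, not
the actual bcachefs interfaces):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical in-memory snapshot tree: parent[id] is id's parent
     * snapshot ID, 0 for the root. */
    struct snapshot_table {
        unsigned        *parent;
        unsigned        nr;
    };

    /* True if @anc is @id itself or one of @id's ancestors. */
    static bool snapshot_is_ancestor(const struct snapshot_table *t,
                                     unsigned anc, unsigned id)
    {
        while (id && id != anc)
            id = t->parent[id];
        return id == anc;
    }

    struct inode_version {
        unsigned        snapshot;
        /* ... the rest of the inode ... */
    };

    /*
     * Given every version of one inode number, sorted from most to
     * least recent snapshot, return the version a key with snapshot ID
     * @snapshot sees: the first whose snapshot ID equals @snapshot or
     * is an ancestor of it.
     */
    static struct inode_version *
    lookup_inode_for_snapshot(const struct snapshot_table *t,
                              struct inode_version *v, unsigned nr,
                              unsigned snapshot)
    {
        for (unsigned i = 0; i < nr; i++)
            if (snapshot_is_ancestor(t, v[i].snapshot, snapshot))
                return &v[i];
        return NULL;
    }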
 
IOW, snapshot IDs represent (branchable) "points in time", and fsck wants to
check that every view of the filesystem, at every point in time, is consistent.

Checking the filesystem for consistency in this fashion is conceptually
straightforward - every key has a snapshot ID, and we use that when looking up
anything that it references. However, we'll have to be careful about how we
repair inconsistencies when the snapshot ID is an interior node:

If we repair an inconsistency by deleting a key, we need to ensure that key is
not referenced by child snapshots. For example, if we find an extent that does
not have an `S_ISREG` inode, currently we delete it. But we might find an extent
that doesn't have a corresponding inode with a snapshot ID equal to or an
ancestor of the extent's snapshot ID, but does have an inode in a child
snapshot. If that were to occur, we should probably copy that inode to the
extent's snapshot ID.
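
Expressed as code, that repair policy might look something like this (every
function here is a hypothetical stand-in for the real btree lookups and
updates):

    #include <stdbool.h>

    /* Hypothetical stand-ins, not existing fsck helpers: */
    bool inode_exists_in_descendant(unsigned long long inum, unsigned snapshot);
    void copy_inode_to_snapshot(unsigned long long inum, unsigned snapshot);
    void delete_extent(unsigned long long inum, unsigned snapshot);

    /*
     * An extent has no S_ISREG inode at or above its snapshot ID: only
     * delete it if no child snapshot still has an inode for it;
     * otherwise repair by copying the child's inode to the extent's
     * snapshot ID, so data visible in snapshots is preserved.
     */
    static void repair_extent_without_inode(unsigned long long inum,
                                            unsigned snapshot)
    {
        if (inode_exists_in_descendant(inum, snapshot))
            copy_inode_to_snapshot(inum, snapshot);
        else
            delete_extent(inum, snapshot);
    }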

We would like it to be as hard a rule as possible that fsck never deletes or
modifies keys that belong to interior node snapshot IDs, because when people are
using snapshots they'll generally want the data in that snapshot to always stay
intact and it would be rather bad form if data was torched due to a bug in fsck.

It may not be possible to adhere to this rule while fixing all filesystem
inconsistencies - e.g. if extents exist above an inode's `i_size`, then
generally speaking the right thing to do is to delete that extent. But if we
find inconsistencies at interior node snapshot IDs, then perhaps the best thing
to do would be to just leave them, or only apply changes in leaf node snapshot
IDs.

Algorithmic considerations:
===========================

We want bcachefs to scale well into the millions of snapshots. This means that
there can't generally be operations that have to run for each subvolume or
snapshot - the fsck algorithms have to work one key at a time, processing them
in natural btree order.

This means that the extents, dirents and xattrs passes need to use
`BTREE_ITER_ALL_SNAPSHOTS`, meaning that as they walk the keys for a given
inode, they'll be seeing keys from every snapshot version mixed together - and
there will be multiple versions of the inode, in different snapshots, that the
keys need to be checked against.

There's a helper in the fsck code that these passes use, `walk_inode()`, that
fetches the inode for the current inode number when it changes. This will
probably change to load every copy of that inode, in each snapshot ID. Also,
where we have to maintain state we'll have to duplicate that state for every
snapshot ID that exists at that inode number.
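
The rough shape of such a pass, modeled on a flat array of keys rather than the
real btree iterator (`walk_inode_versions()` is a hypothetical stand-in for the
updated `walk_inode()`):

    /* Simplified extent key: the btree sorts by (inode, offset, snapshot). */
    struct key {
        unsigned long long      inode;
        unsigned long long      offset;
        unsigned                snapshot;
    };

    void walk_inode_versions(unsigned long long inum);  /* hypothetical */

    /*
     * Iterating with (the equivalent of) BTREE_ITER_ALL_SNAPSHOTS: keys
     * for one inode number arrive with all their snapshot versions
     * interleaved, so when the inode number changes we must load every
     * snapshot version of the new inode, and any per-inode state has to
     * be kept per snapshot ID.
     */
    static void extents_pass(const struct key *keys, unsigned nr)
    {
        unsigned long long cur_inode = (unsigned long long) -1;

        for (unsigned i = 0; i < nr; i++) {
            if (keys[i].inode != cur_inode) {
                cur_inode = keys[i].inode;
                walk_inode_versions(cur_inode);
            }
            /* ... check keys[i] against the inode version(s) it's
             * visible in ... */
        }
    }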

We'll also need to know, for every key that we process, which snapshot IDs it is
visible in: so that e.g. the extents pass can correctly count `i_sectors`.
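
For instance, `i_sectors` accounting might use that visibility helper along
these lines (`key_visible_in_snapshot()` is assumed, standing in for the helper
referred to above):

    #include <stdbool.h>
    #include <stddef.h>

    /* The visibility helper discussed above (assumed): true if the key
     * version with snapshot ID @key_snap is the version seen from
     * snapshot @snap - i.e. @key_snap is an ancestor of @snap and no
     * newer version overwrites it in between. */
    bool key_visible_in_snapshot(unsigned key_snap, unsigned snap);

    /* Per-snapshot i_sectors accounting for one inode number. */
    struct inode_sectors {
        unsigned                snapshot;
        unsigned long long      sectors;
    };

    /* Credit an extent to every inode version it is visible in. */
    static void count_extent_sectors(struct inode_sectors *s, unsigned nr,
                                     unsigned extent_snapshot,
                                     unsigned long long sectors)
    {
        for (unsigned i = 0; i < nr; i++)
            if (key_visible_in_snapshot(extent_snapshot, s[i].snapshot))
                s[i].sectors += sectors;
    }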

Note that "millions of snapshots" will not _typically_ mean "millions of
snapshot IDs at the same time" - when we allocate new inodes we pick inode
numbers that aren't in use in any other snapshot or subvolume, and most files in
a filesystem are created, written once, and then later atomically replaced with
a rename call. When we're dealing with a file that has keys with many different
snapshot IDs we need to make sure we don't have any hidden O(n^2) algorithms,
but if fsck requires hundreds of megabytes of state to check these files with
worst case numbers of overwrites, that's not the end of the world.

## Checking directory structure:

The current approach of doing a DFS from the root directory still works with
snapshots, but is algorithmically too slow. If we have many sparse snapshots,
the DFS would recurse into each of those snapshots, walking directory structure
it has already seen in the original subvolume - and since each of those
snapshots could have directory structure changes in one or two places, it would
have to. We need an algorithm that iterates over keys, not filesystem structure.

Fortunately, since backpointers from inodes to dirents were recently added we
can take a different approach. We walk the inodes btree, and for every directory
inode we find we chase backpointers until we get to the root directory. Note
that since previous passes have already checked backpointers, we're really just
checking for loops here. This will still check the same structure repeatedly,
like the DFS approach, but here we can solve that by adding every inum:snapshot
tuple to a hash table after we've verified it has a path to the root.
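
A sketch of that walk, with the backpointer lookup and the hash table reduced
to hypothetical stubs:

    #include <stdbool.h>

    #define MAX_DEPTH       256     /* arbitrary bound for the sketch */

    /* Hypothetical stand-ins for the backpointer lookup and hash table: */
    unsigned long long parent_dir(unsigned long long inum, unsigned snapshot);
    bool seen_lookup(unsigned long long inum, unsigned snapshot);
    void seen_insert(unsigned long long inum, unsigned snapshot);

    static bool on_path(const unsigned long long *path, unsigned depth,
                        unsigned long long inum)
    {
        for (unsigned i = 0; i < depth; i++)
            if (path[i] == inum)
                return true;
        return false;
    }

    /*
     * Chase backpointers from one directory towards the root, stopping
     * early at any inum:snapshot already verified. Backpointers were
     * checked by earlier passes, so the only failure left is a loop;
     * on success, mark the whole path verified so it's never rewalked.
     */
    static bool check_path_to_root(unsigned long long root_inum,
                                   unsigned long long inum, unsigned snapshot)
    {
        unsigned long long path[MAX_DEPTH];
        unsigned depth = 0;

        while (inum != root_inum && !seen_lookup(inum, snapshot)) {
            if (depth == MAX_DEPTH || on_path(path, depth, inum))
                return false;   /* directory structure loop */
            path[depth++] = inum;
            inum = parent_dir(inum, snapshot);
        }

        for (unsigned i = 0; i < depth; i++)
            seen_insert(path[i], snapshot);
        return true;
    }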

## Checking inode link counts:

This one will be tedious, but algorithmically doesn't change much. The existing
code walks the dirents btree in natural key order, and then for each dirent
increments a link count in a radix tree. After walking the dirents, we walk the
inodes btree and compare link counts we calculated to what's in the inode.

As before, we'll walk the dirents btree in natural key order. However, we can't
accumulate link counts without also looking at what actually exists in the inodes
btree: a dirent may point to multiple inodes with different snapshot IDs, and
which of those we increment also depends on which snapshots that dirent is
visible in (we'll already have a helper for that part, as discussed previously).

This means that instead of using a radix tree, we'll have to start by walking
the inodes btree, and for every inode we find, create a new in-memory data
structure (an eytzinger layout search tree, perhaps) indexed by inode
number:snapshot ID and containing just the link count we'll be calculating.

Fortunately, hardlinks are rare in practice. We can make use of the fact that
we've already checked inode backpointers, and only include inodes that actually
have hardlinks in this table.
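
A sketch of that table and its two operations - shown here as a plain sorted
array with `bsearch()` rather than the eytzinger layout suggested above:

    #include <stdlib.h>

    /* One entry per hardlinked inode version found in the inodes btree. */
    struct nlink {
        unsigned long long      inum;
        unsigned                snapshot;
        unsigned                count;  /* link count computed from dirents */
    };

    static int nlink_cmp(const void *_l, const void *_r)
    {
        const struct nlink *l = _l, *r = _r;

        if (l->inum != r->inum)
            return l->inum < r->inum ? -1 : 1;
        return l->snapshot < r->snapshot ? -1
             : l->snapshot > r->snapshot ?  1 : 0;
    }

    /* The table is built sorted by inum:snapshot, so lookups are a
     * binary search (an eytzinger layout would just improve locality). */
    static struct nlink *nlink_find(struct nlink *t, size_t nr,
                                    unsigned long long inum, unsigned snapshot)
    {
        struct nlink k = { .inum = inum, .snapshot = snapshot };

        return bsearch(&k, t, nr, sizeof(k), nlink_cmp);
    }

    /* Called once per dirent, per inode version the dirent is visible
     * in; inodes absent from the table aren't hardlinked and were
     * already checked via their backpointer. */
    static void nlink_inc(struct nlink *t, size_t nr,
                          unsigned long long inum, unsigned snapshot)
    {
        struct nlink *l = nlink_find(t, nr, inum, snapshot);

        if (l)
            l->count++;
    }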