Fsck.mdwn


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125


Fsck in bcachefs:
=================

This document is an overview of fsck in bcachefs, as well as what will need to
change for snapshots.

In bcachefs, the fsck code is only responsible for checking higher level
filesystem structure. Btree topology and allocation information are checked by
`btree_gc.c`, which doesn't need to be aware of snapshots. This reduces the
scope of fsck considerably.

What it does have to check is:

* For each extent, dirent and xattr, that a corresponding inode exists and is of
  the appropriate type.

* Dirents and xattrs indexed by hash, with linear probing. The fsck code is
  responsible for checking that they're stored at the correct index (in case
  there was a hash collision, every index between the hash value and the actual
  index must be occupied by keys or hash whiteouts) and that there are no
  duplicates.

* Dirents: check that target inode exists and matches dirent's `d_type`, and
  check the target inode's backpointer fields.

* Extents:
 * Check that extents do not overlap
 * Check that extents do not exist past inode's `i_size`
 * Count extents for each inode and check inode's `i_sectors`.


* Check that root directory exists

* Check that lost+found exists

* Check directory stucture: currently, this is done with a depth-first traversal
  from the root directory. We check for directories with multiple links, and
  then scan the inodes btree for directories that weren't found by the DFS.

* Check each inode's link count: At this point we know that every dirent exists
  in a directory (has a corresponding inode of the correct type), has a valid
  target, and is reachable (because we've checked that all directories are
  reachable). All that's left is to scan the dirents btree, counting the number
  of dirents that point to each inode number, and verifying that against each
  inode's link count.


Changes for snapshots:
======================

## References between keys:

Filesystem code operates within a subvolume: at the start of a btree transaction
it looks up up the subvolume's snapshot ID, and passes that to the btree
iterator code so that lookups always return the newest key visible to that
snapshot ID.

The fsck code does not operate within a specific subvolume or snapshot ID - we
have to iterate over and process keys in natural btree order, for performance.
From fsck's point of view, any key that points to another key may now point to
multiple versions of that key in different snapshots.

For example, given a dirent that points to an inode, we search for the first
inode with a snapshot ID equal to or an ancestor of the snapshot ID of the
dirent - a normal `BTREE_ITER_FILTER_WHITEOUTS` lookup, but using the snapshot
ID of the dirent, not the subvolume - this key should exist, if the dirent
points to a valid inode. Additionally, the dirent also references inodes at that
position in snapshot IDs that are descendends of the dirent's snapshot ID -
provided a newer dirent in a snapshot ID equal to or ancestor of the inode's
snapshot ID doesn't exist.

## Repair:

If we repair an inconsistency by deleting a key, we need to ensure that key is
not referenced by child snapshots. For example, if we find an extent that does
not have an `S_ISREG` inode, currently we delete it. But we might find an extent
that doesn't have a corresponding inode in with a snapshot ID equal to or
ancestor to the extents snapshot ID, but does have an inode in a child snapshot.
If that were to occur, we should probably copy that inode to the extent's
snapshot ID.

We would like it to be as hard a rule as possible that fsck never deletes or
modifies keys that belong to interior node snapshot IDs, because when people are
using snapshots they'll generally want the data in that snapshot to always stay
intact and it would be rather bad form if data was torched due to a bug in fsck.

It may not be possible to adhere to this rule while fixing all filesystem
inconsistencies - e.g. if extents exist above an inode's `i_size`, then
generally speaking the right thing to do is to delete that extent. But if we
find inconsistencies in at interior node snapshot IDs, then perhaps the best
thing to do would be to just leave it, or only apply changes in leaf node
snapshot IDs. 

Algorithmic considerations:
===========================

We want bcachefs to scale well into the millions of snapshots. This means that
there can't generally be operations that have to run for each subvolume or
snapshot - the fsck algorithms have to work one key at a time, processing them
in natural btree order.

This means that the extents, dirents and xattrs passes need to use
`BTREE_ITER_ALL_SNAPSHOTS`, meaning that as they walk keys for a given inode,
they'll be seeing keys from every snapshot version mixed together - and there
will be multiple versions of the inode, in different snapshots, that they need
to be checked against.

There's a helper in the fsck code that these passes use, `walk_inode()`, that
fetches the inode for the current inode number when it changes. This will
probably change to load every copy of that inode, in each snapshot ID. Also,
where we have to maintain state we'll have to duplicate that state for every
snapshot ID that exists at that inode number.

We'll also need to know, for every key that we process, which snapshot IDs it is
visible in: so that e.g. the extents pass can correctly count `i_sectors`.

Note that "millions of snapshots" will not _typically_ mean "millions of snaphot
IDs at the same time" - when we allocate new inodes we pick inode numbers that
aren't in use in any other snapshot or subvolume, and most files in a filesytem
are created, written to once and then later atomically replaced with a rename
call. When we're dealing with a file that keys with many different snapshot IDs
we need to make sure we don't have any hidden O(n^2) algorithms, but if fsck
requires in the hundreds of megabytes of state to check these files with worst
case numbers of overwrites, that's not the end of the world.