summaryrefslogtreecommitdiff
path: root/BtreeWhiteouts.mdwn
blob: b75646ae96dd3687a946ec41bb8b52f2c007e16d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
Whiteout optimizations:
=======================

bcachefs btree nodes are log structured, and thus are also effectively a hybrid
compacting data structure. Most btree node keys are in large sorted lists of
keys that are too big to be efficiently inserted to or deleted from - but since
we're not doing random updates on them, we can build special data structures to
highly accelerate lookups.

Since we usually can't delete or update in place an existing key, this means we
need whiteouts, but whiteouts add their own complications. When we generate a
whiteout because we overwrite or delete an existing key, we may or may not need
to keep that whiteout around and write it out to disk: if the key we overwrote
had been written out to disk we need to ensure that we write something out to
disk that overwrites it. If we're updating that key - i.e. inserting a new
version of that key - then writing out the new key is sufficient, but if we were
deleting we need to ensure that that whiteout is kept around and written out to
disk.

To track this we have the flag `bkey.needs_whiteout`: keys have this flag set
when they've been written out to disk, or when they overwrote something that was
written out to disk.

Additionally: as mentioned, some whiteouts need to be retained until the next
btree node write, but we don't want to keep them mixed in with the rest of the
keys in a btree node where they'd have to be skipped over when we're iterating
over live keys. Therefore, when a given bset in a btree node has too many
whiteouts (as a fraction of the total amount of data in that bset), we do a
compact operation that drops whiteouts, saving the ones that need to be written
where the next btree node write will find it.

Extents
-------

Extents add their own complications. `KEY_TYPE_deleted` is used for whiteouts
for normal keys, and it's also used for extents that have been trimmed to 0
size. But a 0 size extent is meaningless - these can always be dropped.
`KEY_TYPE_discard` is used for nonzero size extent whiteouts - i.e. whiteouts
that actually do overwrite other extents.