It's past time for the allocator to be redesigned, or at least heavily
reconsidered. Nothing should be off the table at this point - we could
conceivably even move away from bucket based allocation.
Here's a list of considerations:
* The bucket based allocator does have advantages. It's fast - most of the work
in the allocator is per bucket, and buckets are typically 512k to 2M, even up to
8M. Metadata is one key per bucket, and the alloc btree uses the btree key
cache, which is very good for performance: extents btree updates don't have to
do a second (relatively expensive) btree update to update alloc info, they only
update the key cache.
* The bucket based allocator also lets us store and cheaply discard cached data,
by incrementing the generation number for that bucket (a short sketch of the
stale-pointer check follows this list).
* Disadvantage: the bucket allocator requires us to have copygc, with everything
that entails.
* It would be really nice to have backpointers at some point - copygc would
greatly benefit from having backpointers (currently we have to scan the entire
extents and reflink btrees to find data that needs to be moved), and erasure
coding could also make good use of backpointers.
If we added another btree for backpointers, it wouldn't make sense for it to
use the btree key cache, and thus extents btree updates would get more
expensive. Perhaps alloc keys could include a list of backpointers? They'd be
32 bytes each, and bkeys can be a little short of 2k - that's only ~50
backpointers per bkey, which we'd easily overflow with small writes and large
buckets (the space math is sketched after this list). If a secondary
backpointers btree were used only as a fallback, it would be getting used right
when the performance impact matters most - small writes.
We should investigate the actual performance overhead of adding another btree
update for backpointers - it might not be all that bad. Right now we're looking
at ~450k random updates per second on a single core, and that includes the full
`bch2_trans_commit()` path; the most expensive parts of that path are per
transaction commit, not per btree update (e.g. getting the journal reservation).
Also, we'll hopefully be getting gap buffers soon, which will improve btree
update performance.
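
To make the generation number trick concrete, here's a minimal sketch of how a
stale cached pointer can be detected. The types and names (`example_bucket`,
`example_ptr_stale`, ...) are made up for illustration and are not the actual
bcachefs definitions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative types only - not the real bcachefs definitions. */
struct example_bucket {
	uint8_t gen;		/* current generation of the bucket */
};

struct example_cached_ptr {
	uint64_t bucket;	/* index of the bucket the data lives in */
	uint8_t  gen;		/* bucket generation when the data was written */
};

/* A cached pointer is stale once its bucket has been invalidated. */
static bool example_ptr_stale(const struct example_bucket *b,
			      const struct example_cached_ptr *ptr)
{
	return ptr->gen != b->gen;
}

int main(void)
{
	struct example_bucket b = { .gen = 3 };
	struct example_cached_ptr p = { .bucket = 0, .gen = b.gen };

	printf("stale before invalidate: %d\n", example_ptr_stale(&b, &p));

	/* Discarding all cached data in the bucket is a single increment: */
	b.gen++;
	printf("stale after invalidate:  %d\n", example_ptr_stale(&b, &p));
	return 0;
}
```

Discarding everything cached in a bucket costs one increment; no per-extent work
is needed until the stale pointers are eventually noticed and dropped.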
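
And here's the space math behind the ~50 figure, as a compilable sketch. The
`example_backpointer` layout and the assumption of roughly 1600 bytes of usable
value space per alloc key are illustrative guesses, not actual bcachefs
structures:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical 32-byte inline backpointer record - not an actual bcachefs
 * structure, just an illustration of the size budget discussed above.
 */
struct example_backpointer {
	uint64_t btree_id_and_level;	/* which btree the referring key is in */
	uint64_t pos_inode;		/* position of the referring key */
	uint64_t pos_offset;
	uint64_t bucket_offset_and_len;	/* where in the bucket the data lands */
};

int main(void)
{
	/*
	 * Assumption: of the ~2k maximum bkey size, roughly 1600 bytes of
	 * value space remain after the key header and existing alloc fields.
	 */
	const unsigned max_val_bytes = 1600;

	printf("inline backpointers per alloc key: ~%zu\n",
	       max_val_bytes / sizeof(struct example_backpointer));	/* ~50 */

	/* A 2M bucket filled with 4k writes needs far more than that: */
	printf("4k writes in a 2M bucket: %d\n",
	       (2 * 1024 * 1024) / (4 * 1024));				/* 512 */
	return 0;
}
```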
The most pressing consideration though is that we need to eliminate, as much as
possible, the periodic scanning the allocator thread(s) have to do to find
available buckets. It's too easy to end up with bugs where this scanning happens
repeatedly (we have one now...), it's a scalability issue, and we shouldn't be
doing it at all.
Also, we would really like to get rid of the in-memory array of buckets - this
is another scalability issue. The in-memory `struct bucket` now mostly just
mirrors what's in the btree, in `KEY_TYPE_alloc` keys. The one exception is
`journal_seq`, the journal sequence number of the last btree update that touched
that bucket. It's needed to know whether we have to flush the journal before
allocating and writing to that bucket - perhaps it should just be added to
`struct bch_alloc` and stored in the keys in the btree.
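
As a rough sketch of what that would look like: the struct, field, and helper
names below are hypothetical, and the journal interface is stubbed out rather
than taken from the actual bcachefs code.

```c
#include <stdint.h>
#include <stdio.h>

/* Stub journal state so the sketch is self-contained: */
static uint64_t flushed_seq = 100;

static uint64_t example_journal_flushed_seq(void)	{ return flushed_seq; }
static void	example_journal_flush_seq(uint64_t seq)	{ flushed_seq = seq; }

/*
 * Hypothetical per-bucket alloc info, with journal_seq stored alongside the
 * other fields instead of in the in-memory bucket array.
 */
struct example_alloc_info {
	uint8_t  gen;
	uint32_t dirty_sectors;
	uint64_t journal_seq;	/* seq of the last btree update touching this bucket */
};

static void example_prepare_bucket_for_reuse(const struct example_alloc_info *a)
{
	/*
	 * Before allocating and writing to the bucket, the journal entries
	 * that last referenced it must be on disk - otherwise a crash could
	 * replay updates pointing into data we've since overwritten.
	 */
	if (a->journal_seq > example_journal_flushed_seq())
		example_journal_flush_seq(a->journal_seq);
}

int main(void)
{
	struct example_alloc_info a = { .gen = 1, .dirty_sectors = 0, .journal_seq = 123 };

	example_prepare_bucket_for_reuse(&a);
	printf("journal flushed up to seq %llu\n", (unsigned long long)flushed_seq);
	return 0;
}
```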