It's past time for the allocator to be redesigned, or at least heavily
reconsidered. Nothing should be off the table at this point - we could
conceivably even move away from bucket based allocation.

Here's a list of considerations:

* The bucket based allocator does have advantages. It's fast - most of the work
  in the allocator is done per bucket, and buckets are typically 512k to 2M,
  sometimes up to 8M. Metadata is one key per bucket, and the alloc btree uses
  the btree key cache, which is very good for performance - it means extents
  btree updates don't have to do a second (relatively expensive) btree update to
  update alloc info; they only update the key cache.

* The bucket based allocator also lets us store and cheaply discard cached data,
  by incrementing the generation number for that bucket (see the sketches after
  this list).

* Disadvantage: the bucket allocator requires us to have copygc, with everything
  that entails.

* It would be really nice to have backpointers at some point - copygc would
  greatly benefit from having backpointers (currently we have to scan the entire
  extents and reflink btrees to find data that needs to be moved), and erasure
  coding could also make good use of backpointers.

  If we added another btree for backpointers, it wouldn't make sense for it to
  use the btree key cache, and thus extents btree updates would get more
  expensive. Perhaps alloc keys could include a list of backpointers? They'd be
  32 bytes each, and bkeys can be a little short of 2k - that's only ~50
  backpointers per bkey (see the sizing sketch after this list), which we'd
  easily overflow with small writes and large buckets. If a secondary
  backpointers btree was just used as a fallback, it
  would be getting used right when the performance impact matters most - small
  writes.

  We should investigate the actual performance overhead of adding another btree
  update for backpointers - it might not be all that bad. Right now we're
  looking at ~450k random updates per second on a single core - that includes
  the full `bch2_trans_commit()` path, and the most expensive parts of that path
  are per transaction commit, not per btree update (e.g. getting the journal
  reservation). Also, we'll hopefully be getting gap buffers soon, which will
  improve btree update performance.
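
To make the generation-number idea above concrete, here's a minimal sketch of
per-bucket metadata and pointer validity. The struct layouts and function names
are illustrative, not the actual bcachefs definitions:

    #include <linux/types.h>

    /* Illustrative only - not the real bcachefs structures. */
    struct bucket_alloc_info {
            u8  gen;            /* bumped to invalidate cached data */
            u8  data_type;      /* none, cached, dirty user data, metadata, ... */
            u32 dirty_sectors;
            u32 cached_sectors;
    };

    struct data_ptr {
            u64 bucket;         /* which bucket the data lives in */
            u8  gen;            /* copy of the bucket's gen at write time */
    };

    /*
     * A cached pointer is valid only while its gen matches the bucket's
     * current gen; bumping the bucket's gen invalidates every cached
     * pointer into it without touching the extents btree.
     */
    static bool ptr_stale(const struct bucket_alloc_info *b,
                          const struct data_ptr *ptr)
    {
            return ptr->gen != b->gen;
    }

    static void invalidate_cached_data(struct bucket_alloc_info *b)
    {
            b->gen++;
            b->cached_sectors = 0;
    }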
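
And a back-of-the-envelope check of the inline backpointer sizing discussed
above - the 32 byte figure and the ~2k bkey limit come from the text, the field
layout is purely hypothetical:

    #include <linux/types.h>

    /* Hypothetical 32 byte backpointer: records who points into a bucket. */
    struct backpointer {
            u64 bucket_offset;          /* offset of the data within the bucket */
            u64 btree_id_and_level;     /* which btree the referencing key is in */
            u64 pos_inode;              /* position of the referencing extent, */
            u64 pos_offset;             /* so copygc can look it up directly */
    };                                  /* 4 * 8 = 32 bytes */

    /*
     * A bkey value a little short of 2k gives 2048 / 32 = 64 slots, minus
     * header overhead - roughly the ~50 quoted above.  A 2M bucket filled
     * with 4k writes needs 512 backpointers, so small writes would overflow
     * an inline list almost immediately.
     */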

The most pressing consideration, though, is that we need to eliminate, as much
as possible, the periodic scanning the allocator thread(s) have to do to find
available buckets. It's too easy to end up with bugs where this scanning happens
repeatedly (we have one now...), and it's a scalability issue - we shouldn't be
doing it at all.

Also, we would really like to get rid of the in-memory array of buckets; this is
another scalability issue. The in-memory `struct bucket` now mostly just mirrors
what's in the btree, in `KEY_TYPE_alloc` keys. The one exception is
`journal_seq`, which is the journal sequence number of the last btree update
that touched that bucket. It's needed for knowing whether we need to flush the
journal before allocating and writing to that bucket - perhaps it should just be
added to `struct bch_alloc` and stored in the keys in the btree.
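
As a rough illustration of that last point, here's what the check might look
like if `journal_seq` lived in the alloc key itself. All names below are
hypothetical - this is a sketch of the idea, not current bcachefs code:

    #include <linux/types.h>

    /* Hypothetical alloc key value with journal_seq folded in. */
    struct bch_alloc_example {
            u8  gen;
            u8  data_type;
            u32 dirty_sectors;
            u32 cached_sectors;
            u64 journal_seq;    /* last btree update that touched this bucket */
    };

    int flush_journal_to(u64 seq);  /* stand-in for whatever flush API we'd use */

    /*
     * Before handing a bucket back out, make sure the btree update that
     * freed it has made it to the journal on disk; otherwise a crash could
     * replay journal entries that still reference data we've overwritten.
     */
    static int prepare_bucket_for_reuse(const struct bch_alloc_example *a,
                                        u64 journal_flushed_seq)
    {
            if (a->journal_seq > journal_flushed_seq)
                    return flush_journal_to(a->journal_seq);
            return 0;
    }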