From dced3a18e426284548b6658d033ce15a78c13adc Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Wed, 7 Apr 2021 01:15:45 -0400
Subject: fsck & allocator design docs

---
 Allocator.mdwn | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)
 create mode 100644 Allocator.mdwn

diff --git a/Allocator.mdwn b/Allocator.mdwn
new file mode 100644
index 0000000..9bad84a
--- /dev/null
+++ b/Allocator.mdwn
@@ -0,0 +1,55 @@
+It's past time for the allocator to be redesigned, or at least heavily
+reconsidered. Nothing should be off the table at this point - we could
+conceivably even move away from bucket based allocation.
+
+Here's a list of considerations:
+
+* The bucket based allocator does have advantages. It's fast - most of the
+  work in the allocator is per bucket, and buckets are typically 512k to 2M,
+  even up to 8M. Metadata is one key per bucket, and the alloc btree uses the
+  btree key cache, which is very good for performance - it means extents
+  btree updates don't have to do a second (relatively expensive) btree update
+  to update alloc info; they only update the key cache.
+
+* The bucket based allocator also lets us store and cheaply discard cached
+  data, by incrementing the generation number for that bucket (see the first
+  sketch after this list).
+
+* Disadvantage: the bucket allocator requires us to have copygc, with
+  everything that entails.
+
+* It would be really nice to have backpointers at some point - copygc would
+  greatly benefit from them (currently we have to scan the entire extents and
+  reflink btrees to find data that needs to be moved), and erasure coding
+  could also make good use of them.
+
+  If we added another btree for backpointers, it wouldn't make sense for it
+  to use the btree key cache, and thus extents btree updates would get more
+  expensive. Perhaps alloc keys could include a list of backpointers? They'd
+  be 32 bytes each, and bkeys can be a little short of 2k - that's only ~50
+  backpointers per bkey (see the second sketch below), which we'd easily
+  overflow with small writes and large buckets. If a secondary backpointers
+  btree were used only as a fallback, it would get used right when the
+  performance impact matters most - with small writes.
+
+  We should investigate the actual performance overhead of adding another
+  btree update for backpointers - it might not be all that bad. Right now
+  we're looking at ~450k random updates per second on a single core, and
+  that includes the full `bch2_trans_commit()` path; the most expensive
+  things in that path are per transaction commit, not per btree update
+  (e.g. getting the journal reservation). Also, we'll hopefully be getting
+  gap buffers soon, which will improve btree update performance.
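+
+A minimal sketch of the cheap cached-data discard described in the second
+bullet, assuming illustrative types rather than the real bcachefs
+definitions: a pointer is only valid while its embedded generation number
+matches the bucket's, so dropping every cached pointer into a bucket is a
+single increment.
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Illustrative types, not the real bcachefs definitions: */
+    struct bucket     { uint8_t gen; };
+    struct extent_ptr { uint8_t gen; };
+
+    /* A pointer is stale if the bucket's gen has moved past the gen
+     * recorded in the pointer; gens are 8 bits and wrap, so compare
+     * with a signed difference. */
+    static bool ptr_stale(const struct bucket *b,
+                          const struct extent_ptr *ptr)
+    {
+            return (int8_t) (b->gen - ptr->gen) > 0;
+    }
+
+    /* Discarding all cached data in a bucket is just an increment -
+     * every cached pointer into it becomes stale at once: */
+    static void bucket_invalidate_cached(struct bucket *b)
+    {
+            b->gen++;
+    }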
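+
+And a sketch of the inline-backpointer sizing arithmetic from the last
+bullet. The layout is hypothetical - the struct, its field names, and the
+~100 bytes assumed for the rest of the alloc key are illustrative, not an
+existing on-disk format:
+
+    #include <stdint.h>
+
+    /* Hypothetical inline backpointer entry - illustrative only. It
+     * has to identify the extent pointing into this bucket: which
+     * btree it lives in, the key's position, and where in the bucket
+     * the data lands. */
+    struct alloc_backpointer {
+            uint64_t bucket_offset; /* sector offset within the bucket */
+            uint32_t btree_id;      /* extents or reflink */
+            uint32_t gen;           /* bucket gen this pointer is valid for */
+            uint64_t inode;         /* position of the owning extent */
+            uint64_t offset;
+    };                              /* 8 + 4 + 4 + 8 + 8 = 32 bytes */
+
+    /* bkey values max out a little short of 2k; assuming ~100 bytes of
+     * bkey header and other alloc fields, that leaves roughly
+     * (2048 - ~100) / 32 ~= 60 entries, i.e. on the order of the ~50
+     * backpointers per key quoted above. */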
+
+The most pressing consideration, though, is that we need to eliminate, as
+much as possible, the periodic scanning the allocator thread(s) have to do
+to find available buckets. It's too easy to end up with bugs where this
+scanning happens repeatedly (we have one now...); it's also a scalability
+issue, and we shouldn't be doing it at all.
+
+Also, we would really like to get rid of the in-memory array of buckets;
+this is another scalability issue. The in-memory `struct bucket` now mostly
+just mirrors what's in the btree, in `KEY_TYPE_alloc` keys. The one
+exception is `journal_seq`, the journal sequence number of the last btree
+update that touched that bucket. It's needed to know whether we have to
+flush the journal before allocating that bucket and writing to it - perhaps
+it should just be added to `struct bch_alloc` and stored in the keys in the
+btree.
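+
+A sketch of that check, again with illustrative names rather than the real
+bcachefs interfaces (`journal_flush_seq()` here is an assumed helper):
+
+    #include <stdint.h>
+
+    /* Illustrative types, not the real bcachefs definitions: */
+    struct bucket {
+            uint64_t journal_seq;   /* last btree update touching this bucket */
+    };
+    struct journal {
+            uint64_t flushed_seq;   /* highest seq known to be on disk */
+    };
+
+    int journal_flush_seq(struct journal *j, uint64_t seq); /* assumed helper */
+
+    /* Before writing new data into a bucket, the journal entry for the
+     * update that freed it must be persistent - otherwise, after a
+     * crash, the journal could still describe old pointers into the
+     * bucket's new contents. */
+    static int prepare_bucket_for_write(struct journal *j, struct bucket *b)
+    {
+            return b->journal_seq > j->flushed_seq
+                    ? journal_flush_seq(j, b->journal_seq)
+                    : 0;
+    }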