Diffstat (limited to 'Allocator.mdwn')
-rw-r--r--  Allocator.mdwn | 55
1 file changed, 55 insertions, 0 deletions
diff --git a/Allocator.mdwn b/Allocator.mdwn
new file mode 100644
index 0000000..9bad84a
--- /dev/null
+++ b/Allocator.mdwn
@@ -0,0 +1,55 @@
+It's past time for the allocator to be redesigned, or at least heavily
+reconsidered. Nothing should be off the table at this point - we could
+conceivably even move away from bucket-based allocation.
+
+Here's a list of considerations:
+
+* The bucket-based allocator does have advantages. It's fast - most of the work
+  in the allocator is per bucket, and buckets are typically 512k to 2M, even up
+  to 8M. Metadata is one key per bucket, and the alloc btree uses the btree key
+  cache, which is very good for performance - it means extents btree updates
+  don't have to do a second (relatively expensive) btree update to update alloc
+  info; they're only updating the key cache.
+
+* The bucket-based allocator also lets us store and cheaply discard cached
+  data, by incrementing the generation number for that bucket.
+
+* Disadvantage: the bucket allocator requires us to have copygc, with everything
+ that entails.
+
+* It would be really nice to have backpointers at some point - copygc would
+ greatly benefit from having backpointers (currently we have to scan the entire
+ extents and reflink btrees to find data that needs to be moved), and erasure
+ coding could also make good use of backpointers.
+
+ If we added another btree for backpointers, it wouldn't make sense for it to
+ use the btree key cache, and thus extents btree updates would get more
+ expensive. Perhaps alloc keys could include a list of backpointers? They'd be
+ 32 bytes each, and bkeys can be a little short of 2k - that's only ~50
+ backpointers per bkey, which we'd easily overflow with small writes and large
+  buckets. If a secondary backpointers btree were used only as a fallback, it
+ would be getting used right when the performance impact matters most - small
+ writes.
+
+  We should investigate the actual performance overhead of adding another btree
+  update for backpointers - it might not be all that bad. Right now we're
+  looking at ~450k random updates per second on a single core, and that
+  includes the full `bch2_trans_commit()` path; the most expensive things in
+  that path are per transaction commit, not per btree update (e.g. getting the
+  journal reservation). Also, we'll hopefully be getting gap buffers soon,
+  which will improve btree update performance.
+
+The most pressing consideration, though, is that we need to eliminate, as much
+as possible, the periodic scanning the allocator thread(s) have to do to find
+available buckets. It's too easy to end up with bugs where this scanning
+happens repeatedly (we have one now...); it's also a scalability issue, and we
+shouldn't be doing it at all.
+
+Also, we would really like to get rid of the in-memory array of buckets; this
+is another scalability issue. The in-memory `struct bucket` now mostly just
+mirrors what's in the btree, in `KEY_TYPE_alloc` keys. The one exception is
+`journal_seq`, the journal sequence number of the last btree update that
+touched that bucket. It's needed to know whether we need to flush the journal
+before allocating and writing to that bucket - perhaps it should just be added
+to `struct bch_alloc` and stored in the keys in the btree.
+