Diffstat (limited to 'Allocator.mdwn')
-rw-r--r--  Allocator.mdwn | 55
1 file changed, 55 insertions, 0 deletions
diff --git a/Allocator.mdwn b/Allocator.mdwn
new file mode 100644
index 0000000..9bad84a
--- /dev/null
+++ b/Allocator.mdwn
@@ -0,0 +1,55 @@
+It's past time for the allocator to be redesigned, or at least heavily
+reconsidered. Nothing should be off the table at this point - we could
+conceivably even move away from bucket-based allocation.
+
+Here's a list of considerations:
+
+* The bucket-based allocator does have advantages. It's fast - most of the work
+  in the allocator is per bucket, and buckets are typically 512k to 2M, even up
+  to 8M. Metadata is one key per bucket, and the alloc btree uses the btree key
+  cache, which is very good for performance - it means extents btree updates
+  don't have to do a second (relatively expensive) btree update to update alloc
+  info; they're only updating the key cache.
+
+* The bucket-based allocator also lets us store and cheaply discard cached
+  data, by incrementing the generation number for that bucket.
+
+* Disadvantage: the bucket allocator requires us to have copygc, with everything
+ that entails.
+
+* It would be really nice to have backpointers at some point - copygc would
+ greatly benefit from having backpointers (currently we have to scan the entire
+ extents and reflink btrees to find data that needs to be moved), and erasure
+ coding could also make good use of backpointers.
+
+ If we added another btree for backpointers, it wouldn't make sense for it to
+ use the btree key cache, and thus extents btree updates would get more
+ expensive. Perhaps alloc keys could include a list of backpointers? They'd be
+ 32 bytes each, and bkeys can be a little short of 2k - that's only ~50
+ backpointers per bkey, which we'd easily overflow with small writes and large
+  buckets. If a secondary backpointers btree were used only as a fallback, it
+ would be getting used right when the performance impact matters most - small
+ writes.
+
+  We should investigate the actual performance overhead of adding another btree
+  update for backpointers - it might not be all that bad. Right now we're
+  looking at ~450k random updates per second on a single core, and that
+  includes the full `bch2_trans_commit()` path; the most expensive things in
+  that path are per transaction commit, not per btree update (e.g. getting the
+  journal reservation). Also, we'll hopefully be getting gap buffers soon,
+  which will improve btree update performance.
+
+The most pressing consideration, though, is that we need to eliminate, as much
+as possible, the periodic scanning the allocator thread(s) have to do to find
+available buckets. It's too easy to end up with bugs where this scanning
+happens repeatedly (we have one now...); it's also a scalability issue, and we
+shouldn't be doing it at all.
+
+Also, we would really like to get rid of the in-memory array of buckets; this
+is another scalability issue. The in-memory `struct bucket` now mostly just
+mirrors what's in the btree, in `KEY_TYPE_alloc` keys. The one exception is
+`journal_seq`, the journal sequence number of the last btree update that
+touched that bucket. It's needed to know whether we need to flush the journal
+before allocating and writing to that bucket - perhaps it should just be added
+to `struct bch_alloc` and stored in the keys in the btree.
+