summaryrefslogtreecommitdiff
path: root/BtreeIterators.mdwn
diff options
context:
space:
mode:
authorKent Overstreet <kent.overstreet@gmail.com>2020-01-09 20:47:41 -0500
committerKent Overstreet <kent.overstreet@gmail.com>2020-01-09 20:47:41 -0500
commit8db496a95bc5ba9c618a4c9c250d4433ecdf68f3 (patch)
treed0a2d6bb9e261cb9cec433accdaf06ef857e3e54 /BtreeIterators.mdwn
parentd29fc4f7f90184f21a4fd6277d00634edfed5a88 (diff)
Btree iterators doc
Diffstat (limited to 'BtreeIterators.mdwn')
-rw-r--r--BtreeIterators.mdwn72
1 files changed, 72 insertions, 0 deletions
diff --git a/BtreeIterators.mdwn b/BtreeIterators.mdwn
new file mode 100644
index 0000000..6ae3cf7
--- /dev/null
+++ b/BtreeIterators.mdwn
@@ -0,0 +1,72 @@
+# Bcachefs btree iterators:
+
+Btree iterators in bcachefs implement more or less the usual operations for
+iterating over an ordered key/value store.
+
+Iterators may be unlocked arbitrarily, due to e.g. transaction restarts (due to
+lock ordering, failure to get a journal reservation, btree node splits, etc.).
+This means they must remember what they were pointing at, so that subsequent
+peek/next etc. operations behave correctly. Also, a given btree transaction may
+have multiple iterators in use at the same time; updates to via one iterator
+must not invalidate other iterators (but this is a relatively easy requirement
+compared to getting restartable iterators correct).
+
+Iterators may be used to iterate over keys or slots; for slots we iterate over
+every possible position, returning `KEY_TYPE_deleted` keys for empty slots. They
+can iterate forwards or backwards, and they can be used to iterate over extents
+or non-extents. Extents always have a nonzero size and have start and end
+positions; iterating over extents in slots mode (where we synthesize ranges to
+represent holes between keys in the btree) is used frequently by the filesystem
+IO code in bcachefs.
+
+## Primary iterator state:
+
+ * `iter.pos`: the current position of the iterator.
+
+ The iterator position may be changed directly with
+ `bch2_btree_iter_set_pos()`, and the various peek()/next()/prev() operations
+ all update the iterator position to match what they are returning.
+
+ * `iter.l[BTREE_MAX_DEPTH]` (struct `btree_iter_level`)
+
+ These contain, one for each level of the btree, a pointer to a btree node, an
+ iterator for that btree node (struct `btree_node_iter`, see bset.h), and a
+ lock sequence number. Thus, we have a full path from the root of the btree down
+ to a specific key in a specific btree leaf node.
+
+ Btree node iterators are much simpler: they can be treated as if they point
+ to a single key within a sorted array of keys. Because btree nodes have
+ multiple sorted arrays of code btree node iterators are internally
+ maintaining multiple pointers and sorting the results, but this is an
+ internal implementation detail.
+
+The most important invariant for btree iterators is that `iter.pos` must always
+be consistent with the btree node iterators: that is, for every btree node
+iterator, the key the node iterator points to must compare >= `iter.pos`, and
+the previous key must compare < `iter.pos`.
+
+That is - the node iterator must point to the correct position for inserting a
+new key at `iter.pos`.
+
+For the various peek()/next()/prev() operations, `iter.pos` will _usually_ be
+updated to point to the start of the key we're returning, and the leaf btree
+node iterator will _usually_ point to the key we're returning, unless we're at a
+hole - but there are exceptions for extents.
+
+## Extents:
+
+When returning an extent, `iter.pos` will typically be updated to point to the
+start of the extent we're returning (which is not the key the extent is indexed
+by; extents are indexed by where they end, not where they start).
+
+This may mean that when we return an extent, the leaf btree node iterator won't
+necessarily point to the key we're returning - due to our #1 invariant, and the
+possible existence of whiteouts (that we filter out when iterating and don't
+return to user code).
+
+Additionally, iter.pos won't be updated to point to the start of the extent
+we're returning when that would move iter.pos backwards (unless we're actually
+iterating backwards); this is because user code wants to see a monotonically
+increasing iterator position (as user code often uses `iter.pos` to track how
+far we've gotten/how much work we have to do).
+