From 8db496a95bc5ba9c618a4c9c250d4433ecdf68f3 Mon Sep 17 00:00:00 2001 From: Kent Overstreet Date: Thu, 9 Jan 2020 20:47:41 -0500 Subject: Btree iterators doc --- BtreeIterators.mdwn | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 72 insertions(+) create mode 100644 BtreeIterators.mdwn (limited to 'BtreeIterators.mdwn') diff --git a/BtreeIterators.mdwn b/BtreeIterators.mdwn new file mode 100644 index 0000000..6ae3cf7 --- /dev/null +++ b/BtreeIterators.mdwn @@ -0,0 +1,72 @@ +# Bcachefs btree iterators: + +Btree iterators in bcachefs implement more or less the usual operations for +iterating over an ordered key/value store. + +Iterators may be unlocked arbitrarily, due to e.g. transaction restarts (due to +lock ordering, failure to get a journal reservation, btree node splits, etc.). +This means they must remember what they were pointing at, so that subsequent +peek/next etc. operations behave correctly. Also, a given btree transaction may +have multiple iterators in use at the same time; updates to via one iterator +must not invalidate other iterators (but this is a relatively easy requirement +compared to getting restartable iterators correct). + +Iterators may be used to iterate over keys or slots; for slots we iterate over +every possible position, returning `KEY_TYPE_deleted` keys for empty slots. They +can iterate forwards or backwards, and they can be used to iterate over extents +or non-extents. Extents always have a nonzero size and have start and end +positions; iterating over extents in slots mode (where we synthesize ranges to +represent holes between keys in the btree) is used frequently by the filesystem +IO code in bcachefs. + +## Primary iterator state: + + * `iter.pos`: the current position of the iterator. + + The iterator position may be changed directly with + `bch2_btree_iter_set_pos()`, and the various peek()/next()/prev() operations + all update the iterator position to match what they are returning. + + * `iter.l[BTREE_MAX_DEPTH]` (struct `btree_iter_level`) + + These contain, one for each level of the btree, a pointer to a btree node, an + iterator for that btree node (struct `btree_node_iter`, see bset.h), and a + lock sequence number. Thus, we have a full path from the root of the btree down + to a specific key in a specific btree leaf node. + + Btree node iterators are much simpler: they can be treated as if they point + to a single key within a sorted array of keys. Because btree nodes have + multiple sorted arrays of code btree node iterators are internally + maintaining multiple pointers and sorting the results, but this is an + internal implementation detail. + +The most important invariant for btree iterators is that `iter.pos` must always +be consistent with the btree node iterators: that is, for every btree node +iterator, the key the node iterator points to must compare >= `iter.pos`, and +the previous key must compare < `iter.pos`. + +That is - the node iterator must point to the correct position for inserting a +new key at `iter.pos`. + +For the various peek()/next()/prev() operations, `iter.pos` will _usually_ be +updated to point to the start of the key we're returning, and the leaf btree +node iterator will _usually_ point to the key we're returning, unless we're at a +hole - but there are exceptions for extents. + +## Extents: + +When returning an extent, `iter.pos` will typically be updated to point to the +start of the extent we're returning (which is not the key the extent is indexed +by; extents are indexed by where they end, not where they start). + +This may mean that when we return an extent, the leaf btree node iterator won't +necessarily point to the key we're returning - due to our #1 invariant, and the +possible existence of whiteouts (that we filter out when iterating and don't +return to user code). + +Additionally, iter.pos won't be updated to point to the start of the extent +we're returning when that would move iter.pos backwards (unless we're actually +iterating backwards); this is because user code wants to see a monotonically +increasing iterator position (as user code often uses `iter.pos` to track how +far we've gotten/how much work we have to do). + -- cgit v1.2.3