From ea3eb6d4894969cc940c8d9086212aa8bb124f4a Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Sat, 12 Oct 2019 13:46:24 -0400
Subject: btree nodes doc

---
 BtreeNodes.mdwn | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 BtreeNodes.mdwn

diff --git a/BtreeNodes.mdwn b/BtreeNodes.mdwn
new file mode 100644
index 0000000..8b01643
--- /dev/null
+++ b/BtreeNodes.mdwn
@@ -0,0 +1,87 @@

Btree nodes:
============

struct btree - an in-memory btree node.

Locking:
========

Btree nodes in memory are cached, which means that at any given time, if a
struct btree is unlocked, it may be evicted from the cache or reused for
another btree node. In general, whenever we lock a btree node we must check,
after taking the lock, that it is still the btree node we wanted.

A major simplification is that struct btree itself is never freed, except at
shutdown (though the much larger buffers it points to, which contain the on
disk btree node, may be freed). This means we can drop and retake locks
without needing reference counts.

We aggressively drop and retake locks on btree nodes - a lock on a btree node
should never be held across some other long running operation (e.g. IO). This
is very important for the overall latency of the system.

A single lock is embedded in struct btree: a SIX lock -
shared/intent/exclusive. It’s like a read/write lock, but with an extra
intermediate state, intent. Intent locks conflict with other intent locks, but
do not conflict with read locks, and an intent lock must be held before taking
a write lock. Otherwise, they work like normal read/write locks.

SIX locks also have a sequence number, which is incremented when taking or
releasing a write lock. This is used for dropping and retaking locks: whenever
we have a btree node locked, we record the current sequence number (in struct
btree_iter). We can then unlock, and later retake that lock iff the sequence
number hasn’t changed - if the relock succeeds, our iterator for that btree
node is still valid.

The write lock protects all mutation of struct btree and its contents, with
one major exception: writing a btree node to disk can be done with only a
read lock. For btree node writes, the dirty bit on the btree node provides
exclusion between multiple threads calling `__bch2_btree_node_write()`
simultaneously.

Historically, this was required because the update path (in
btree_update_leaf.c) would get a journal reservation with intent locks held
on the relevant btree nodes, but getting a journal reservation may require
invoking journal reclaim and flushing btree nodes - perhaps the same nodes
the update had locked. The current code no longer blocks on getting a journal
reservation with btree nodes locked, but it’s still desirable to be able to
write a btree node with only a read lock, as we can get a read lock with much
lower latency than an intent lock.

Writing a btree node requires updating b->written (the count of how many
sectors in that btree node have been written to disk; btree nodes are log
structured); this means b->written can only be accessed/trusted with a write
lock held. This also means that in the update path we can’t check whether a
btree node has enough space for an update, or will need to be split, until
after taking a write lock.
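
To make the drop-and-relock pattern concrete, here's a minimal C sketch. The
types and helper names (sketch_*) are stand-ins, not the real six lock API -
the actual code has relock primitives that perform this check - but the logic
is the one described above:

    #include <stdbool.h>

    /* Stand-ins for the real types; the SIX lock state itself is elided: */
    struct sketch_six_lock {
            unsigned        seq;    /* incremented on write lock and write unlock */
    };

    struct sketch_btree {
            struct sketch_six_lock  lock;
    };

    bool sketch_six_trylock_read(struct sketch_six_lock *);
    void sketch_six_unlock_read(struct sketch_six_lock *);

    /*
     * Retake a read lock on a node we previously had locked; @seq was
     * recorded (as in struct btree_iter) while the lock was still held.
     * Succeeds iff no write lock was taken in the interim:
     */
    bool sketch_btree_node_relock(struct sketch_btree *b, unsigned seq)
    {
            if (!sketch_six_trylock_read(&b->lock))
                    return false;

            if (b->lock.seq != seq) {
                    /*
                     * A write lock was taken since @seq was recorded: the
                     * node may have been modified, evicted, or reused, so
                     * any iterator pointing into it is no longer valid:
                     */
                    sketch_six_unlock_read(&b->lock);
                    return false;
            }

            return true;
    }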
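
The dirty-bit exclusion for node writes can be sketched similarly. This is
only an illustration of the idea - the flag name is invented, and the real
`__bch2_btree_node_write()` does considerably more:

    #include <linux/bitops.h>       /* test_and_clear_bit() */

    #define SKETCH_NODE_DIRTY       0

    struct sketch_write_node {
            unsigned long   flags;          /* SKETCH_NODE_DIRTY lives here */
            unsigned        written;        /* sectors written so far, cf. b->written */
    };

    /*
     * Safe to call from multiple threads holding only read locks: the
     * atomic test-and-clear of the dirty bit ensures exactly one caller
     * proceeds to issue the IO:
     */
    void sketch_btree_node_write(struct sketch_write_node *b)
    {
            if (!test_and_clear_bit(SKETCH_NODE_DIRTY, &b->flags))
                    return;         /* clean, or another thread is writing it */

            /* ... write out the new sectors, then update b->written ... */
    }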

Lock ordering:
==============

A thread may have many btree nodes locked simultaneously, which means we have
to define and adhere to a lock ordering or we’ll deadlock. But code using and
traversing btree iterators can’t be expected to know about this lock ordering:
instead, whenever we lock a btree node we check what other nodes we have
locked (via struct btree_trans and btree_iter), and if we’re violating lock
ordering (and trylock fails), locking the btree node fails - and the btree
transaction must be restarted from the top, where bch2_trans_begin() will
re-traverse all iterators in the correct order (a sketch of this retry loop
appears at the end of this page).

Lifecycle:
==========

Btree nodes are cached and kept in a hash table; bch2_btree_node_get() will
look up a btree node in the cache and return it locked. Btree roots are the
exception: they are pinned in memory and may be accessed directly via
`c->btree_roots[BTREE_ID].b`.

After locking a btree root (in btree_iter_traverse_one() ->
btree_iter_lock_root()) we must check that it is still the root of that btree
- just as in bch2_btree_node_get(), where after looking up a node in the hash
table and locking it we must check that it is still the node we wanted.

Btree nodes in the btree cache hash table are indexed by their physical on
disk pointer, not their logical key. This is because btree node split/compact
operations are COW: we allocate new btree node(s), update the parent to point
to the new node(s), and then free the old node - thus we have to add the new
btree nodes to the btree node cache before freeing the old one.
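
Returning to the Lock ordering section above: users of the btree interface
wrap transactions in a retry loop roughly like the following sketch.
sketch_do_operation() is a placeholder, the bch2_trans_begin() prototype is
simplified, and the restart error code is illustrative rather than the exact
one the code uses:

    #include <errno.h>

    struct btree_trans;

    void bch2_trans_begin(struct btree_trans *);    /* simplified prototype */
    int sketch_do_operation(struct btree_trans *);  /* placeholder */

    /*
     * If taking a lock would violate lock ordering (and trylock fails),
     * the operation fails with a restart error; bch2_trans_begin()
     * re-traverses all iterators in the correct order, and we retry:
     */
    int sketch_trans_do(struct btree_trans *trans)
    {
            int ret;

            do {
                    bch2_trans_begin(trans);
                    ret = sketch_do_operation(trans);
            } while (ret == -EINTR);        /* illustrative restart errno */

            return ret;
    }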
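
And a sketch of the "lock, then recheck" pattern used for btree roots
(compare btree_iter_lock_root(), and the analogous recheck after the hash
table lookup in bch2_btree_node_get()); names are stand-ins again:

    struct sketch_node;                     /* an in-memory btree node */

    void sketch_node_lock_read(struct sketch_node *);   /* blocking read lock */
    void sketch_node_unlock_read(struct sketch_node *);

    #define SKETCH_BTREE_ID_NR      4       /* arbitrary */

    struct sketch_fs {
            struct {
                    struct sketch_node *b;  /* cf. c->btree_roots[id].b */
            } btree_roots[SKETCH_BTREE_ID_NR];
    };

    /*
     * The root can change between loading the pointer and taking the lock
     * (splits allocate a new root node), so after locking we must check
     * that we still have the current root, and retry if we raced. This is
     * safe because struct btree is never freed, as noted under Locking:
     */
    struct sketch_node *sketch_lock_root(struct sketch_fs *c, unsigned btree_id)
    {
            for (;;) {
                    struct sketch_node *b = c->btree_roots[btree_id].b;

                    sketch_node_lock_read(b);

                    if (b == c->btree_roots[btree_id].b)
                            return b;       /* still the root */

                    sketch_node_unlock_read(b);
            }
    }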