From ea3eb6d4894969cc940c8d9086212aa8bb124f4a Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Sat, 12 Oct 2019 13:46:24 -0400
Subject: btree nodes doc

---
 BtreeNodes.mdwn | 87 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 BtreeNodes.mdwn

diff --git a/BtreeNodes.mdwn b/BtreeNodes.mdwn
new file mode 100644
index 0000000..8b01643
--- /dev/null
+++ b/BtreeNodes.mdwn
@@ -0,0 +1,87 @@

Btree nodes:
============

struct btree - an in-memory btree node.

Locking:
========

Btree nodes in memory are cached, which means that at any given time, if a
struct btree is unlocked, it may be evicted from the cache or reused for
another btree node. In general, whenever we lock a btree node we must check,
after taking the lock, that it is still the btree node we wanted.

A major simplification is that struct btree itself is never freed, except at
shutdown (though the much larger buffers it points to, which contain the on
disk btree node, may be freed). This means we can drop and retake locks
without needing reference counts.

We aggressively drop and retake locks on btree nodes - a lock on a btree node
should never be held across some other long running operation (e.g. IO). This
is very important for the overall latency of the system.

A single lock is embedded in struct btree: a SIX lock -
shared/intent/exclusive. It’s like a read/write lock, but with an extra
intermediate state, intent. Intent locks conflict with other intent locks, but
do not conflict with read locks, and an intent lock must be held before taking
a write lock. Otherwise, they work like normal read/write locks.

SIX locks also have a sequence number, which is incremented when taking or
releasing a write lock. This is used for dropping and retaking locks: whenever
we have a btree node locked, we record the current sequence number (in struct
btree_iter). We can then unlock, and later retake that lock iff the sequence
number hasn’t changed - if the relock succeeds, our iterator for that btree
node is still valid.

The write lock protects all mutation of struct btree and its contents, with
one major exception: writing a btree node to disk can be done with only a
read lock. For btree node writes, the dirty bit on the btree node provides
exclusion between multiple threads calling `__bch2_btree_node_write()`
simultaneously.

Historically, this was required because the update path (in
btree_update_leaf.c) would get a journal reservation with intent locks held
on the relevant btree nodes, but getting a journal reservation may require
invoking journal reclaim and flushing btree nodes - perhaps the same nodes
the update had locked. The current code no longer blocks on getting a journal
reservation with btree nodes locked, but it’s still desirable to be able to
write a btree node with only a read lock, as we can get a read lock with much
lower latency than an intent lock.

Writing a btree node requires updating b->written (the count of how many
sectors in that btree node have been written to disk; btree nodes are log
structured); this means b->written can only be accessed/trusted with a write
lock held. This also means that in the update path we can’t check whether a
btree node has enough space for an update, or will need to be split, until
after taking a write lock.
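
To make the drop-and-relock pattern concrete, here's a minimal C sketch. The
types and helper names (sketch_*) are stand-ins, not the real six lock API -
the actual code has relock primitives that perform this check - but the logic
is the one described above:

    #include <stdbool.h>

    /* Stand-ins for the real types; the SIX lock state itself is elided: */
    struct sketch_six_lock {
            unsigned        seq;    /* incremented on write lock and write unlock */
    };

    struct sketch_btree {
            struct sketch_six_lock  lock;
    };

    bool sketch_six_trylock_read(struct sketch_six_lock *);
    void sketch_six_unlock_read(struct sketch_six_lock *);

    /*
     * Retake a read lock on a node we previously had locked; @seq was
     * recorded (as in struct btree_iter) while the lock was still held.
     * Succeeds iff no write lock was taken in the interim:
     */
    bool sketch_btree_node_relock(struct sketch_btree *b, unsigned seq)
    {
            if (!sketch_six_trylock_read(&b->lock))
                    return false;

            if (b->lock.seq != seq) {
                    /*
                     * A write lock was taken since @seq was recorded: the
                     * node may have been modified, evicted, or reused, so
                     * any iterator pointing into it is no longer valid:
                     */
                    sketch_six_unlock_read(&b->lock);
                    return false;
            }

            return true;
    }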
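
The dirty-bit exclusion for node writes can be sketched similarly. This is
only an illustration of the idea - the flag name is invented, and the real
`__bch2_btree_node_write()` does considerably more:

    #include <linux/bitops.h>       /* test_and_clear_bit() */

    #define SKETCH_NODE_DIRTY       0

    struct sketch_write_node {
            unsigned long   flags;          /* SKETCH_NODE_DIRTY lives here */
            unsigned        written;        /* sectors written so far, cf. b->written */
    };

    /*
     * Safe to call from multiple threads holding only read locks: the
     * atomic test-and-clear of the dirty bit ensures exactly one caller
     * proceeds to issue the IO:
     */
    void sketch_btree_node_write(struct sketch_write_node *b)
    {
            if (!test_and_clear_bit(SKETCH_NODE_DIRTY, &b->flags))
                    return;         /* clean, or another thread is writing it */

            /* ... write out the new sectors, then update b->written ... */
    }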

Lock ordering:
==============

A thread may have many btree nodes locked simultaneously, which means we have
to define and adhere to a lock ordering or we’ll deadlock. But code using and
traversing btree iterators can’t be expected to know about this lock ordering:
instead, whenever we lock a btree node we check what other nodes we have
locked (via struct btree_trans and btree_iter), and if we’re violating lock
ordering (and trylock fails), locking the btree node fails - and the btree
transaction must be restarted from the top, where bch2_trans_begin() will
re-traverse all iterators in the correct order (a sketch of this retry loop
appears at the end of this page).

Lifecycle:
==========

Btree nodes are cached and kept in a hash table; bch2_btree_node_get() will
look up a btree node in the cache and return it locked. Btree roots are the
exception: they are pinned in memory and may be accessed directly via
`c->btree_roots[BTREE_ID].b`.

After locking a btree root (in btree_iter_traverse_one() ->
btree_iter_lock_root()) we must check that it is still the root of that btree
- just as in bch2_btree_node_get(), where after looking up a node in the hash
table and locking it we must check that it is still the node we wanted.

Btree nodes in the btree cache hash table are indexed by their physical on
disk pointer, not their logical key. This is because btree node split/compact
operations are COW: we allocate new btree node(s), update the parent to point
to the new node(s), and then free the old node - thus we have to add the new
btree nodes to the btree node cache before freeing the old one.
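
Returning to the Lock ordering section above: users of the btree interface
wrap transactions in a retry loop roughly like the following sketch.
sketch_do_operation() is a placeholder, the bch2_trans_begin() prototype is
simplified, and the restart error code is illustrative rather than the exact
one the code uses:

    #include <errno.h>

    struct btree_trans;

    void bch2_trans_begin(struct btree_trans *);    /* simplified prototype */
    int sketch_do_operation(struct btree_trans *);  /* placeholder */

    /*
     * If taking a lock would violate lock ordering (and trylock fails),
     * the operation fails with a restart error; bch2_trans_begin()
     * re-traverses all iterators in the correct order, and we retry:
     */
    int sketch_trans_do(struct btree_trans *trans)
    {
            int ret;

            do {
                    bch2_trans_begin(trans);
                    ret = sketch_do_operation(trans);
            } while (ret == -EINTR);        /* illustrative restart errno */

            return ret;
    }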
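
And a sketch of the "lock, then recheck" pattern used for btree roots
(compare btree_iter_lock_root(), and the analogous recheck after the hash
table lookup in bch2_btree_node_get()); names are stand-ins again:

    struct sketch_node;                     /* an in-memory btree node */

    void sketch_node_lock_read(struct sketch_node *);   /* blocking read lock */
    void sketch_node_unlock_read(struct sketch_node *);

    #define SKETCH_BTREE_ID_NR      4       /* arbitrary */

    struct sketch_fs {
            struct {
                    struct sketch_node *b;  /* cf. c->btree_roots[id].b */
            } btree_roots[SKETCH_BTREE_ID_NR];
    };

    /*
     * The root can change between loading the pointer and taking the lock
     * (splits allocate a new root node), so after locking we must check
     * that we still have the current root, and retry if we raced. This is
     * safe because struct btree is never freed, as noted under Locking:
     */
    struct sketch_node *sketch_lock_root(struct sketch_fs *c, unsigned btree_id)
    {
            for (;;) {
                    struct sketch_node *b = c->btree_roots[btree_id].b;

                    sketch_node_lock_read(b);

                    if (b == c->btree_roots[btree_id].b)
                            return b;       /* still the root */

                    sketch_node_unlock_read(b);
            }
    }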