From a6a19dc8080b26a79fb89c1292f541e493fae961 Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Tue, 21 Mar 2017 15:48:25 -0800
Subject: first post!
---
 Architecture.mdwn | 938 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 938 insertions(+)
 create mode 100644 Architecture.mdwn

diff --git a/Architecture.mdwn b/Architecture.mdwn
new file mode 100644
index 0000000..f3f6be7
--- /dev/null
+++ b/Architecture.mdwn

[[!meta stylesheet=BcacheGuide rel="stylesheet" title="doc"]]

# The Programmer's Guide to bcache:

This document is intended to cover the design and core concepts of bcache, and
to serve as a guide for programmers trying to become familiar with the bcache
codebase. It tries to cover how various features are built on top of the core
concepts and how everything fits together.

[[!toc levels=3 startlevel=2]]

## Introduction

The core abstraction in bcache is a key-value store ("the btree"). To the btree
code, the keys it indexes are just large, fixed size bitstrings/integers, with
opaque variable sized values.

Aside from some low level machinery, essentially all metadata is stored as
key/value pairs:

 - Extents are one type of key, where the value is a list of pointers to the
   actual data.

 - Inodes are stored indexed by inode number, where the value is the inode
   struct.

 - Dirents and xattrs are stored with one key per dirent/xattr.

The bulk of the complexity in bcache is in the implementation of the btree;
posix filesystem semantics are implemented directly on top of it, with the
btree doing all the heavy lifting with respect to e.g. locking and on disk
consistency.

In general, bcache eschews heavyweight transactions in favor of ordering
filesystem updates. In this, bcachefs's implementation is in spirit not unlike
softupdates - however, since bcachefs is working with logical index updates and
not physical disk writes, it works out to be drastically simpler. The btree
guarantees that the ordering of all index updates is preserved, without ever
needing flushes - see the section on
[[sequential consistency|BcacheGuide/#index10h3]].

This approach is used for creating/deleting files and links, and for logical
atomicity of appends/truncates. For those unfamiliar with softupdates, this
means that when creating a file we create the inode and then create the
dirent, and when creating a hardlink we increment i_nlinks and then create the
new dirent - and the order is reversed on deletion. On unclean shutdown we
might leak inode references - but this is easily handled with a garbage
collection pass.

We do have some support for transactional updates, however - primarily for
rename().

### Keys (bkey and bpos)

We have separate structs for a search key (`struct bpos`) and for the outer
container that holds a key and a value (`struct bkey`).

Here are their definitions:

    struct bpos {
            u64             inode;
            u64             offset;
            u32             snapshot;
    };

    struct bkey {
            u8              u64s;
            u8              format;
            u8              type;

            u8              pad;
            u32             version;
            u32             size;
            struct bpos     p;
    };

The high bits of the search key are the inode field and the low bits are the
snapshot field (note that on little endian the order of the fields in the
struct will actually be reversed).

Not all code uses all fields of the search key (and snapshot is currently
unused), and some code uses the fields in different ways; it's up to the user
of the btree code to use the fields however they see fit. However, the inode
field generally corresponds to an inode number (of a file in a filesystem),
and for extents the offset field corresponds to the offset within that file.
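To make that ordering concrete, here is a minimal sketch of how two search
keys compare, most significant field first. This is illustrative only - the
real code compares packed keys (described later), not struct bpos directly:

    /*
     * Illustrative only: keys sort by inode, then offset, then snapshot.
     * The actual comparison in the btree code operates on packed keys.
     */
    static int bpos_cmp_example(struct bpos l, struct bpos r)
    {
            if (l.inode != r.inode)
                    return l.inode < r.inode ? -1 : 1;
            if (l.offset != r.offset)
                    return l.offset < r.offset ? -1 : 1;
            if (l.snapshot != r.snapshot)
                    return l.snapshot < r.snapshot ? -1 : 1;
            return 0;
    }

All the keys for a given inode thus sort together, ordered by offset - which
is what makes iterating over a file's extents a simple range scan.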
Some notes on the bkey fields:

 - The u64s field is the size of the combined key and value, in units of u64s.
   It's not generally used directly: use `bkey_val_u64s()` or
   `bkey_val_bytes()` for a higher level interface.

 - The format field is internal to the btree implementation. It's for packed
   bkeys - bkeys are not usually stored in the btree in the in-memory format.
   Only very low level btree code needs to be concerned with this, however.

 - The type field denotes the type of the value. For example, the directory
   entry code defines two different value types, `bch_dirent` (type
   `BCH_DIRENT`, 128) and `bch_dirent_whiteout` (`BCH_DIRENT_WHITEOUT`, 129).

 - The size field is only used by extents (it's 0 for all other keys).

### Values:

Values are stored inline with bkeys. Each value type (for all the types that
have nonempty values) has a struct definition, and then macros generate a
variety of accessor functions and wrapper types.

Also, while values are _stored_ inline with the keys, due to packing they
aren't typically passed around that way; when accessing a key stored in the
btree, the btree code will unpack the key and return a wrapper type that
contains pointers to the unpacked key and the value.

The scheme for the wrapper types is:

    bkey_i      key with inline value
    bkey_s      key with split value (wrapper with pointers to key and value)
    bkey_s_c    const key with split value

The wrapper types above are for generic keys with values; each value type also
has corresponding wrapper types generated for it, with the value pointer
having the correct type.

For example, here's the struct definition for extended attributes:

    struct bch_xattr {
            struct bch_val  v;
            u8              x_type;
            u8              x_name_len;
            u16             x_val_len;
            u8              x_name[];
    };

The wrapper types will be `bkey_i_xattr`, `bkey_s_xattr`, and `bkey_s_c_xattr`.
When the xattr code is doing a lookup in the btree, the btree iterator will
return (via e.g. `btree_iter_peek()`) a key of type `bkey_s_c`, and then the
xattr code will use `bkey_s_c_to_xattr(k)` (after switching on the type field)
to get to the xattr itself.

Code should always `switch()` on or otherwise check the bkey type field before
converting to a particular type; the accessors all assert that the type field
is the expected one, so you'll get a `BUG_ON()` if you don't.
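As a sketch of that pattern: the value type constant `BCH_XATTR` below is a
placeholder (the document doesn't give the actual xattr type constant), and
the wrapper is assumed to expose the value via `.v`, per the scheme above:

    /*
     * Sketch: check the type field before using the typed accessor.
     * BCH_XATTR stands in for the real xattr value type constant.
     */
    static bool xattr_matches(struct bkey_s_c k, const char *name, u8 type)
    {
            struct bkey_s_c_xattr xattr;

            if (k.k->type != BCH_XATTR)
                    return false;

            xattr = bkey_s_c_to_xattr(k);

            return xattr.v->x_type == type &&
                   xattr.v->x_name_len == strlen(name) &&
                   !memcmp(xattr.v->x_name, name, xattr.v->x_name_len);
    }

Converting without checking the type first would trip the assertion in the
accessor.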
### Btree interface:

Keys are indexed in a small number of btrees: one btree for extents, another
for inodes, another for dirents, etc.

We can add new btree types for new features without on disk format changes -
many planned features will be added this way (e.g. for Reed-Solomon/RAID6,
we'll be adding another btree to index stripes by physical device and LBA.
More on this later).

Some important properties/invariants:

 - There are no duplicate keys; insertions implicitly overwrite existing keys.

   Related to this, deletion is not exposed as a primitive: instead, deletion
   is done by inserting a key of type `KEY_TYPE_DELETED`. (The btree code
   internally handles overwrites by setting the old key to `KEY_TYPE_DELETED`,
   which will then be dropped when the relevant btree node is compacted. Old
   deleted keys are never visible outside the btree code.)

 - Ordering of insertions/updates is _always_ preserved, across unclean
   shutdowns and without any need for flushes.

   This is important for the filesystem code. Creating a file or a new
   hardlink requires two operations:

    - create the new inode/increment the inode refcount
    - create the new dirent

   By doing these operations in the correct order we can guarantee that after
   an unclean shutdown we'll never have dirents pointing to nonexistent
   inodes - we might leak inode references, but it's straightforward to
   garbage collect those at runtime.

Lookups/insertions are done via the btree iterator, defined in btree.h:

    struct btree_iter;

    void bch_btree_iter_init(struct btree_iter *,
                             struct cache_set *,
                             enum btree_id,
                             struct bpos);
    int bch_btree_iter_unlock(struct btree_iter *);

    struct bkey_s_c bch_btree_iter_peek(struct btree_iter *);
    void bch_btree_iter_advance_pos(struct btree_iter *);

 - A "`cache_set`" corresponds to a single filesystem/volume - the name comes
   from bcache's starting point as a block cache; it was turned into a
   `cache_set` instead of just a cache when it gained the ability to manage
   multiple devices.

 - The `btree_id` is one of `BTREE_ID_EXTENTS`, `BTREE_ID_INODES`, etc.

 - The bpos argument is the position to start iterating from.

`bch_btree_iter_unlock()` unlocks any btree nodes the iterator has locked; if
there was an error reading in a btree node it'll be returned here.

`bch_btree_iter_peek()` returns the next key at or after the iterator's
current position.

`bch_btree_iter_advance_pos()` advances the iterator's position to immediately
after the last key returned.

Lookup isn't provided as a primitive because most usage is geared more towards
iterating from a given position, but we now have enough to implement it:

    int lookup(struct bpos search)
    {
            struct btree_iter iter;
            struct bkey_s_c k;
            int ret = 0;

            bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, search);

            k = bch_btree_iter_peek(&iter);
            if (!k.k || bkey_cmp(k.k->p, search))
                    ret = -ENOENT;

            bch_btree_iter_unlock(&iter);
            return ret;
    }

If we want to iterate over every key in the btree (or just from a given
point), there's the convenience macro `for_each_btree_key()`:

    struct btree_iter iter;
    struct bkey_s_c k;

    for_each_btree_key(&iter, c, BTREE_ID_EXAMPLE, POS_MIN, k)
            printk("got %llu:%llu\n", k.k->p.inode, k.k->p.offset);

    bch_btree_iter_unlock(&iter);

#### Updates:

Insertion is most often done with `bch_btree_insert_at()`, which takes an
iterator and inserts at the iterator's current position. This is often used in
conjunction with `bch_btree_iter_peek_with_holes()`, which returns a key
representing every valid position (synthesizing one of type `KEY_TYPE_DELETED`
if nothing was found).

This is highly useful for the inode and dirent code. For example, to create a
new inode, the inode code can search for an empty position and then use
`bch_btree_insert_at()` when it finds one, and the btree node's lock will
guard against races with other inode creations.

Note that it might not be possible to do the insert without dropping locks,
e.g. if a split was required; the `BTREE_INSERT_ATOMIC` flag indicates that
the insert shouldn't be done in this case, and `bch_btree_insert_at()` will
return `-EINTR` instead - the caller then loops and retries the operation.
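The shape of such a retry loop might look like the following sketch. The
argument lists of `bch_btree_iter_peek_with_holes()` and
`bch_btree_insert_at()` are deliberately simplified assumptions here, and
`create_at_first_hole()` is a made-up caller, not a function in the codebase:

    /*
     * Sketch: scan forward for an unused position (e.g. an inode number)
     * and insert a new key there, retrying if the atomic insert had to
     * drop locks (-EINTR). Signatures are simplified for illustration.
     */
    int create_at_first_hole(struct cache_set *c, struct bkey_i *new_key,
                             struct bpos start)
    {
            struct btree_iter iter;
            struct bkey_s_c k;
            int ret;

            bch_btree_iter_init(&iter, c, BTREE_ID_EXAMPLE, start);

            while (1) {
                    k = bch_btree_iter_peek_with_holes(&iter);
                    if (!k.k) {
                            ret = -ENOSPC;
                            break;
                    }

                    if (k.k->type != KEY_TYPE_DELETED) {
                            /* position in use: keep scanning */
                            bch_btree_iter_advance_pos(&iter);
                            continue;
                    }

                    /* found a hole: try to insert without dropping locks */
                    new_key->k.p = k.k->p;
                    ret = bch_btree_insert_at(&iter, new_key,
                                              BTREE_INSERT_ATOMIC);
                    if (ret != -EINTR)
                            break;
                    /* -EINTR: locks were dropped, re-check and retry */
            }

            bch_btree_iter_unlock(&iter);
            return ret;
    }

On `-EINTR` the iterator has dropped its locks, so looping back through
`peek_with_holes()` re-checks that the hole is still a hole - which is what
makes the pattern safe against races.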
The extents code also uses a similar mechanism to implement a cmpxchg-like
operation, which is used by things like copygc that move data around in the
background - the index update will only succeed if the original key was still
present, which guards against races with foreground writes.

#### Linked iterators and multiple atomic updates:

These mechanisms are quite new, and the interfaces will likely change and
become more refined in the future:

Two or more iterators (up to a small fixed number) can be linked with
`bch_btree_iter_link()` - this causes them to share locks, so that taking
locks on the same node that would ordinarily conflict (e.g. holding a read
lock on a leaf node with one iterator while updating the same node via
another iterator) won't deadlock.

This makes it possible to get a consistent view of multiple positions in the
btree (or different btrees) at the same time.

Furthermore, updates via one iterator will not invalidate linked iterators -
but be careful, a /pointer/ returned via e.g. `bch_btree_iter_peek()` will
still be invalidated by an update (via `bch_btree_insert_at()`). The other
iterator will be kept internally consistent, though, and it won't have to
drop locks or redo any traversals to continue iterating.

Then, updates to multiple (linked) iterators may be performed with
`bch_btree_insert_at_multi()`: it takes a list of iterators, and keys to
insert at each iterator's position. Either all of the updates will be
performed without dropping any locks, or else none of them (the
`BTREE_INSERT_ATOMIC` flag is implicit here, as it makes no sense to use
`bch_btree_insert_at_multi()` when it's not required). Additionally, the
updates will all use the same journal reservation, so the update will be
atomic on disk as well.

There are some locking considerations that are not yet hidden/abstracted
away:

 - It is not permissible to hold a read lock on any btree node while taking
   an intent lock on another. This is because of a deadlock it would cause
   (an even more obscure version of the one documented in the section on
   [[locking|BcacheGuide/#index8h3]], which I will hopefully document in the
   future).

   This is easy to work around - just traverse the iterator taking intent
   locks first (with peek(), or `bch_btree_iter_traverse()` directly if more
   convenient), and if looping, unlock the iterator holding read locks before
   re-traversing the intent lock iterator. Since the iterator code will first
   attempt to relock nodes (checking the remembered sequence number against
   the lock's current sequence number), there's no real cost to unlocking the
   read iterator.

 - If linked iterators are only being used to take read locks there are no
   ordering requirements, since read locks do not conflict with read locks -
   however, when taking intent locks the iterator pointing to the lower
   position must be traversed first.

   I haven't yet come up with a satisfactory way of having the iterator code
   handle this - currently it's up to the code using linked iterators. See
   `bch_dirent_rename()` for an example.

## Extents, pointers, and data:

Bcache is extent based, not block based; its extents are much like extents in
other filesystems that have them - a variable sized chunk of data. From the
point of view of the index, they aren't positions, they're ranges or half open
intervals (note that a 0 size extent doesn't overlap with anything).

Bcache's extents are indexed by inode:offset, and their size is stored in the
size field in struct bkey. The offset and size are both in 512 byte sectors
(as are the pointer offsets). The offset field denotes the _end_ position of
the extent within the file - a key with offset 8 and size 8 points to the data
for sectors 0 through 7.

(This oddity is a result of bcache's btree being designed for extents first,
with non extent keys coming later - it makes searching for extents within a
certain range cleaner when iterating in ascending order.)
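In other words, the half-open interval of sectors covered by an extent key can
be computed directly from the key. These two helpers are purely illustrative,
not functions from the codebase:

    /* An extent key covers sectors [p.offset - size, p.offset). */
    static u64 extent_start_sector(const struct bkey *k)
    {
            return k->p.offset - k->size;
    }

    static u64 extent_end_sector(const struct bkey *k)
    {
            return k->p.offset;
    }

Since extents sort by their end position, the first extent that can overlap a
range starting at sector N is the first key whose offset is strictly greater
than N - convenient when iterating in ascending order.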
Inside the value is a list of one or more pointers - if there's more than one
pointer, they point to replicated data (or possibly one of the copies is on a
faster device and is considered cached).

Here's a simplified version (in the latest version the extent format has
gotten considerably more complicated in order to incorporate checksums and
compression information, but the concept is still the same):

    struct bch_extent_ptr {
            __u64                   offset:48,
                                    dev:8,
                                    gen:8;
    };

    struct bch_extent {
            struct bch_val          v;

            struct bch_extent_ptr   ptr[0];
    };

 - The dev field is the index of one of the devices in the cache set (i.e.
   volume/filesystem).

 - The offset field is the offset of the start of the pointed-to data,
   counting from the start of the device in units of 512 byte sectors.

 - "gen" is a generation number that must match the generation number of the
   bucket the pointer points into for the pointer to be considered valid (not
   stale).

### Buckets and generations:

This mechanism comes from bcache's origins as just a block cache.

The devices bcache manages are divided up into fixed sized buckets (typically
anywhere from 128k to 2M). The core of the allocator works in terms of buckets
(there's a sector allocator on top, similar to how the slab allocator is built
on top of the page allocator in Linux).

When a bucket is allocated, it is written to once, sequentially: then it is
never written to again until the entire bucket is reused. When a bucket is
reused, its generation number is incremented (and the new generation number is
persisted before the bucket is discarded or written to again). If the bucket
still contained any cached data, incrementing the generation number is the
mechanism that invalidates any still-live pointers pointing into that bucket
(in programming language terminology, bcache's pointers are weak pointers).

This has a number of implications:

 - Invalidating clean cached data is very cheap - there's no cost to keeping a
   device full of clean cached data.

 - We don't persist fine grained allocation information: we only persist the
   current generation number of each bucket, and at runtime we maintain
   in-memory counts of the number of live dirty and cached sectors in each
   bucket - these counts are updated on the fly as the index is updated and
   old extents are overwritten and new ones added.

   This is a performance tradeoff - it's a fairly significant performance win
   at runtime, but it costs us at startup time. Eventually, we'll probably
   implement a reserve that we can allocate out of at startup so we can do the
   initial mark and sweep in the background.

   This does mean that at startup we have to walk the extents btree once to
   repopulate these counts before we can do any allocation.

 - If there's a lot of internal fragmentation, we do require copying garbage
   collection to compact data - rewriting it into new buckets.

 - Since the generation number is only 8 bits, we do have to guard against
   wraparound - this isn't really a performance issue, since wraparound
   requires rewriting the same bucket many times (and incoming writes will be
   distributed across many buckets). We do occasionally have to walk the
   extents to update the "oldest known generation number" for every bucket,
   but the mark and sweep GC code runs concurrently with everything else,
   except for the allocator if the freelist becomes completely drained before
   it finishes (the part of GC that updates per bucket sector counts mostly
   isn't required anymore, and may be removed at some point).
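A pointer's validity check is then just a generation comparison; roughly
speaking, something like the sketch below, where `bucket_gen()` is a stand-in
for however the in-memory bucket metadata is actually accessed:

    /*
     * Sketch: a pointer is stale if its generation no longer matches the
     * generation of the bucket it points into. bucket_gen() is a
     * placeholder for the real in-memory bucket array lookup.
     */
    static bool ptr_is_stale(struct cache_set *c,
                             const struct bch_extent_ptr *ptr)
    {
            return ptr->gen != bucket_gen(c, ptr->dev, ptr->offset);
    }

Incrementing a bucket's generation thus invalidates every pointer into it in
one step, without touching the index - which is why dropping clean cached data
is free.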
### IO path:

XXX: describe what consumes bch_read() and bch_write()

The main entry points to the IO code are

 - `bch_read()`

 - `bch_read_extent()`

 - `bch_write()`

The core read path starts in `bch_read()`, in io.c. The algorithm goes like
this:

 - Iterate over the extents btree (with `for_each_btree_key_with_holes()`),
   starting from the first sector in the request.

 - Check the type of the returned key:

   - Hole? (`KEY_TYPE_DELETED`) - just return zeroes.

   - Extent? Check for a pointer we can read from.

     - If they're all stale, it was a cached extent (i.e. we were caching
       another block device), so handle it like a hole.

     - If the relevant device is missing/not online, return an error.

     - OK, we have a pointer we can read from. If the extent is smaller than
       the request, split the request (with `bio_split()`), and issue the
       request to the appropriate device:sector.

 - Iterate until the entire request has been handled.
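The "check for a pointer we can read from" step can be pictured with a small
sketch. `extent_for_each_ptr()` and `dev_is_online()` are stand-ins for
whatever pointer-iteration and device-availability helpers the code actually
provides, and `ptr_is_stale()` is the staleness check sketched earlier:

    /*
     * Sketch: pick the first usable pointer in an extent, or NULL if
     * none. Helper names are placeholders, not the real API.
     */
    static const struct bch_extent_ptr *
    pick_read_ptr(struct cache_set *c, struct bkey_s_c_extent e)
    {
            const struct bch_extent_ptr *ptr;

            extent_for_each_ptr(e, ptr)
                    if (!ptr_is_stale(c, ptr) && dev_is_online(c, ptr->dev))
                            return ptr;

            /* nothing usable: all pointers stale, or devices missing */
            return NULL;
    }

The real code distinguishes the all-stale case (treated like a hole) from the
missing-device case (an error); the sketch glosses over that distinction.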
The write path is harder to follow because it's highly asynchronous, but the
basic algorithm is fairly simple there too:

 - Set up a key that'll represent the data we're going to write (not the
   pointers yet).

 - Allocate some space to write to (with `bch_alloc_sectors()`); add the
   pointer(s) to the space we allocated to the key we set up previously.

 - Were we able to allocate as much space as we wanted? If not, split both the
   key and the request.

 - Issue the write to the space we just allocated.

 - Loop until we've allocated all the space we want, building up a list of
   keys to insert.

 - Finally, after the data write(s) complete, insert the keys we created into
   the extents btree.

#### Checksumming/compression:

As mentioned previously, bch_extent got more complicated with data
checksumming and compression.

For data checksumming, what we want to do is store the checksum with the key
and pointers, not with the data - if you store the checksum with the data,
verifying the checksum only tells you that you got the checksum that goes with
that data; it doesn't tell you whether you got the data you actually wanted -
perhaps the write never happened and you're reading old data. There are a
number of subtle ways data can be corrupted that naively putting the checksum
with the data won't protect against, and while there are workarounds for most
of them, it's fundamentally not the ideal approach.

So we want to store the checksum with the extent, but now we've got a
different problem - partially overwritten extents. When an extent is partially
overwritten, we don't have a checksum for the portion that's now live, and we
can't compute it without reading that data in, which would not be ideal.

So we need to remember what the original extent was, so that later when that
data is read we can read in the entire original extent, checksum that, and
then return only the part that was live.

Compression leads to a similar issue: if the extent is compressed, we can't
read just part of it and decompress it - we have to be able to read the entire
extent.

One simplifying factor is that since buckets are reused all at once, as long
as any part of the original extent is live the rest of it will still be
there - we don't have to complicate our accounting and have two notions of
live data to make sure the dead parts of the extent aren't overwritten.

So, for compressed or checksummed extents, here's the additional information
we need to store in struct bch_extent:

    struct bch_extent_crc32 {
            __u32                   type:1,
                                    offset:7,
                                    compressed_size:8,
                                    uncompressed_size:8,
                                    csum_type:4,
                                    compression_type:4;
            __u32                   csum;
    };

Now, when trimming an extent, instead of modifying the pointer offset to point
to the start of the new live region (as we still do for non
checksummed/compressed extents), we add to the offset field in the extent_crc
struct - representing the offset into the extent of the currently live
portion. The compressed_size and uncompressed_size fields store the original
sizes of the extent, while the size field in the key continues to represent
the size of the live portion.

Another complicating factor is data migration, from copying garbage collection
and tiering - in general, when they need to move some data we can't expect
them to rewrite every replica of an extent at once (since copygc is needed for
making forward progress, that would lead to nasty deadlocks).

But if we're moving one of the replicas of an extent (or making another copy,
as in the case of promoting to a faster tier), we don't want to have to copy
the dead portions too - we'd never be able to get rid of the dead portions!
Thus, we need to handle extents where different pointers are in different
formats.

Having one bch_extent_crc32 field per pointer (or bch_extent_crc64, for the
case where the user is using 64 bit checksums) would unfortunately make our
metadata excessively large - with two or three way replication that would
roughly double the size of the extents btree, or triple it for 64 bit
checksums.

So the new format of struct bch_extent is a list of mixed pointers and
extent_crc fields: each extent_crc field corresponds to the pointers that come
after it, until the next extent_crc field. See include/uapi/linux/bcache.h for
the gory details, and extents.h for the various functions and macros for
iterating over and working with extent pointers and crc fields.
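To make the bookkeeping concrete, here's a rough sketch of how a read of a
checksummed/compressed extent locates the live bytes within the data it has to
fetch. The arithmetic follows the field descriptions above; the struct and
helper are illustrative only:

    /*
     * Sketch: a checksummed/compressed extent is read (and verified, and
     * possibly decompressed) in full; the live portion then starts
     * crc.offset sectors into the uncompressed data and is k->size
     * sectors long.
     */
    struct live_range {
            unsigned        start_sector;   /* within uncompressed extent */
            unsigned        nr_sectors;
    };

    static struct live_range extent_live_range(const struct bkey *k,
                                               struct bch_extent_crc32 crc)
    {
            return (struct live_range) {
                    .start_sector   = crc.offset,
                    .nr_sectors     = k->size,
            };
    }

Trimming an extent only ever adjusts crc.offset and the key's size; the data
and its checksum stay untouched until the node's bucket is eventually reused.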
#### RAID5/6/Erasure coding:

This part hasn't been implemented yet - but here's the rough design for what
will hopefully be implemented in the near future:

Background: In conventional RAID, people have long noted the "RAID hole" and
wanted to solve it by moving RAID into the filesystem - this is what ZFS does.

The hole is that it's impossible for block layer RAID to atomically write a
stripe: while writing a stripe, there's inevitably going to be a window where
the P and Q blocks are inconsistent and recovery is impossible. The best that
block layer RAID can do is order the writes so that the data blocks are
written first, then the P/Q blocks after, and then on unclean shutdown rebuild
all the P/Q blocks (with perhaps a bitmap to avoid having to resync the whole
device, as md does).

The approach that ZFS takes is to fragment incoming writes into (variable
sized) stripes, so that it can write out the entire stripe all at once and
never have a pointer in its index pointing to an incomplete/inconsistent
stripe.

This works, but fragmenting IO is painful for performance. Not only are writes
fragmented, but reads of the same data are fragmented too; and since the
latency of a read is the latency of its slowest fragment, this drives your
median latency towards your tail latency. Worse, on disks you're spending a
lot more seeks than before; even on flash it's not ideal, because the
fragmenting happens above all the scheduling - both on the host and on the
device - so you're still subject to the tail latency effect.

What we want is to be able to erasure encode unrelated data together - while
avoiding update in place. This is problematic to do for foreground writes, but
if we restrict ourselves to erasure encoding data in the background - perhaps
even as it's being copied from the fast tier (flash) to the slow tier (disk) -
the problem is much easier.

The idea is to still replicate foreground writes, but keep track of the
buckets that contain replicated data. Then, a background job can take e.g. 5
buckets that contain unrelated data, allocate two new buckets, compute the p/q
for the original five buckets, and write them to the two new buckets. Then,
for every extent that points into those first five buckets (found either with
an index scan, or by remembering keys from recent foreground writes), it'll
update each extent, dropping the extra replicas and adding pointers to the p/q
buckets - with a bit set in each pointer telling the read path that to use it,
it has to do a reconstruct read (which entails looking up the stripe mapping
in another btree).

This does mean that we can't reuse any of the buckets in the stripe until each
of them is empty - but this shouldn't be much trouble; we just have to teach
copygc about it when it's considering which buckets to evacuate.

### Copying garbage collection:

### Tiering:

### Allocation:

### Multiple devices:

## Btree internals:

At a high level, bcache's btree is a copy on write b+ tree. The main
difference between bcache's b+ tree and others is that the nodes are very
large (256k is typical) and log structured. Like other COW b+ trees, updating
a node may require recursively rewriting every node up to the root; however,
most updates (to both leaf nodes and interior nodes) can be done with only an
append, until we've written to the full amount of space we originally reserved
for the node.

A single btree node log entry is represented as a header and a list of bkeys,
where the bkeys are all contiguous in memory and in sorted order:

    struct bset {
            /* some fields omitted */
            ...
            u16                     u64s;
            struct bkey_packed      start[0];
    };

Since bkeys are variable length, it's not possible to access keys randomly
without other data structures - we can only iterate sequentially via
`bkey_next()`.
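A sequential walk over one bset therefore looks roughly like the sketch below.
The end-of-bset computation and the use of `bkey_next()` on packed keys are
assumptions made for the illustration; the real code has helpers for this:

    /*
     * Sketch: visit every key in a bset. bkey_next() advances by the
     * current key's u64s field; the end pointer arithmetic here is
     * simplified for illustration.
     */
    static void walk_bset(struct bset *i)
    {
            struct bkey_packed *k = i->start;
            struct bkey_packed *end = (void *) ((u64 *) i->start + i->u64s);

            for (; k < end; k = bkey_next(k))
                    /* process k */;
    }

This is the reason every other lookup structure in the btree node exists: the
keys themselves only support a linear scan.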
A btree node thus contains multiple independent bsets that must all be
searched and iterated over on lookup. At any given time there will be zero or
more bsets that have been written out, and a single dirty bset that new keys
are being inserted into.

As the btree is modified we must maintain the invariant, in memory, that there
are no duplicates (keys that compare as equal), excluding keys marked as
deleted. When an insertion overwrites an existing key, we will mark the
existing key as deleted (by setting `k->type = KEY_TYPE_DELETED`) - but until
the entire node is rewritten the old key will still exist on disk. To handle
this, when a btree node is read, the first thing we do is a mergesort of all
the bsets it contains; as part of the mergesort, duplicate keys are found and
the older bkeys are dropped - carefully matching the same changes we made in
memory when doing the insertions.

This hopefully explains how the lack of deletion as a primitive is a result of
the way the btree is implemented - it's not possible to delete a key except by
inserting a whiteout, which will be dropped when the btree node eventually
fills up and is rewritten.

Once a bset has been written out it may also be sorted, in memory, with other
bsets that have also been written out - we do so periodically so that a given
btree node will have only a few (at most three) bsets in memory: the one being
inserted into will be at most 8 or 16k, with the rest roughly forming a
geometric progression in size, so that sorting the entire node is relatively
infrequent.

This resorting/compacting in memory is one of the main ways bcache is able to
efficiently use such large btree nodes. The other main trick is to take
advantage of the fact that bsets that have been written out are, aside from
resorts, constant; we precompute lookup tables that would be too inefficient
to use if they had to be modified for insertions. The bcache code refers to
these lookup tables as the auxiliary search trees.

### Locking

Bcache doesn't use read/write locks for btree nodes - the locks it uses have
three states: shared, intent and exclusive (SIX locks). Shared and exclusive
correspond to read and write, while intent is sort of in the middle - intent
locks conflict with other intent locks (like write locks), but they don't
conflict with read locks.

The problem intent locks solve is that with a regular read/write lock, a read
lock can't be upgraded to a write lock - that would lead to deadlock when
multiple threads with read locks tried to upgrade. With a complicated enough
data structure, updates will need to hold write locks for exclusion with other
updates for much longer than the part where they do the actual modification
that needs exclusion from readers.

For example, consider the case of a btree split. The update starts at a leaf
node, and discovers it has to do a split. But before starting the split it has
to acquire a write lock on the parent node, primarily to avoid a deadlock with
other splits: it needs at least a read lock on the parent (roughly in order to
lock the path to the child node), but it couldn't then upgrade that read lock
to a write lock in order to update the parent with the pointers to the new
children, because that would deadlock with threads splitting sibling leaf
nodes.

Intent locks solve this problem. When doing a split it suffices to acquire an
intent lock on the parent - write locks are only ever held while modifying the
in memory btree contents (which is a much shorter duration than the entire
split, which requires waiting for the new nodes to be written to disk).
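To recap the three states before looking at the remaining deadlock, here's a
compact way to picture them (the identifiers are illustrative, not necessarily
the names used in the code):

    /*
     * Illustrative summary of SIX lock states and which combinations
     * conflict:
     *
     *                shared    intent    exclusive
     *    shared        ok        ok        blocks
     *    intent        ok      blocks      blocks
     *    exclusive   blocks    blocks      blocks
     */
    enum six_lock_state_example {
            SIX_SHARED,     /* like a read lock */
            SIX_INTENT,     /* conflicts with intent/exclusive, not shared */
            SIX_EXCLUSIVE,  /* like a write lock */
    };

    static bool six_conflicts_example(enum six_lock_state_example a,
                                      enum six_lock_state_example b)
    {
            if (a == SIX_EXCLUSIVE || b == SIX_EXCLUSIVE)
                    return true;
            return a == SIX_INTENT && b == SIX_INTENT;
    }

Holding an intent lock thus reserves the right to upgrade to exclusive later,
while still allowing readers through in the meantime.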
Intent locks with only three states do introduce another deadlock, though:

             Thread A                Thread B
      read    | Parent |              intent
      intent  | Child  |              intent

Thread B is splitting the child node: it's allocated new nodes and written
them out, and now needs to take a write lock on the parent in order to add the
pointers to the new nodes (after which it will free the old child).

Thread A just wants to insert into the child node - it has a read lock on the
parent node and it's looked up the child node, and now it's waiting on thread
B to get an intent lock on the child.

But thread A has blocked thread B from taking its write lock in order to
update the parent node, and thread B can't drop its intent lock on the child
until after the new nodes are visible and it has freed the child node.

The way this deadlock is handled is by enforcing (in `bch_btree_node_get()`)
that we drop read locks on parent nodes _before_ taking intent locks on child
nodes - this means we might race and have the btree node freed out from under
us before we lock it, so we check for that after grabbing the intent lock and
redo the traversal if necessary.

One other thing worth mentioning is that bcache's btree node locks have
embedded sequence numbers, which are incremented when taking and releasing
write locks (much like seqlocks). This allows us to aggressively drop locks
(because we'll usually be able to retake the lock), and we also use it for a
`try_upgrade()` - if we discover we need an intent lock (e.g. for a split, or
because the caller is inserting into a leaf node they didn't get an intent
lock for) we'll usually be able to get it without having to unwind and redo
the traversal.

### Journaling

Bcache's journal is foremost an optimization for the btree. COW btrees do not
require journals for on disk consistency - but having a journal allows us to
coalesce random updates across multiple btree nodes and flush the nodes
asynchronously.

The journal is a purely logical log: a list of insertions - bkeys - to
reinsert on recovery in the same order they're present in the journal.
Provided every index update is journaled (a critical invariant here), redoing
those insertions is an idempotent operation. See `bch_journal_replay_key()` in
journal.c - the journal uses the same insert path as everything else and
doesn't know anything about the structure of the btree.

It's critical that keys appear in the journal in the same order as the
insertions happened, so both are done under the btree node's write lock: see
`bch_btree_insert_and_journal()` in btree.c, which calls `bch_bset_insert()`
at the top (inserting into the btree node in memory) and
`bch_journal_add_keys()` to journal the same key at the bottom.

At this point in the btree insertion path, we've already marked any key(s)
that we overlapped with (possibly more than one for extents) as deleted - i.e.
the btree node is temporarily inconsistent, so the insertion must happen
before we drop our write lock.

So we also have some machinery for journal reservations: we reserve some
amount of space in the journal (in units of u64s, as always) midway through
the insertion path (in `bch_btree_insert_keys()`). The journal reservation
machinery (as well as adding the keys to the journal) is entirely lockless in
the fastpath.
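The ordering can be summarized with the following sketch. The argument lists
here are invented/simplified - the real `bch_bset_insert()` and
`bch_journal_add_keys()` take more context (iterator, journal reservation,
etc.) - and the wrapper function itself is hypothetical:

    /*
     * Sketch of the ordering in bch_btree_insert_and_journal():
     * the in-memory bset update and the journal append both happen under
     * the same btree node write lock. Signatures are simplified for
     * illustration.
     */
    static void insert_and_journal_sketch(struct btree *b,
                                          struct btree_iter *iter,
                                          struct bkey_i *insert)
    {
            /* caller holds the write lock on b */

            /* 1: update the btree node in memory (overwritten keys have
             *    already been whited out) */
            bch_bset_insert(b, iter, insert);

            /* 2: add the same key to the journal, using the reservation
             *    taken earlier in bch_btree_insert_keys() */
            bch_journal_add_keys(b, iter, insert);
    }

Because both steps are covered by one write lock, no other thread can observe
an in-memory update whose keys haven't also been handed to the journal.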
### Sequential consistency

As mentioned in the first section on indexing, bcache's b+ tree provides the
guarantee that the ordering of updates is always preserved - whether they're
to different nodes or different btrees (i.e. dirents vs. inodes).

The journal helps here as well. Ordering is always preserved in the journal
itself: if a key at time t made it to the journal and was present after
unclean shutdown/recovery, all the keys journalled prior to time t will either
be in the journal, or in btree nodes that were flushed to disk.

The btree by itself does not provide ordering, though - if updates happened to
two different leaf nodes, the leaf nodes could have been flushed in any
order - and, importantly, either of them could have been written before the
last journal entry that contained keys for that btree node write.

That is, while typically the journal write will happen before the btree node
is flushed, we don't prevent the btree node from being flushed right away, and
we certainly don't want to: since flushing btree nodes is required both to
reclaim memory and to reclaim space in the journal, the mere thought of the
potential deadlocks is rather horrifying.

Instead, we have a rather neat trick: in struct bset (the common header for a
single btree node write/log entry) we track the most recent sequence number of
all the journal entries the keys in this bset went into.

Then, on recovery, when we're first walking the btree, if we find a bset with
a higher journal sequence number than the most recent journal entry we
actually found, we simply ignore it.

The bset we ignore will likely also contain keys in older journal entries -
however, those journal entries will all be in the set we are replaying,
because they were considered dirty until after the bset was written, and they
were marked as dirty on disk until a journal entry was written after the
bset's write completed - which didn't happen. Thus, ignoring those bsets
cannot cause us to lose anything that won't be replayed.

There is a decent amount of extra machinery required to make this scheme
work - when we find bsets newer than the newest journal entry, we have to
blacklist the journal sequence number they referred to, and we have to record
it as blacklisted so that on the next recovery we don't think there's a
journal entry missing - but that is the central idea.

## Auxiliary search trees

The code for doing lookups, insertions, etc. within a btree node is relatively
separate from the btree code itself, in bset.c - there's a struct
btree_node_iter separate from struct btree_iter (the btree iterator contains
one btree node iterator per level of the btree).

The bulk of the machinery is the auxiliary search trees - the data structures
for efficiently searching within a bset.

There are two different data structures and lookup paths. For the bset that's
currently being inserted into, we maintain a simple table in an array, with
one entry per cacheline of data in the original bset, that tracks the offset
of the first key in that cacheline. This is enough to do a binary search (and
then a linear search when we're down to a single cacheline), and it's much
cheaper to keep up to date.
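Roughly, a lookup over that table looks like the sketch below. The table
layout is hypothetical (the real structures live in bset.c); it cheats by
storing a bpos directly, and reuses the `bpos_cmp_example()` helper sketched
earlier, just to keep the illustration short:

    /*
     * Sketch only: one entry per cacheline of the bset, recording the
     * first key that starts in that cacheline.
     */
    struct aux_entry_example {
            struct bpos     first_key;
            u16             key_offset;     /* in u64s, from start of bset */
    };

    /* return the last cacheline whose first key is <= the search key */
    static unsigned find_cacheline_example(const struct aux_entry_example *tbl,
                                           unsigned nr, struct bpos search)
    {
            unsigned l = 0, r = nr;

            while (l + 1 < r) {
                    unsigned m = l + (r - l) / 2;

                    if (bpos_cmp_example(tbl[m].first_key, search) <= 0)
                            l = m;
                    else
                            r = m;
            }

            return l;
    }

The lookup then finishes with a linear scan (via `bkey_next()`) within the
cacheline the binary search landed on.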
For the const bsets, we construct a binary search tree in an array (same
layout as is used for heaps) where each node corresponds to one cacheline of
data in the original bset, and to the first key within that cacheline (note
that the auxiliary search tree is not full, i.e. not of size (2^n) - 1).
Walking down the auxiliary search tree thus corresponds roughly to doing a
binary search on the original bset - but it has the advantage of much
friendlier memory access patterns, since at every iteration the children of
the current node are adjacent in memory (as are all the grandchildren, and all
the great grandchildren) - meaning that unlike with a plain binary search it's
possible to prefetch.

Then there are a couple of tricks we use to make these nodes as small as
possible:

 - Because each node in the auxiliary search tree corresponds to precisely one
   cacheline, we don't have to store a full pointer to the original key - if
   we can compute, given a node's position in the array/tree, its index in an
   inorder traversal, we only have to store the key's offset within that
   cacheline.

   This is done by `to_inorder()` in bset.c, and it's mostly just shifts and
   bit operations.

 - Observe that as we're doing the lookup and walking down the tree, we have
   constrained the keys we're going to compare against to lie within a certain
   range [l, r).

   Then l and r will be equal in some number of their high bits (possibly
   zero); the keys we'll be comparing against, and our search key, will all be
   equal in those same bits - meaning we don't have to compare against, or
   store, the bits above the position where l and r first differ.

   We also don't have to store all the low bits, either - we only need enough
   bits to correctly pivot on the key the current node points to (call it m);
   i.e. we need enough bits to tell m apart from the key immediately prior to
   m (call it p). We're not looking for strict equality comparisons here,
   since we're going to follow this up with a linear search anyway.

   So the node in the auxiliary search tree (roughly) needs to store the bits
   from where l and r first differed to where m and p first differed - and
   usually that's not going to be very many bits. The full struct bkey has 160
   bit keys, but 16 bit keys in the auxiliary search tree suffice > 99% of the
   time.

   Lastly, since we'd really like these nodes to be a fixed size, we just pick
   a size; when constructing the auxiliary search tree we check whether we
   were able to construct a node within that size, and flag it if not - the
   lookup code then falls back to comparing against the original key. Provided
   this happens rarely enough, the performance impact is negligible.

   The auxiliary search trees were an enormous improvement to bcache's
   performance when they were introduced - before that, the lookup code was a
   simple binary search (eons ago, when keys were still fixed size). On random
   lookups with a large btree the auxiliary search trees are easily over an
   order of magnitude faster.

## Bkey packing

Bkeys have gotten rather large, with 64 bit inodes, 64 bit offsets, a snapshot
field that isn't being used yet, a 32 bit size field that's only used by
extents... That's a lot of wasted bits.

The packing mechanism was added to address this, and as an added bonus it
should allow us to change or add fields to struct bkey (if we ever want to)
without incompatible on disk format changes (forwards and backwards!).

The main constraint is lookup performance - any packing scheme that required
an unpack for comparisons would probably be right out. So if we want to be
able to compare packed keys, the requirement is that packing be order
preserving. That is, we need to define some function pack() such that

    bkey_cmp(l, r) == bkey_cmp(pack(l), pack(r))

The way it works is that for every btree node we define a packed format (and
we recalculate a new packed format when splitting/rewriting nodes); keys that
are packed in the same format may then be compared without unpacking them.

The packed format itself consists of, for each field in struct bkey:

 - the size of that field in the packed key, in bits
 - an offset, to be subtracted before packing and added back when unpacking

Then, provided keys fit into the packed format (their key fields are not less
than the field offset, and after subtracting the offset their fields fit into
the number of bits in the packed key), packing is order preserving - i.e.
packing is allowed to fail.
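A sketch of what packing one field involves, per the scheme just described
(the struct and helper here are illustrative, not the real packing code):

    /*
     * Sketch: packing a single field. Returns false if the value doesn't
     * fit, in which case the key can't be packed in this format (packing
     * is allowed to fail). Illustrative only.
     */
    struct field_format_example {
            u8      bits;           /* size of the field in the packed key */
            u64     field_offset;   /* subtracted before packing */
    };

    static bool pack_field_example(u64 *packed, u64 field,
                                   struct field_format_example f)
    {
            if (field < f.field_offset)
                    return false;

            field -= f.field_offset;
            if (f.bits < 64 && field >= 1ULL << f.bits)
                    return false;

            *packed = field;
            return true;
    }

Because the same subtraction and size check is applied to every key packed in
a given format, comparisons between successfully packed keys give the same
result as comparisons between the originals.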
Then, in the lookup path, we pack the search key into the btree node's packed
format - even better, this helps the auxiliary search tree code, since it no
longer has to deal with long strings of zeroes between the inode and the
offset.

For extents, keys in the packed format are usually only 8 bytes - vs. 32 bytes
for struct bkey. And with no replication, checksumming or compression, the
value is only 8 bytes - so packing can reduce our metadata size by more than
half.

The other interesting thing about this mechanism is that it makes our key
format self describing - we aren't restricted to just defining a format for
packed keys, we can define one for our in memory representation, struct bkey,
as well (and we do).

Then, as long as we have the format corresponding to a key, we can use it - we
just have to be careful not to drop fields we don't understand (which is what
would happen if we converted it to our current in memory representation and
back).

All that's left to do is to store formats (besides the btree node-local
format) in the superblock. Right now the btree node-local format is defined to
be format 0, and the in memory format - struct bkey itself - is format 1,
`BKEY_FORMAT_CURRENT`. Changing struct bkey would be as simple as defining a
new format, and making sure the new format gets added to existing superblocks
when it's used.

## Error handling

This section assumes we have no fallback via replication to try (i.e. retrying
the write to another device) - or we have, and that's failed: we're at the
point where we're forced to return errors up the stack.

Error handling of data writes is nothing special - we can return an error for
just the particular IO and be done with it.

Error handling for metadata IO is a more interesting topic. Fundamentally, if
a metadata write fails (be it journal or btree), we can't return errors for
just some subset of the outstanding operations - we have to stop everything.

This isn't peculiar to bcache - other journaling filesystems have the same
behaviour. There are a few reasons; partly it comes down to the fact that by
the time we do the actual IO, the index update (dirent creation, inode update)
has long since completed, so we can't return an error and unwind - and we
can't realistically backtrack from the contents of a journal write or a btree
node write to the in memory state that's now inconsistent.

So we have to consider the entire in memory state of the filesystem
inconsistent with what's on disk, and the only thing we can really do is an
emergency remount RO.

### Journal IO errors

### Btree IO errors

--
cgit v1.2.3