From 277e72be0b604f9ef8042402b6f0e1d6b9ea5a69 Mon Sep 17 00:00:00 2001
From: Kent Overstreet
Date: Tue, 21 Mar 2017 20:24:37 -0800
Subject: import old content

---
 Encryption.mdwn   | 273 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 Transactions.mdwn | 212 ++++++++++++++++++++++++++++++++++++++++++
 index.mdwn        |  20 ++++
 sidebar.mdwn      |   2 +
 4 files changed, 507 insertions(+)
 create mode 100644 Encryption.mdwn
 create mode 100644 Transactions.mdwn

diff --git a/Encryption.mdwn b/Encryption.mdwn
new file mode 100644
index 0000000..4f8a76c
--- /dev/null
+++ b/Encryption.mdwn
@@ -0,0 +1,273 @@
# bcache/bcachefs encryption design:

This document is intended for review by cryptographers and other experienced
implementers of cryptography code, before the design is frozen. Everything
described in this document has been implemented and tested.

The code may be found at
[[https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-encryption]]

Feedback and comments should be sent to Kent Overstreet,
kent.overstreet@gmail.com.

Main bcachefs page: [[Bcachefs]]

## Intro:

Bcachefs provides whole-filesystem encryption, using ChaCha20/Poly1305.
Encryption may be enabled when creating a filesystem, or it may be enabled
later on an existing filesystem (TODO: implement the interface for enabling
encryption on an existing filesystem - the kernel code exists).

Example:

    $ bcache format --encrypted /dev/sda

(Enter passphrase when prompted)

    $ bcache unlock /dev/sda

(Enter passphrase again)

Then mount as normal:

    $ mount /dev/sda /mnt

## Goals:

Bcachefs encryption is meant to be a clean slate design that prioritizes
security and robustness, and is meant to defend against a wider variety of
adversarial models than is typical in existing filesystem level or block level
encryption.

In particular, the goal is to be secure even when the attacker controls the
storage device itself, and can see reads and writes as they happen and return
arbitrary data from read requests.

## Filesystem vs. directory encryption

We do not currently offer per directory encryption; instead, we take an
"encrypt everything" approach.

Rationale:

With per directory encryption, preventing metadata from leaking is highly
problematic. For example, file sizes: file sizes are required for fsck, so
they would have to be stored unencrypted - or, failing that, we would need
some complicated way of deferring fsck for that part of the filesystem until
the key has been provided. There would be additional complications around
filenames, xattrs, extents (and inline extents), etc. - not necessarily
insurmountable, but they would definitely lead to a more complicated, more
fragile design.

With whole filesystem encryption, it's much easier to say what is and isn't
encrypted: we can encrypt at the granularity of an entire metadata write (a
journal entry, a btree node), and it's much easier to show that the contents -
everything after the header for that particular metadata write - will not
leak.

### Algorithms

By virtue of working within a copy on write filesystem with provisions for ZFS
style checksums (that is, checksums stored with the pointers, not the data),
we're able to use a modern AEAD style construction. We use ChaCha20 and
Poly1305. We use the cyphers directly instead of using the kernel AEAD
library; however, we do follow pretty closely the approach of
[[RFC 7539|https://tools.ietf.org/html/rfc7539]].

Note that ChaCha20 is a stream cypher. This means that using a cryptographic
MAC (which would be highly desirable anyway) is critical, and so is avoiding
nonce reuse. Getting nonces right is where most of the trickiness in
bcachefs's encryption lies.
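To make the construction concrete, here is a minimal sketch in Python of the
AEAD pattern described above, using the ChaCha20-Poly1305 implementation from
the `cryptography` package; the header and payload values are illustrative
stand-ins, not bcachefs's actual on-disk structures:

    import os
    from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

    key = ChaCha20Poly1305.generate_key()   # 256 bit key
    aead = ChaCha20Poly1305(key)

    nonce = os.urandom(12)                  # 96 bit nonce - must never repeat under one key
    header = b"magic, version, flags"       # stays unencrypted, but is authenticated
    payload = b"btree keys or file data"    # gets encrypted

    # the returned ciphertext has the 128 bit Poly1305 tag appended
    ciphertext = aead.encrypt(nonce, payload, header)

    # decryption verifies the MAC over both header and payload; a flipped
    # bit in either one raises cryptography.exceptions.InvalidTag
    assert aead.decrypt(nonce, ciphertext, header) == payload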
The current algorithm choices are not hard coded. Bcachefs already has
selectable checksum types, and every individual data and metadata write has a
field that describes the checksum algorithm that was used. On disk, encrypted
data is represented as a new checksum type - so we now have [none, crc32c,
crc64, chacha20/poly1305] as possible methods for data to be
checksummed/encrypted. If in the future we add new encryption algorithms,
users will be able to switch to the new algorithm on existing encrypted
filesystems; new data will be written with the new algorithm and old data will
be read with the old algorithm until it is rewritten.

### Key derivation, master key

Userspace tooling takes the user's passphrase and derives an encryption key
with scrypt. This key is made available to the kernel (via the Linux kernel's
keyring service) prior to mounting the filesystem.

On filesystem mount, the userspace provided key is used to decrypt the master
key, which is stored in the superblock - also with ChaCha20. The master key is
encrypted with an 8 byte header, so that we can tell if the correct key was
supplied.

TODO: Add a field to the superblock specifying the key derivation function, so
that we can transition to newer KDFs later (e.g. Argon2) or specify cost
parameters.
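A sketch of what the unlock path looks like under the assumptions above -
scrypt as the KDF, ChaCha20 over an 8 byte header plus the master key. The
cost parameters and the `KEY_HEADER` magic are hypothetical, purely for
illustration:

    import hashlib
    from cryptography.hazmat.primitives.ciphers import Cipher, algorithms

    KEY_HEADER = b"bch_key\x00"   # hypothetical 8 byte magic

    def derive_key(passphrase: bytes, salt: bytes) -> bytes:
        # scrypt cost parameters here are placeholders, not the real tooling's
        return hashlib.scrypt(passphrase, salt=salt, n=2**15, r=8, p=1, dklen=32)

    def unlock_master_key(derived_key: bytes, nonce: bytes, encrypted: bytes) -> bytes:
        # ChaCha20 in this library takes a 16 byte nonce
        # (4 byte counter || 12 byte nonce)
        cipher = Cipher(algorithms.ChaCha20(derived_key, nonce), mode=None)
        plaintext = cipher.decryptor().update(encrypted)
        if plaintext[:8] != KEY_HEADER:
            raise ValueError("incorrect passphrase")   # header check failed
        return plaintext[8:]                           # the decrypted master key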
### Metadata

Except for the superblock, no metadata in bcache/bcachefs is updated in
place - everything is more or less log structured. Only the superblock is
stored unencrypted; other metadata is stored with an unencrypted header and
encrypted contents.

The superblock contains:

 * Label and UUIDs identifying the filesystem
 * A list of component devices (for multi-device filesystems), and information
   on their size, geometry, status (active/failed), last used timestamp
 * Filesystem options
 * The location of the journal

For the rest of the metadata, the unencrypted portion contains:

 * 128 bit checksum/MAC field
 * Magic number - identifies a given structure as btree/journal/allocation
   information, for that filesystem
 * Version number (of on disk format), flags (including checksum/encryption
   type)
 * Sequence numbers: journal entries have an ascending 64 bit sequence number,
   btree node entries have a random 64 bit sequence number identifying them as
   belonging to that node. Btree nodes also have a field containing the
   sequence number of the most recent journal entry they contain updates from;
   this is stored unencrypted so it can be used as part of the nonce.
 * Size of the btree node entry/journal entry, in u64s

Btree node layout information is encrypted; an attacker could tell that a
given location on disk was a btree node, but the parts of the header that
indicate which range of the keyspace, which btree ID
(extents/dirents/xattrs/etc.), and which level of the btree the node covers
are all encrypted.

#### Metadata nonces

 * Journal entries use their sequence number - which is unique for a given
   filesystem. When metadata is being replicated and we're doing multiple
   journal writes with the same sequence number - and thus the same nonce - we
   really are writing the same data (we only checksum once, not once per
   write).

 * Btree nodes concatenate a few things for the nonce (see the sketch after
   this list):
   - A 64 bit random integer, which is generated per btree node (btree nodes
     are log structured, so entries within a given btree node share the same
     integer).
   - A journal sequence number. For btree node writes done at around the same
     point in time, this field can be identical in unrelated btree node
     writes - but only for btree node writes done relatively close in time, so
     the journal sequence number plus the previous random integer should be
     more than sufficient entropy.
   - And lastly the offset within the btree node, so that btree node entries
     sharing the same random integer are guaranteed a different nonce.

 * Allocation information (struct prio_set): bcache/bcachefs doesn't have
   allocation information persisted the way other filesystems do, but this is
   our closest equivalent - this structure mainly stores generation numbers
   that correspond to extent pointers.

   Allocation information uses a dedicated, randomly generated 96 bit nonce
   field.
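As an illustration only - the actual on-disk nonce layout differs - here is
one way the btree node fields above could be folded into a 96 bit nonce while
preserving the guarantees argued for:

    import struct

    def btree_node_nonce(node_random: int, journal_seq: int, offset: int) -> bytes:
        # 96 bits total: 64 bits mixing the per-node random integer with the
        # journal sequence number, plus the offset within the node. Entries in
        # the same node share node_random but always differ in offset.
        return struct.pack("<QI", (node_random ^ journal_seq) & (2**64 - 1),
                           offset & (2**32 - 1))

    def journal_nonce(journal_seq: int) -> bytes:
        # journal entries just use their unique, ascending sequence number
        return struct.pack("<QI", journal_seq, 0)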
### Data

Data writes have no unencrypted header: checksums/MACs, nonces, etc. are
stored with the pointers, ZFS style.

Bcache/bcachefs is extent based, not block based: pointers point to variable
sized chunks of data, and we store one checksum/MAC per extent, not per block:
a checksum or MAC might cover up to 64k (extents that aren't checksummed or
compressed may be larger). Nonces are thus also per extent, not per block.

By default, for data extents the Poly1305 MAC is truncated to 80 bits, for
space efficiency reasons. Optionally the full 128 bit MACs may be stored, at
the cost of increasing the size of extents by 8 bytes (with 80 bit MACs, an
extent with a single replica will typically be 32 bytes, or 40 bytes with
128 bit MACs).

This should be completely safe for the vast majority of use cases. Most uses
of cryptographic MACs are in networked applications, where an attacker may be
able to send an unlimited number of forged messages: in that environment, a
64 bit MAC is clearly insufficient - if an attacker is able to send 2^32
forgery attempts (not a huge number these days), the probability of success is
1/2^32 - which is not considered a remotely safe margin by cryptographers.

However, with a filesystem, even in the case of a completely compromised
device (say an attacker has compromised the firmware on the disk, and is able
to return whatever they want when we read a sector) - if the MAC doesn't match
(because the attacker is attempting to forge data), we consider the device to
be failing, and very shortly we're going to stop using it - we won't attempt
to reread data that appears to be corrupt indefinitely. So an attacker gets a
very small number (on the order of 10) of attempts to forge a particular
extent. In the very worst case, if we're trying very hard to migrate data off
a device that appears to be bad, the attacker might get ~10 attempts
multiplied by the number of extents on the device - but the number of forgery
attempts should be clearly bounded.

If the user is in an environment where transient failures/corruption are
expected and should be tolerated, instead of assuming the device is bad (e.g.
the disks are accessed over the network, and the network path is known to
corrupt data) - in that situation 128 bit MACs should be used (and in the
future we may enforce that if the maximum number of read retries is set to
more than a small number, 128 bit MACs must be used if encryption is in use).

#### Extent nonces

We don't wish to simply add a random 96 bit nonce to every extent - that would
inflate our metadata size by a very significant amount. Instead, keys (of
which extents are a subset) have a 96 bit version number field; when
encryption is enabled, we ensure that version numbers are enabled and every
new extent gets a new, unique version number.

However, extents may be partially overwritten or split, and then to defragment
we may have to rewrite those partially overwritten extents elsewhere. We
cannot simply pick a new version number when we rewrite an extent - that would
break the semantics other uses of version numbers expect.

When we rewrite an extent, we only write the currently live portions of the
extent - we don't rewrite the parts that were overwritten. So we can't write
it out with the same nonce as the original extent.

If we concatenated the version number with the offset within the file and the
extent's current size, that would work - except that it would break
fcollapse(), which moves extents to a new position within a file. We are
forced to add some additional state to extents.

We could add a small counter that is incremented every time the size of an
extent is reduced (and the data it points to changes); we can easily bound the
size of the counter we need by the maximum size of a checksummed extent. But
this approach fails when extents are split.

What can work is if we add a field for "offset from the start of the original
extent to the start of the current extent" - updating that field whenever we
trim the front of an extent.

If we have that, then we could simply skip ahead in the keystream to where the
currently live data lived in the original extent - there's no problem with
nonce reuse if you're encrypting exactly the same data. Except - that fails
with compression, since if we take an extent, drop the first 4k, and compress
it, that won't give the same data as if we compress it and then drop the first
4k of the compressed data.

The approach almost works, though, if we take that offset and use it as part
of our nonce: what we want to do is construct a function that will output the
same nonce iff two extents (fragments of the same original extent) really are
the same data.

Offset into the original extent works in the absence of compression - two
fragments with the same offset but different sizes will be equal in their
common prefix, ignoring compression. We can handle compression if we also
include both the current size and the current compression function - offset +
current size uniquely determines the uncompressed data, so offset + current
size + compression function will uniquely determine the compressed output.
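A sketch of that nonce function, with hypothetical field widths (the real
extent format packs these differently): two fragments of the same original
extent produce the same nonce exactly when these inputs guarantee identical
compressed bytes:

    import struct

    def extent_nonce(version: int, offset_into_original: int,
                     current_size: int, compression_type: int) -> bytes:
        # offset + current size determine the uncompressed data of a fragment
        # of a given (version-numbered) original extent; adding the
        # compression type pins down the compressed output as well
        return struct.pack("<QHBB",
                           version & (2**64 - 1),
                           offset_into_original & 0xffff,
                           current_size & 0xff,
                           compression_type & 0xff)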
#### Nonce reuse on startup

After recovery, we must ensure we don't reuse existing version numbers - we
must ensure that newly allocated version numbers are strictly greater than any
version number that has ever been used before.

The problem here is that we use the version number to write the data before
adding the extent with that version number to the btree: after an unclean
shutdown, there will have been version numbers used to write data for which we
have no record in the btree.

The rigorous solution to this is to add a field (likely to the journal header)
indicating that version numbers smaller than that field may have been used.
However, we don't do that yet - it's not completely trivial, since it'll add
another potential dependency in the IO path that needs some analysis.

The current solution implemented by the code is to scan every existing version
number (as part of an existing pass), and set the next version number to
allocate to be 64k greater than the highest existing version number that was
found.

diff --git a/Transactions.mdwn b/Transactions.mdwn
new file mode 100644
index 0000000..3667758
--- /dev/null
+++ b/Transactions.mdwn
@@ -0,0 +1,212 @@

## This was originally posted on Patreon, reproduced here for the background on how transactions/atomicity works in bcachefs:

So, for some background: every remotely modern filesystem has some sort of
facility for what database people just call transactions - doing complex
operations atomically, so that in the event of a crash the operation either
happened completely or not at all - we're never left in a halfway state.

For example, creating a new file requires doing a few different things:

- allocating a new inode

- creating a new dirent

- when creating a new directory, updating i_nlinks in the parent directory

If you were to crash halfway through, your filesystem is now inconsistent:
you've either got a new inode that doesn't have any dirents pointing to it, or
a dirent that points to a nonexistent inode, and detecting and fixing this
requires a global scan of every inode and dirent in the filesystem. This is
why older filesystems required fsck after unclean shutdown.

The first journalling filesystems (e.g. ext3) took one approach to solving
this: essentially, for every complex operation that they want to make atomic,
when they start that operation they first write a description of what they're
about to do to a central log - the journal. E.g. "I'm creating a new file; I
allocated inode number 3487 and I'm going to create a new dirent named foo in
the directory bar". Then, after that description is written to the journal,
they make all the required changes on disk - and once those are complete, they
can write another entry in the journal indicating that the create has been
finished.

But none of the first journalling filesystems were really clean slate
designs - they were all, to a greater or lesser extent, bolting journalling
onto a classical Unix filesystem design. As a result, these log entries in the
journal end up getting rather complicated - they're essentially describing, in
some detail, each and every operation a filesystem does that changes its
metadata in some fashion. The code for managing the journal and replaying it
on startup gets to be rather nontrivial too.

There's a wide variety of techniques when it comes to filesystem journalling,
and as a disclaimer I'm not terribly qualified to talk about what other
filesystems do. Btrfs takes a completely different approach - naturally, its
COW btrees are a large part of the story - but it too has grown rather
complicated logging that I'm not at all qualified to talk about.

Bcachefs takes a different approach - I don't know of any other filesystems
that do anything like what bcachefs does, although I wouldn't be surprised if
the techniques I'm using are well known in database land.

Bcachefs currently has no real transactions. There is a journal, but its only
purpose is to log btree updates - it's only used by the btree code; no other
code makes direct use of it. And it doesn't log complex operations - all it
logs is a list of keys to be inserted into the btree: journal replay literally
consists of just re-inserting all the keys in the journal into the btree, in
the same order as they were present in the journal:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/journal.c?h=bcache-dev#n1218
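In rough terms, replay is just this - a sketch assuming a
`btree_insert(btree_id, key)` primitive; the names are hypothetical, not the
kernel's actual interface:

    def journal_replay(journal_entries, btree_insert):
        # entries are processed in ascending sequence number order
        for entry in journal_entries:
            for btree_id, key in entry.keys:
                # same code path as ordinary runtime btree updates
                btree_insert(btree_id, key)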
So then, how does the filesystem do complex operations - like a create or a
rename - atomically?

The first part is that bcachefs has a different design than other
filesystems - you could say that bcachefs the filesystem sits on top of
bcache, the key/value store. Other filesystems may have operations like
"allocate a new block, add it to this inode's block map, update the inode's
i_size and i_blocks to be consistent with the new block - oh, and this all has
to be ordered correctly with writing the data to the new block". In bcachefs,
every operation is just one or more keys to be inserted into the various
btrees - the extents btree, the inodes btree, the dirents btree, et cetera.
So, at the level of the filesystem code in bcachefs, the "complex operations"
are drastically simpler than in other filesystems.

So, to do those operations atomically, all we need is the ability to do
multiple btree updates atomically - for example, to append to a file we update
the extents btree with the new extent atomically with updating the inode in
the inodes btree with the new i_size and i_blocks. For a rename, we create the
new dirent atomically with deleting the old dirent (in bcache's btrees, a
deletion is just an insertion of a key with type KEY_TYPE_DELETED - a
whiteout - which overwrites the old key).
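Expressed as a sketch - with made-up names; the real interface is
bch_btree_insert_trans(), linked below - a rename becomes a single atomic
multi-key update:

    WHITEOUT = object()   # stand-in for a key of type KEY_TYPE_DELETED

    def rename(trans_insert, src_dir, src_name, dst_dir, dst_name, inum):
        # both insertions commit atomically: after a crash, either both
        # happened or neither did
        trans_insert([
            ("dirents", (dst_dir, dst_name), inum),      # create the new dirent
            ("dirents", (src_dir, src_name), WHITEOUT),  # overwrite the old one
        ])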
For anyone who wants to peruse the code - I wouldn't call it particularly
readable, but bch_btree_insert_trans() is where the magic happens. Here's the
code that handles updating i_size and i_blocks when updating extents:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/fs-io.c?h=bcache-dev#n263

And rename:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/dirent.c?h=bcache-dev#n190

The really nice thing about this approach is that we're not logging complex
operations, we're just doing them atomically - so on recovery, that rename or
that append either fully completed or it never happened; we aren't messing
with replaying creates or renames or appends. The journal replay code in
bcachefs is no more complicated than in upstream bcache - it's still just
reinserting a list of keys, one by one. That really is a nontrivial
advantage - journal recovery/log replay code is particularly fiddly to test.
That code is not going to get run and tested nearly as often as the rest of
your filesystem code, so you really want as little of it as possible.

Here's the journal replay code, by the way - note that it's using the normal
btree update interface:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/journal.c?h=bcache-dev#n1218

Now, the downside: this approach only works for operations that require
updating a small number of keys. That covers all the common, fast path stuff
in a filesystem - if it requires updating many keys, it had better not be in
your fast path! But it doesn't quite cover everything.

fcollapse() is a good example. For those who aren't aware of what it does, it
allows you to delete a range in the middle of a file - punching out a hole -
but also shifting all the data above the hole down, so that there's no gap.

Right now, fcollapse() in bcachefs is not atomic. If you crash in the middle
of a collapse operation, your filesystem will be fine - it'll be consistent,
and you won't run into any filesystem errors - but the file you were running
fcollapse() on is going to be left somewhere in the middle, with some of the
data shifted down and some not. In other words, corrupted, and not what you
wanted.

So, what's the plan?

Well, we can put multiple keys in the same journal entry - that's how the
existing bch_btree_insert_trans() works:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/btree_update.c?h=bcache-dev#n1576

But that doesn't mean we can stick arbitrary numbers of keys in the same
journal entry. For one, the journal wasn't designed for it - a journal entry
has to fit in a single journal write, and also, while you hold a journal
reservation you block that journal entry from being written; hold it too long
and you'll eventually block other processes from getting new journal
reservations. Also, the journal reservation must be acquired after acquiring
the intent lock on the btree node(s) to be updated - otherwise, updates in the
journal could appear in the wrong order, since the order things appear in the
journal is determined by the order in which the reservations were acquired.

So, we don't want to use the existing journal machinery for this. That's good,
though - it does its job well, and if we tried retrofitting new, more
complicated types of transactions onto it, we'd be making a lot of core
functionality drastically more complicated.

But note that we do have the ability to update multiple keys in multiple
btrees simultaneously. What if we added a new "transactions" btree? A key in
that btree would correspond to one of our new, heavyweight transactions: as
the operation proceeded, whenever it did some work (updating keys in the
various btrees) it would also update its transaction key with the current
state.

Here's the fcollapse() code - notice that the main part of it consists of a
loop while it walks all the extents to be moved:

https://evilpiepirate.org/git/linux-bcache.git/tree/drivers/md/bcache/fs-io.c?h=bcache-dev#n1866

It should be a relatively straightforward conversion - essentially, all we'd
be doing is taking that function's various local variables that track where it
is as it's looping over the extents and moving them into its transaction key,
so that they also get persisted whenever it does some work.
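Roughly - and with every name here being hypothetical - the converted loop
might look like this: the cursor is persisted atomically with each batch of
work, so after a crash the operation can be resumed rather than left half
done:

    from collections import namedtuple

    Extent = namedtuple("Extent", ["offset", "size", "data_ptr"])

    def shift_down(extent: Extent, hole_len: int) -> Extent:
        # the data doesn't move on disk; only the logical offset changes
        return extent._replace(offset=extent.offset - hole_len)

    def fcollapse(trans, inode, hole_start, hole_len):
        # resume from the persisted transaction key after a crash,
        # or start fresh
        state = trans.lookup("transactions", inode) or {"pos": hole_start}

        for extent in trans.extents_after(inode, state["pos"] + hole_len):
            trans.insert_atomically([
                ("extents", shift_down(extent, hole_len)),        # move one extent
                ("transactions", inode, {"pos": extent.offset}),  # persist the cursor
            ])

        trans.insert_atomically([
            ("transactions", inode, None),   # done - delete the transaction key
        ])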
There's one annoying issue with this approach - one would expect that in
typical filesystem usage, there would be a small number of outstanding
transactions at any given time - so they'll all fit in the same btree node.
But they'll all be updating that same btree node constantly, every time they
do a bit of work - so we'd be introducing quite a bit of lock contention. Ick.

But the keys are never going to be read at runtime - there wouldn't ever be
any reason to do a lookup on any key in BTREE_ID_TRANSACTIONS at runtime - so
why have a btree at all? Why not _only_ store them in the journal?

That's the approach I'm taking. I'm adding a new btree ID, but it's just so
that the journal code and bch_btree_insert_trans() can distinguish them -
these new transaction keys will only be persisted in the journal. You can
check out my (in progress, definitely not working yet) code here:

https://evilpiepirate.org/git/linux-bcache.git/log/?h=bcache-transactions

Also, my main motivation for working on this wasn't fcollapse() - fcollapse is
just a useful test case for the new approach. The immediate motivation is
actually correctly implementing i_generation for nfs export support: nfs
requires that inodes have an i_generation field such that (inode nr,
i_generation) uniquely identifies an inode for the life of a filesystem. But
doing that correctly requires saving some additional state somewhere that we
update atomically with creating a new inode (or else saving a placeholder with
the current i_generation when we delete an inode, and reusing deleted inodes -
but that will have major downsides when using snapshots). Snapshots are also
going to require this new transaction machinery, but that's quite a ways off.

diff --git a/index.mdwn b/index.mdwn
index a4c60f9..7d69051 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,8 +1,11 @@
+# Bcachefs
+
 Bcachefs is an advanced new filesystem for Linux, with an emphasis on
 reliability and robustness:
 
 * Copy on write (COW) - like zfs or btrfs
+* Good performance - significantly better than existing copy on write filesystems, comparable to ext4/xfs
 * Metadata and data checksumming
 * Multiple devices, including replication and other types of RAID
@@ -12,3 +15,20 @@ reliability and robustness:
 * Snapshots
 * Scalable - has been tested to 50+ TB, will eventually scale far higher
 * Already working and stable, with a small community of users
+
+## Getting started
+
+Bcachefs is not yet upstream - you'll have to build a kernel to use it.
+
+First, check out the bcache kernel and tools repositories:
+
+    git clone https://evilpiepirate.org/git/bcachefs.git
+    git clone https://evilpiepirate.org/git/bcachefs-tools.git
+
+Build and install as usual - make sure you enable `CONFIG_BCACHE_FS`. Then, to
+format and mount a single device with the default options, run:
+
+    bcachefs format /dev/sda1
+    mount -t bcachefs /dev/sda1 /mnt
+
+See `bcachefs format --help` for more options.

diff --git a/sidebar.mdwn b/sidebar.mdwn
index 34f2926..abe1356 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -2,3 +2,5 @@
 * [[About]]
 * [[FAQ]]
 * [[Architecture]]
+ * [[Encryption]]
+ * [[Transactions]]
-- 
cgit v1.2.3