author    Kent Overstreet <kent.overstreet@linux.dev>  2023-09-22 18:30:23 -0400
committer Kent Overstreet <kent.overstreet@linux.dev>  2023-09-22 18:30:23 -0400
commit    1fec6f13837263d5abc35a87a78999fd26eb580f
tree      9c5a53f684dac0f6d7fa760bc5919d54d98063d5
parent    a04ce4a3391c24e44dadd0d8e54fd4a52173e135
More website improvements

- expand, reorg frontpage
- update roadmap
- new debugging page
- new btree perf numbers

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-rw-r--r--  BtreePerformance.mdwn                    13
-rw-r--r--  Debugging.mdwn                           98
-rw-r--r--  Roadmap.mdwn                            435
-rw-r--r--  Wishlist.mdwn (renamed from Todo.mdwn)   70
-rw-r--r--  index.mdwn                               90
-rw-r--r--  sidebar.mdwn                              2
6 files changed, 339 insertions, 369 deletions
diff --git a/BtreePerformance.mdwn b/BtreePerformance.mdwn
new file mode 100644
index 0000000..ff3fa3c
--- /dev/null
+++ b/BtreePerformance.mdwn
@@ -0,0 +1,13 @@
+
+Here are some btree microbenchmarks, taken September 2023 on a Ryzen 5950X
+with turbo and hyperthreading off:
+
+ rand_insert: 10M with 1 threads in 15 sec, 1480 nsec per iter, 660K per sec
+ rand_insert: 10M with 16 threads in 4 sec, 6330 nsec per iter, 2.41M per sec
+ rand_lookup: 10M with 1 threads in 7 sec, 736 nsec per iter, 1.29M per sec
+ rand_lookup: 10M with 16 threads in 0 sec, 1165 nsec per iter, 13.1M per sec
+
+ rand_insert: 100M with 1 threads in 165 sec, 1573 nsec per iter, 621K per sec
+ rand_insert: 100M with 16 threads in 31 sec, 4849 nsec per iter, 3.15M per sec
+ rand_lookup: 100M with 1 threads in 89 sec, 851 nsec per iter, 1.12M per sec
+ rand_lookup: 100M with 16 threads in 6 sec, 973 nsec per iter, 15.7M per sec
diff --git a/Debugging.mdwn b/Debugging.mdwn
new file mode 100644
index 0000000..6dd9054
--- /dev/null
+++ b/Debugging.mdwn
@@ -0,0 +1,98 @@
+# An overview of bcachefs debugging facilities
+
+Everything about the internal operation of the system should be easily visible
+at runtime, via sysfs, debugfs or tracepoints. If you notice something that
+isn't sufficiently visible, please file a bug.
+
+If something goes wonky or is behaving unexpectedly, there should be enough
+information readily and easily available at runtime to understand what bcachefs
+is doing and why.
+
+Also, when an error occurs, the error message should print out _all_ the
+relevant information we have; it should print out enough information for the
+issue to be debugged, without hunting for more.
+
+And if something goes really wrong and fsck isn't able to recover, there should
+be tooling for working with the developers to get that fixed, too.
+
+## Runtime facilities
+
+For inspection of a running bcachefs filesystem, including questions like "what
+is my filesystem doing and why?", we have:
+
+ - sysfs: `/sys/fs/bcachefs/<uuid>/`
+
+ Here we've got basic information about the filesystem and member devices.
+ There's also an `options` directory which allows filesystem options to be
+ set and queried at runtime, a `time_stats` directory with statistics on
+ various events we track latency for, and an `internal` directory with
+ additional debug info.
+
+ - debugfs: `/sys/kernel/debug/bcachefs/<uuid>/`
+
+ Debugfs also shows the full contents of every btree - all metadata is a key
+ in a btree, so this means all filesystem metadata is inspectable here.
+ There are additional per-btree files that show other useful btree information
+ (how full are btree nodes, bkey packing statistics, etc.).
+
+ - tracepoints and counters
+
+ In addition to the usual tracepoints, we keep persistent counters for every
+ tracepoint event, so that it's possible to see if slowpath events have been
+ occurring without tracing having been previously enabled.
+
+ `/sys/fs/bcachefs/<uuid>/counters` shows, for every event, the number of
+ events since filesystem creation, and since mount.
+
+## Hints on where to get started
+
+Is something spinning? Does the system appear to be trying to get work done,
+without getting anything done?
+
+Check `top`: this shows CPU usage by thread - is something spinning?
+
+Check `perf top`: this shows CPU usage, broken out by function/module - what code is spinning?
+
+Check `perf top -e bcachefs:*`: this shows counters for all bcachefs events - are we hitting a rare or slowpath event?
+
+Is everything stuck?
+
+Check `btree_transactions` in debugfs -
+`/sys/kernel/debug/bcachefs/<uuid>/btree_transactions`; other files there may
+also be relevant.
+
+Is something stuck?
+
+Check sysfs `dev-0/alloc_debug`: this shows various internal allocator state -
+perhaps the allocator is stuck?
+
+Something funny with rebalance/background data tasks?
+
+Check sysfs `internal/rebalance_work` and `internal/moving_ctxts`.
+
+All of this stuff could use reorganizing and expanding, of course.
+
+## Offline filesystem inspection
+
+The `bcachefs list` subcommand lists the contents of the btrees - extents, inodes, dirents, and more.
+
+The `bcachefs list_journal` subcommand lists the contents of the journal. This
+can be used to discover which operation caused an error (e.g. one reported by
+fsck), by searching for the transaction that last updated the key(s) in question.
+
+### Unrepairable filesystem debugging
+
+If there's an issue that fsck can't fix, use the `bcachefs dump` subcommand,
+and then [[magic wormhole|https://github.com/magic-wormhole/magic-wormhole]],
+to send your filesystem metadata to the developers.
+
+## For the developer
+
+Internally, bcachefs uses `printbufs` for formatting text in a generic and
+structured way, and we try to write `to_text()` functions for as many types as
+possible.
+
+This makes it much easier to write good error messages and to add new debug
+tools to sysfs/debugfs: when `to_text()` functions already exist for all the
+relevant types, most of the work is already done.
+
+Please keep to and extend this approach when working on the code.
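+
+As a rough illustration of the pattern - a standalone userspace sketch, not
+the actual kernel API (the real `printbuf` and its helpers live in the
+bcachefs sources and do considerably more) - a `to_text()` function for a
+hypothetical type might look like:
+
+    /* toy printbuf + to_text() pattern */
+    #include <stdarg.h>
+    #include <stdio.h>
+
+    struct printbuf {
+        char   buf[256];    /* the real version grows its buffer dynamically */
+        size_t pos;
+    };
+
+    static void prt_printf(struct printbuf *out, const char *fmt, ...)
+    {
+        va_list args;
+        int n;
+
+        if (out->pos >= sizeof(out->buf))
+            return;
+
+        va_start(args, fmt);
+        n = vsnprintf(out->buf + out->pos, sizeof(out->buf) - out->pos,
+                      fmt, args);
+        va_end(args);
+
+        if (n > 0) {
+            out->pos += n;
+            if (out->pos >= sizeof(out->buf))
+                out->pos = sizeof(out->buf) - 1;    /* output was truncated */
+        }
+    }
+
+    /* hypothetical type: the convention is one to_text() per interesting type */
+    struct bucket_stats {
+        unsigned long nr_buckets;
+        unsigned long nr_free;
+    };
+
+    static void bucket_stats_to_text(struct printbuf *out,
+                                     const struct bucket_stats *s)
+    {
+        prt_printf(out, "buckets:\t%lu\n", s->nr_buckets);
+        prt_printf(out, "free:\t\t%lu\n", s->nr_free);
+    }
+
+    int main(void)
+    {
+        struct bucket_stats s = { .nr_buckets = 1024, .nr_free = 64 };
+        struct printbuf out = { 0 };
+
+        bucket_stats_to_text(&out, &s);
+        fputs(out.buf, stdout);
+        return 0;
+    }
+
+Because `bucket_stats_to_text()` only knows about the printbuf, the same
+output can be routed to a sysfs/debugfs file, an error message, or a
+userspace tool.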
diff --git a/Roadmap.mdwn b/Roadmap.mdwn
index 8106c9b..1e926e9 100644
--- a/Roadmap.mdwn
+++ b/Roadmap.mdwn
@@ -1,232 +1,82 @@
-# bcachefs status, roadmap:
+# Roadmap for future work:
[[!toc levels=3 startlevel=2 ]]
-## Stabilization - status, work to be done
-
-### Current stability/robustness
-
-We recently had a spate of corruptions that users were hitting - at the end of
-it, some users had to wait for new repair code to be written but (as far as I
-know) no filesystems or even significant amounts of data were lost - our check
-and repair code is getting to be really solid. The bugs themselves seem to have
-been resolved, except for (it appears) a bug in the journalling code that was
-causing journal entries to get lost - that one hasn't been reported in awhile
-but we'll still hunt for it some more.
-
-Users have been pushing it pretty hard.
-
-### Tooling
-
-The bcachefs kernel code is also included as a library in bcachefs-tools - 90%
-of the kernel code builds and runs in userspace. This is really useful - it
-makes new tooling easy to develop, and it also means we can test almost all of
-the codebase in userspace, with userspace debugging tools (ASAN and UBSAN in
-userspace catch more bugs than the kernel equivalents, and valgrind also catches
-things ASAN doesn't).
-
-We've got a tool for dumping filesystem metadata as qcow2 images - our normal
-workflow when a user has something go wrong with their filesystem is for them to
-dump the metadata, and then if necessary I can write new repair code and test it
-on the dumped image before touching their filesystem.
-
-A fuse port has been started, but isn't useable yet. This is something we really
-want to have - if a user or customer is hitting a bug that we can't reproduce
-locally, switching to the fuse version to debug it in userspace is going to save
-our bacon someday.
-
-We also have tooling for examining filesystem metadata - there's userspace
-tooling for examining an offline filesystem, and metadata can also be viewed in
-debugfs on a mounted filesystem.
-
-One thing that's missing in the userspace bcachefs tool is access to what's
-published via sysfs when running in the kernel - this is a really important
-debugging tool. A good project for someone new to the code might be to embed an
-http server and implement a sysfs shim.
-
-### Feature stability
-
-Erasure coding isn't quite there yet - there's still oopses being reported, and
-existing tests aren't finding the bugs.
-
-Everything else is considered supported. There's still a bug with the refcounts
-in the reflink code that xfstests sporadically hits, but that's the current area
-of focus and should be closed out soon.
-
-### Test infrastructure
-
-We have [[ktest|https://evilpiepirate.org/git/ktest.git]], which is a major asset.
-It turns virtual machine testing into a simple commandline tool - all test
-output is on standard output, ctrl-c kills the VM and releases all resources.
-It's designed for both interactive development (it tries to be as quick as
-possible from kernel build to when the tests start running) and automated
-testing (test pass/failure is reported as a return code, and implements a
-watchdog in case tests hang).
-
-It provides easy access to ssh, kgdb, qemu's gdb interface, and more. In the
-past we've had it working with the kernel's gcov support for code coverage,
-getting that going again is high on the todo list.
-
-### Testing:
-
-All tests need to be passing. Tests right now are divided between xfstests and
-ktest - ktest is used as a wrapper for running xfstests in a virtual machine,
-and it also has a good number of bcachefs-specific tests.
-
-We may want to move the bcachefs tests from ktest to xfstests - having all our
-tests in the same place would be more convenient and thus help ensure that
-they're getting run.
-
-We're down to ~15 xfstests tests that aren't passing - all are triaged, none
-are particularly concerning. We need to get all of them passing and then start
-leaving test runs going 24/7 - when the test dashboard is all green, that makes
-it much easier to notice and jump on tests that fail even sporadically.
-
-The ktest tests haven't been getting run as much and are in a messier state - we
-need to get all of them passing and ensure that they're being run continuously
-(possibly moving them to xfstests).
-
-We don't yet have an automated mechanism for running xfstests with the full
-matrix of configurable options - we want to be testing with checksumming both on
-and off, data compression on and off, encryption on and off, small btree nodes
-(to stress greater tree heights), and more.
-
-### Test coverage
-
-We need to get code coverage analysis going again - this will definitely
-highlight tests that need to be written, and we also need to add error injection
-testing. bcache/bcachefs used to have error injection tests, but it was with a
-nonstandard error injection framework - some of the error injection points still
-exist, though.
-
-## Performance:
-
-### Core btree:
-
-The core btree code is heavily optimized, and I don't expect much in the way of
-significant gains without novel data structures. Btree performance is
-particularly important in bcachefs because we not shard btrees on a per inode
-basis like most filesystems. Instead have one global extents btree, one global
-dirents btree, et cetera. Historically, this came about because bcachefs started
-as a block cache where we needed a single index that would be fast enough to
-index an entire SSD worth of blocks, and for persistent data structures btrees
-have a number of advantages over hash tables (can do extents, and when access
-patterns have locality they preserve that locality). Using singleton btrees is
-a major simplification when writing a filesystem, but it does mean that
-everything is leaning rather heavily on the performance and scalability of the
-btree implementation.
-
-Bcachefs b-trees are a hybrid compacting data structure - they share some
-aspect with COW btrees, but btree nodes internally are log structured. This
-means that btree nodes are big (default 256k), and have a very high fanout. We
-keep multiple sorted sets of keys within a btree node (up to 3),
-compacting/sorting when the last set, the one that updates go to, becomes too
-large.
-
-Lookups within a btree node use eytzinger layout search trees with summary keys
-that fit in 4 bytes. Eytzinger layout means descendents of any given node are
-adjacent in memory, meaning they can be effectively prefetched, and 4 byte nodes
-mean 16 of them fit on a cacheline, which means we can prefetch 4 levels ahead -
-this means we can pipeline effectively enough to get good lookup performance
-even when our memery accesses are not cached.
-
-On my Ryzen 3900X single threaded lookups across 16M keys go at 1.3M second, and
-scale almost linearly - with 12 threads it's about 12M/second. We can do 16M
-random inserts at 670k per second single threaded; with 12 threads we achieve
-2.4M random inserts per second.
-
-Btree locking is very well optimized. We have external btree iterators which
-make it easy to aggressively drop and retake locks, giving the filesystem as a
-whole very good latency.
-
-Minor scalability todo item: btree node splits/merges still take intent locks
-all the way up to the root, they should be checking how far up they need to go.
-This hasn't shown up yet because bcachefs btree nodes are big and splits/merges
-are infrequent.
-
-### IO path:
-
-IO paths: the read path looks to be in excellent shape - I see around 350k iops
-doing async 4k random reads, single threaded, on a somewhat pokey Xeon server.
-We're nowhere near where we should be on 4k random writes - I've been getting in
-the neighborhood of 70k iops lately, single threaded, and it's unclear where the
-bottleneck is - we don't seem to be cpu bound. Considering the btree performance
-this shouldn't be a core architectural issue - work needs to be done to identify
-where we're losing performance.
-
-Filesystem metadata operation performance (e.g. fsmark): On most benchmarks, we
-seem to be approaching or sometimes mathing XFS on performance. There are
-exceptions: multithreaded rm -rf of empty files is currently roughly 4x slower
-than XFS - we bottleneck on journal reclaim, which is flushing inodes from the
-btree key cache back to the inodes btree.
-
-## Scalability:
-
-There are still some places where we rely on walking/scanning metadata too much.
-
-### Allocator:
-
-For allocation purposes, we divide up the device into buckets (default 512k,
-currently support up to 8M). There is an in-memory array that represents these
-buckets, and the allocator periodically scans this array to fill up freelists -
-this is currently a pain point, and eliminating this scanning is a high priority
-item. Secondarily, we want to get rid of this in memory array - buckets are
-primarily represented in the alloc btree. Almost all of the work to get rid of
-this has been done, fixing the allocator to not rescan for empty buckets is the
-last major item.
-
-### Copygc:
-
-Because allocation is bucket based, we're subject to internal fragmentation that
-we need copygc to deal with, and when copygc needs to run it has to scan the
-bucket arrays to determine which buckets to evacuate, and then it has to scan
-the extents and reflink btrees for extents to be moved.
-
-This is something we'd like to improve, but is not a major pain point issue -
-being a filesystem, we're better positioned to avoid mixing unrelated writes
-which tends to be the main cause of write amplification in SSDs. Improving this
-will require adding a backpointer index, which will necessarily add overhead to
-the write path - thus, I intend to defer this until after we've diagnosed our
-current lack of performance in the write path and worked to make is fast as we
-think it can go. We may also want backpointers to be an optional feature.
-
-### Rebalance:
+## Rebalance work index:
Various filesystem features may want data to be moved around or rewritten in the
background: e.g. using a device as a writeback cache, or using the
`background_compression` option - excluding copygc, these are all handled by the
rebalance thread. Rebalance currently has to scan the entire extents and reflink
-btrees to find work items - and this is more of a pain point than copygc, as
-this is a normal filesystem feature. We should add a secondary index for extents
-that rebalance will need to move or rewrite - this will be a "pay if you're
-using it" feature so the performance impact should be less of a concern than
-with copygc.
-
-Some thought and attention should be given to what else we might want to use
-rebalance for in the future.
-
-There are also current bug reports of rebalance reading data and not doing any
-work; while work is being done on this subsystem we should think about improving
-visibility as to what it's doing.
-
-### Disk space accounting:
-
-Currently, disk space accounting is primarily persisted in the journal (on clean
-shutdown, it's also stored in the superblock) - and each disk space counter is
-added to every journal entry. As the number of devices in a filesystem increases
-this overhead grows, but where this will really become an issue is adding
-per-snapshot disk space accounting. We really need these counters to be stored
-in btree keys, but this will be a nontrivial change as we use a complicated
-system of percpu counters for disk space accounting, with multiple sets of
-counters for each open journal entry - rolling them up just prior to each
-journal write. My plan is to extend the btree key cache so that it can support a
-similar mechanism.
-
-Also, we want disk space accounting to be broken by compression type, for
-compressed data, and compressed/uncompressed totals to both be kept - right now
-there's no good way for users to find out the compression ratio they're getting.
-
-### Number of devices in a filesystem:
+btrees to find work items: in the near future, we'll be adding a
+`rebalance_work` index, which will just be a bitset btree indicating which
+extents need to have work done on them.
+
+The btree write buffer, which e.g. the backpointers code heavily relies upon,
+gives us the ability to maintain this index with minimal runtime overhead.
+
+We also have to start tracking the pending work in the extent itself, with a
+new extent field (e.g. compress with a given compression type, for the
+`background_compression` option, or move to a different target for the
+`background_target` option).
+
+Most of the design and implementation for this is complete - there is still
+some work related to the triggers code remaining.
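+
+Roughly, the shape of the idea (a standalone toy sketch: a plain in-memory
+array stands in for the bitset btree, and the extent fields are hypothetical -
+the real index will be a btree maintained via triggers and the write buffer):
+
+    /* toy model of a "needs work" index for rebalance */
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    #define NR_EXTENTS 16
+
+    struct extent {
+        unsigned compression_target;    /* nonzero: wants background compression */
+        unsigned background_target;     /* nonzero: wants to move to this target */
+    };
+
+    static struct extent extents[NR_EXTENTS];
+    static bool rebalance_work[NR_EXTENTS];    /* the bitset index */
+
+    /* called whenever an extent is updated */
+    static void mark_rebalance_work(unsigned idx)
+    {
+        rebalance_work[idx] = extents[idx].compression_target ||
+                              extents[idx].background_target;
+    }
+
+    /* rebalance only visits flagged extents, instead of scanning everything */
+    static void rebalance_pass(void)
+    {
+        for (unsigned i = 0; i < NR_EXTENTS; i++) {
+            if (!rebalance_work[i])
+                continue;
+            printf("rewriting extent %u\n", i);
+            extents[i].compression_target = 0;
+            extents[i].background_target  = 0;
+            rebalance_work[i] = false;
+        }
+    }
+
+    int main(void)
+    {
+        extents[3].compression_target = 1;    /* e.g. background_compression set */
+        extents[9].background_target  = 2;    /* e.g. background_target set */
+        mark_rebalance_work(3);
+        mark_rebalance_work(9);
+        rebalance_pass();
+        return 0;
+    }
+
+The point is that a rebalance pass only touches flagged extents, rather than
+scanning the entire extents and reflink btrees.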
+
+## Compression
+
+We'd like to add multithreaded compression, for when a single thread is doing
+writes at a high enough rate to be CPU bound.
+
+We'd also like to add support for xz compression: this will be useful when
+doing a "squashfs-like" mode. The blocker is that the kernel only has support
+for xz decompression, not compression - hopefully we can get that added.
+
+## Scrub
+
+We're still missing scrub support. Scrub's job will be to walk all data in the
+filesystem and verify checksums, recovering bad data from a good copy if one
+exists, or notifying the user if data is unrecoverable.
+
+Since we now have backpointers, we have the option of doing scrub in keyspace
+order or LBA order - LBA order will be much better for rotational devices.
+
+## Configurationless tiered storage
+
+We're starting to work on benchmarking individual devices at format time, and
+storing their performance characteristics in the superblock.
+
+This will allow for a number of useful things - including better tracking of
+drive health over time - but the real prize will be tiering that doesn't
+require explicit configuration: foreground writes will preferentially go to
+the fastest device(s), and in the background data will be moved from fast
+devices to slow devices, taking into account how hot or cold that data is.
+
+Even better, this will finally give us a way to manage more than two tiers, or
+performance classes, of devices.
+
+## Disk space accounting:
+
+Currently, disk space accounting is primarily persisted in the journal (on
+clean shutdown, it's also stored in the superblock), and each disk space
+counter is added to every journal entry.
+
+This puts a real limit on how many counters we can maintain, and we'll need to
+do something else before adding e.g. per-snapshot disk space accounting.
+
+We'll need to store disk space accounting as keys in a btree, like all other
+metadata. Since disk space accounting was implemented we've gained the btree
+write buffer code, so this is now possible: on transaction commit, we will do
+write buffer updates with deltas for the counters being modified, and then the
+btree write buffer flush code will handle summing up the deltas and flushing
+the updates to the underlying btree.
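+
+As a rough sketch of the delta scheme (a standalone toy: plain arrays stand in
+for the write buffer and the accounting btree, and the names are made up for
+illustration):
+
+    /* toy model of write-buffer accounting deltas */
+    #include <stdio.h>
+
+    #define NR_COUNTERS 4    /* e.g. per-device or per-snapshot counters */
+    #define WB_SIZE     32
+
+    struct wb_entry {        /* one queued write buffer update */
+        unsigned counter;    /* which accounting counter */
+        long     delta;      /* signed change, in sectors */
+    };
+
+    static struct wb_entry write_buffer[WB_SIZE];
+    static unsigned wb_nr;
+
+    static long accounting[NR_COUNTERS];    /* stands in for the accounting btree */
+
+    /* transaction commit just queues a delta - cheap, no btree update */
+    static void commit_delta(unsigned counter, long delta)
+    {
+        if (wb_nr == WB_SIZE)    /* the real code flushes when the buffer fills */
+            return;
+        write_buffer[wb_nr++] = (struct wb_entry){ counter, delta };
+    }
+
+    /* write buffer flush sums the deltas and applies them to the btree */
+    static void write_buffer_flush(void)
+    {
+        for (unsigned i = 0; i < wb_nr; i++)
+            accounting[write_buffer[i].counter] += write_buffer[i].delta;
+        wb_nr = 0;
+    }
+
+    int main(void)
+    {
+        commit_delta(0, +128);    /* e.g. 128 sectors allocated */
+        commit_delta(1,  -64);    /* e.g. 64 sectors freed elsewhere */
+        commit_delta(0,  +32);
+        write_buffer_flush();
+
+        for (unsigned i = 0; i < NR_COUNTERS; i++)
+            printf("counter %u: %ld sectors\n", i, accounting[i]);
+        return 0;
+    }
+
+The commit path only appends small deltas; the much less frequent write buffer
+flush does the work of summing them up and updating the btree keys.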
+
+We also want to add more counters to the inode. Right now in the inode we only
+track sectors used as fallocate would see them - we should add counters that
+take into account replication and compression.
+
+## Number of devices in a filesystem:
Currently, we're limited to a maximum of 64 devices in a filesystem. It would be
desirable and not much work to lift that limit - it's not encoded in any on disk
@@ -265,98 +115,97 @@ which means that the failure of _any_ three devices would lead to loss of data.
A feature that addresses these issues still needs to be designed.
-### Fsck
+### Support for sub-block size writes
-Fsck performance is good, but for supporting huge filesystems we are going to
-need online fsck.
+Devices with larger than 4k blocksize are coming, and this is going to result
+in more wasted disk space due to internal fragmentation.
-Fsck in bcachefs is divided up into two main components: there is `btree_gc`,
-which walks the btree and regenerates allocation information (and also checks
-btree topology), and there is fsck proper which checks higher level filesystem
-invariants (e.g. extents/dirents/xattrs must belong to an inode, and filesystem
-connectivity).
+Support for writes at smaller than blocksize granularity would be a useful
+addition. The read side is easy: such reads will require bouncing, which we
+already support. Writes are trickier - there are a few options:
-The `btree_gc` subsystem is so named because originally (when bcachefs was
-bcache) not only did not have persistent allocation information, it also didn't
-keep bucket sector counts up to date when adding/removing extents from the btree
-- it relied entirely on periodically rescanning the btree.
+ * Buffer up writes when we don't have full blocks to write? Highly
+ problematic, not going to do this.
+ * Read modify write? Not an option for raw flash, would prefer it to not be
+ our only option
+ * Do data journalling when we don't have a full block to write? Possible
+ solution, we want data journalling anyways
-This capability was kept for a very long time, partly as an aid to testing an
-debugging when uptodate allocation information was being developed, then
-persistent, then transactionally persistent allocation information was
-developed. The capability to run `btree_gc` at runtime has been disabled for
-roughly the past year or so - it was a bit too much to deal with while getting
-everything else stabilized - but the code has been retained and there haven't
-been any real architectural changes that would preclude making it a supported
-feature again - this is something that we'd like to do, after bcachefs is
-upstreamed and when more manpower is available.
+### Journal on NVRAM devices
-Additionally, fsck itself is part of the kernel bcachefs codebase - the
-userspace fsck tool uses a shim layer to run the kernel bcachefs codebase in
-userspace, and fsck uses the normal btree iterators interface that the rest of
-the filesystem uses. The in-kernel fsck code can be run at mount time by
-mounting wiht the `-o fsck` option - this is essentially what the userspace
-fsck tool does, just running the same code in userspace.
+NVRAM devices are currently a niche technology, but one that is perpetually
+promising to become more widely available. We could make use of them for the
+journal; they'd let us cheaply add full data journalling (useful for the
+previous feature) and greatly reduce fsync overhead.
-This means that there's nothing preventing us from running the current fsck
-implementation at runtime, while the filesystem is in use, and there's actually
-not much work that needs to be done for this to work properly and reliably: we
-need to add locking for the parts of fsck are checking nonlocal invariants and
-can't rely solely on btree iterators for locking.
+### Online fsck
-Since backpointers from inodes to dirents were recently added, the checking of
-directory connectivity is about to become drastically simpler and shouldn't need
-any external data structures or locking with the rest of the filesystem. The
-only fsck passes that appear to require new locking are:
+For supporting huge filesystems, and competing with XFS, we will eventually
+need online fsck.
-* Counting of extents/dirents to validate `i_sectors` and subdirectory counts.
+This will be easier in bcachefs than it was in XFS, since our fsck
+implementation is part of the main filesystem codebase and uses the normal
+btree transaction API that's available at runtime.
-* Checking of `i_nlink` for inodes with hardlinks - but since inode backpointers
- have now been added, this pass now only needs to consider inodes that actually
- have hardlinks.
+That means that the current fsck implementation will eventually also be the
+online fsck implementation: the process will mostly be a matter of auditing the
+existing fsck codebase for locking issues, and adding additional locking w.r.t.
+the rest of the filesystem where needed.
-So that's cool.
+## ZNS, SMR device support:
-### Deleted inodes after unclean shutdown
+[[ZNS|https://zonedstorage.io/docs/introduction/zns]] is a technology for
+cutting out the normal flash translation layer (FTL), and instead
+exposing an interface closer to the raw flash.
-Currently, we have to scan the inodes btree after an unclean shutdown to find
-unlinked inodes that need to be deleted, and on large filesystems this is the
-slowest part of the mount process (after unclean shutdown).
+SSDs have large erase units, much larger than the write blocksize, so the FTL
+has to implement a mapping layer as well as garbage collection to expose a
+normal, random write block device interface. Since a copy-on-write filesystem
+needs both of these anyways, adding ZNS support to bcachefs will eliminate a
+significant amount of overhead and improve performance, especially tail latency.
-We need to add a hidden directory for unlinked inodes.
+[[SMR|https://en.wikipedia.org/wiki/Shingled_magnetic_recording]] is another
+technology for rotational hard drives, completely different from ZNS in
+implementation but quite similar in how it is exposed and used. SMR enables
+higher density, and supporting it directly in the filesystem eliminates much
+of the performance penalty.
-### Snapshots:
+bcachefs allocation is bucket based: we allocate an entire bucket, write to it
+once, and then never rewrite it until the entire bucket is empty and ready to
+be reused. This maps nicely to both ZNS and SMR zones, which makes supporting
+both of these directly particularly attractive.
-I expect bcachefs to scale well into the millions of snapshots. There will need
-to be some additional work to make this happen, but it shouldn't be much.
+## Squashfs mode
-* We have a function, `bch2_snapshot_is_ancestor()`, that needs to be fast -
- it's called by the btree iterator code (in `BTREE_ITER_FILTER_SNAPSHOTS` mode)
- for every key it looks at. The current implementation is linear in the depth
- of the snapshot tree. We'll probably need to improve this - perhaps by caching
- a bit vector, for each snapshot ID, of ancestor IDs.
+Right now we have dedicated filesystems, like
+[[squashfs|https://en.wikipedia.org/wiki/SquashFS]], for building and reading
+from a highly compressed but read-only filesystem.
-* Too many overwrites in different snapshots at the same part of the keyspace
- will slow down lookups - we haven't seen this come up yet, but it certainly
- will at some point.
+bcachefs metadata is highly space efficient (e.g. [[packed
+bkeys|https://bcachefs.org/Architecture/#Bkey_packing]]), and could probably
+serve as a direct replacement, with a relatively small amount of work to
+generate a filesystem image in the most compact way. The same filesystem image
+could later be mounted read-write, if given more space.
- When this happens, it won't be for most files in the filesystem, in anything
- like typical usage - most files are created, written to once, and then
- overwritten with a rename; it will be for a subset of files that are long
- lived and continuously being overwritten - i.e. databases.
+Specifying a directory tree to include when creating a new filesystem is
+already a requested feature - other filesystems have this as the `-c` option
+to mkfs.
- To deal with this we'll need to implement per-inode counters so that we can
- detect this situation - and then when we detect too many overwrites, we can
- allocate a new inode number internally, and move all keys that aren't visible
- in the main subvolume to that inode number.
+## Container filesystem mode
-## Features:
+Container filesystems are another special purpose type of filesystem we could
+potentially fold into bcachefs - e.g.
+[[puzzlefs|https://github.com/project-machine/puzzlefs]].
-### SMR device support:
+These have a number of features, e.g. verity support for cryptographically
+verifying a filesystem image (which we can do via our existing cryptographic
+MAC support) - but the interesting feature here is support for referring to
+data (extents, entire files) in a separate location - in this case, the host
+filesystem.
-This is something I'd like to see added - bcachefs buckets map nicely to SMR
-zones, we just need support for larger buckets (on disk format now supports
-them, we'll need to add an alternate in memory represenatation for large
-buckets), and plumbing to read the zone pointers and use them in the allocator.
+### Cloud storage management
-We already have copygc, so everything should just work.
+The previous feature is potentially more widely useful: allowing inodes and
+extents to refer to arbitrary callbacks for fetching/storing data would be the
+basis for managing data in e.g. cloud storage, in combination with local
+storage, in a seamless tiered system.
diff --git a/Todo.mdwn b/Wishlist.mdwn
index 06fbdb5..44a48a0 100644
--- a/Todo.mdwn
+++ b/Wishlist.mdwn
@@ -1,72 +1,4 @@
-# TODO
-
-## Kernel developers
-
-### P0 features
-
- * Disk space accounting rework: now that we've got the btree write buffer, we
- can store disk space accounting in btree keys (instead of the current hacky
- mechanism where it's only stored in the journal).
-
- This will make it possible to store many more counters - meaning we'll be
- able to store per-snapshot-id accounting.
-
- We also need to add separate counters for compressed/uncompressed size, and
- broken out by compression type.
-
- It'd also be really nice to add more counters to the inode for
- compressed/uncompressed size. Right now the inode just has `bi_sectors`,
- which the ondisk size for fallocate purposes, i.e. discounting replication.
- We could add something like
-
- bi_sectors_ondisk_compressed
- bi_sectors_ondisk_uncompressed
-
-### P1 features
-
-### P2 features
-
- * When we're using compression, we end up wasting a fair amount of space on
- internal fragmentation because compressed extents get rounded up to the
- filesystem block size when they're written - usually 4k. It'd be really nice
- if we could pack them in more efficiently - probably 512 byte sector
- granularity.
-
- On the read side this is no big deal to support - we have to bounce
- compressed extents anyways. The write side is the annoying part. The options
- are:
- * Buffer up writes when we don't have full blocks to write? Highly
- problematic, not going to do this.
- * Read modify write? Not an option for raw flash, would prefer it to not be
- our only option
- * Do data journalling when we don't have a full block to write? Possible
- solution, we want data journalling anyways
-
- * Full data journalling - we're definitely going to want this for when the
- journal is on an NVRAM device (also need to implement external journalling
- (easy), and direct journal on NVRAM support (what's involved here?)).
-
- Would be good to get a simple implementation done and tested so we know what
- the on disk format is going to be.
-
-## Developers
-
- * End user documentation needs a lot of work - complete man pages, etc.
-
- * bcachefs-tools needs some fleshing out in the --help department
-
-## Users
-
- * Benchmark bcachefs performance on different configurations:
- (Note that these tests will have to be repeated in the future.)
- * SSD
- * HDD
- * SSD + HDD
-
- * Update the website / Documentation
- * Ask questions (so and we can update the documentation / website).
-
-## Wishlist
+# Wishlist items
* "Seed devices", hard-readonly devices that are CoWed from on write (btrfs
has this; useful for base devices for virtualization, among other things).
diff --git a/index.mdwn b/index.mdwn
index fb0ccb7..13c18e5 100644
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,4 +1,3 @@
-
# "The COW filesystem for Linux that won't eat your data".
Bcachefs is an advanced new filesystem for Linux, with an emphasis on
@@ -17,8 +16,28 @@ from a modern filesystem.
* Scalable - has been tested to 100+ TB, expected to scale far higher (testers wanted!)
* High performance, low tail latency
* Already working and stable, with a small community of users
+* [[Use Cases|UseCases]]
+
+## Documentation
+
+* [[Getting Started|GettingStarted]]
+* [[User manual|bcachefs-principles-of-operation.pdf]]
+* [[FAQ]]
+
+## Debugging tools
-## Philosophy
+bcachefs has extensive debugging tools and facilities for inspecting the state
+of the system while running: [[Debugging]].
+
+## Development tools
+
+bcachefs development is done with
+[[ktest|https://evilpiepirate.org/git/ktest.git]], which is used for both
+interactive and automated testing, with a large test suite.
+
+[[Test dashboard|https://evilpiepirate.org/~testdashboard/ci]]
+
+## Philosophy, vision
We prioritize robustness and reliability over features: we make every effort to
ensure you won't lose data. It's building on top of a codebase with a pedigree
@@ -28,17 +47,76 @@ keeping things stable, and aggressively fixing design issues as they are found;
the bcachefs codebase is considerably more robust and mature than upstream
bcache.
-## Documentation
+The long term goal of bcachefs is to produce a truly general purpose filesystem:
+ - scalable and reliable for the high end
+ - simple and easy to use
+ - an extensible and modular platform for new feature development, based on a
+   core that is a general purpose database, potentially including distributed storage
-* [[Getting Started|GettingStarted]]
-* [[User manual|bcachefs-principles-of-operation.pdf]]
-* [[FAQ]]
+## Some technical high points
+
+Filesystems have conventionally been implemented with a great many
+special-purpose, ad-hoc data structures; a filesystem-as-a-database is a rarer beast:
+
+### btree: high performance, low latency
+
+The core of bcachefs is a high performance, low latency b+ tree.
+[[Wikipedia|https://en.wikipedia.org/wiki/Bcachefs]] covers some of the
+highlights - large, log structured btree nodes, which enable some novel
+performance optimizations. As a result, the bcachefs b+ tree is one of the
+fastest production ordered key value stores around: [[benchmarks|BtreePerformance]].
+
+Tail latency has also historically been a difficult area for filesystems, due
+largely to locking and metadata write ordering dependencies. The database
+approach allows bcachefs to shine here as well: it gives us a unified way to
+handle locking for all on disk state, and to introduce patterns and techniques
+for avoiding the aforementioned dependencies - we can easily avoid holding
+btree locks while doing blocking operations, and as a result benchmarks show
+write performance to be more consistent than even XFS.
+
+#### Sophisticated transaction model
+
+The main interface between the database layer and the filesystem layer provides
+ - Transactions: updates are queued up, and are visible to code running within
+ the transaction, but not the rest of the system until a successful
+ transaction commit
+ - Deadlock avoidance: High level filesystem code need not concern itself with lock ordering
+ - Sophisticated iterators
+ - Memoized btree lookups, for efficient transaction restart handling, as well
+   as greatly simplifying high level filesystem code, which need not pass
+   iterators around just to avoid unnecessary lookups.
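+
+A toy model of the first point - updates that are queued up and visible within
+the transaction, but not to the rest of the system until commit (a standalone
+sketch, not the bcachefs API; locking, restarts and iterators are omitted):
+
+    /* toy model of queued transaction updates */
+    #include <stdio.h>
+
+    #define NR_KEYS     8
+    #define MAX_UPDATES 8
+
+    static long btree[NR_KEYS];    /* stands in for the on-disk btree */
+
+    struct btree_trans {           /* hypothetical, drastically simplified */
+        unsigned keys[MAX_UPDATES];
+        long     vals[MAX_UPDATES];
+        unsigned nr_updates;
+    };
+
+    /* queue an update: not visible outside the transaction until commit */
+    static void trans_update(struct btree_trans *trans, unsigned key, long val)
+    {
+        if (trans->nr_updates == MAX_UPDATES)
+            return;    /* toy: no overflow handling */
+        trans->keys[trans->nr_updates] = key;
+        trans->vals[trans->nr_updates] = val;
+        trans->nr_updates++;
+    }
+
+    /* lookups within the transaction see queued updates first */
+    static long trans_lookup(struct btree_trans *trans, unsigned key)
+    {
+        for (unsigned i = trans->nr_updates; i-- > 0;)
+            if (trans->keys[i] == key)
+                return trans->vals[i];
+        return btree[key];
+    }
+
+    /* commit applies all queued updates at once */
+    static void trans_commit(struct btree_trans *trans)
+    {
+        for (unsigned i = 0; i < trans->nr_updates; i++)
+            btree[trans->keys[i]] = trans->vals[i];
+        trans->nr_updates = 0;
+    }
+
+    int main(void)
+    {
+        struct btree_trans trans = { 0 };
+
+        trans_update(&trans, 3, 42);
+        printf("inside txn:   key 3 = %ld\n", trans_lookup(&trans, 3));    /* 42 */
+        printf("outside txn:  key 3 = %ld\n", btree[3]);                   /* 0 */
+
+        trans_commit(&trans);
+        printf("after commit: key 3 = %ld\n", btree[3]);                   /* 42 */
+        return 0;
+    }
+
+The real transaction machinery layers deadlock avoidance, memoized lookups and
+restart handling on top of this basic model.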
+
+#### Triggers
+
+Like other database systems, bcachefs-the-database provides triggers: hooks run
+when keys enter or leave the btree - this is used for e.g. disk space
+accounting.
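+
+A minimal sketch of the idea - a hook that sees both the old and the new
+version of a key on every update, used here to maintain a sector count
+(a standalone toy, not the actual bcachefs trigger API):
+
+    /* toy model of a btree trigger */
+    #include <stdbool.h>
+    #include <stdio.h>
+
+    static long sectors_used;    /* accounting maintained by the trigger */
+
+    struct extent_key {          /* hypothetical, simplified */
+        unsigned sectors;
+        bool     live;           /* does this slot currently hold a key? */
+    };
+
+    /* the trigger: run whenever a key enters or leaves the btree */
+    static void extent_trigger(const struct extent_key *old_k,
+                               const struct extent_key *new_k)
+    {
+        if (old_k && old_k->live)
+            sectors_used -= old_k->sectors;
+        if (new_k && new_k->live)
+            sectors_used += new_k->sectors;
+    }
+
+    /* the insert/overwrite path passes both old and new keys to the trigger */
+    static void btree_insert(struct extent_key *slot, struct extent_key new_k)
+    {
+        extent_trigger(slot, &new_k);
+        *slot = new_k;
+    }
+
+    int main(void)
+    {
+        struct extent_key slot = { 0 };
+
+        btree_insert(&slot, (struct extent_key){ .sectors = 128, .live = true });
+        btree_insert(&slot, (struct extent_key){ .sectors = 64,  .live = true });
+        printf("sectors used: %ld\n", sectors_used);    /* 64 */
+        return 0;
+    }
+
+The appeal is that every update path gets the derived accounting for free -
+there is no separate bookkeeping to forget.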
+
+Coupled with the btree write buffer code, triggers get us highly efficient
+backpointers (for copygc), and in the future an efficient way to maintain an
+index-by-hash for data deduplication.
+
+### Unified codebase
+
+The entire bcachefs codebase can be built and used either inside the kernel, or
+in userspace - notably, fsck is not a from-scratch implementation, it's just a
+small module in the larger bcachefs codebase.
+
+### Rust
+
+We've got some initial work done on transitioning to Rust, with plans for much
+more: here's an example of walking the btree, from Rust:
+[[cmd_list|https://evilpiepirate.org/git/bcachefs-tools.git/tree/rust-src/src/cmd_list.rs]]
## Contact and support
Developing a filesystem is also not cheap, quick, or easy; we need funding!
Please chip in on [[Patreon|https://www.patreon.com/bcachefs]]
+We're also now offering contracts for support and feature development -
+[[email|kent.overstreet@gmail.com]] for more info. Check the
+[[roadmap|Roadmap]] for ideas on things you might like to support.
+
Join us in the bcache [[IRC|Irc]] channel, we have a small group of bcachefs
users and testers there: #bcache on OFTC (irc.oftc.net).
diff --git a/sidebar.mdwn b/sidebar.mdwn
index fe67d26..e94c7fe 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -9,7 +9,7 @@
* [[Allocator]]
* [[Fsck]]
* [[Roadmap]]
- * [[Todo]]
+ * [[Wishlist]]
* [[TestServerSetup]]
* [[StablePages]]
* [[October 2022 talk for Redhat|bcachefs_talk_2022_10.mpv]]