-rw-r--r--  BtreePerformance.mdwn                  |  13
-rw-r--r--  Debugging.mdwn                         |  98
-rw-r--r--  Roadmap.mdwn                           | 435
-rw-r--r--  Wishlist.mdwn (renamed from Todo.mdwn) |  70
-rw-r--r--  index.mdwn                             |  90
-rw-r--r--  sidebar.mdwn                           |   2
6 files changed, 339 insertions, 369 deletions
diff --git a/BtreePerformance.mdwn b/BtreePerformance.mdwn
new file mode 100644
index 0000000..ff3fa3c
--- /dev/null
+++ b/BtreePerformance.mdwn
@@ -0,0 +1,13 @@
+
+Here are some btree microbenchmarks, taken September 2023, on a Ryzen 5950X,
+turbo boost and hyperthreading off:
+
+    rand_insert: 10M with 1 threads in 15 sec, 1480 nsec per iter, 660K per sec
+    rand_insert: 10M with 16 threads in 4 sec, 6330 nsec per iter, 2.41M per sec
+    rand_lookup: 10M with 1 threads in 7 sec, 736 nsec per iter, 1.29M per sec
+    rand_lookup: 10M with 16 threads in 0 sec, 1165 nsec per iter, 13.1M per sec
+
+    rand_insert: 100M with 1 threads in 165 sec, 1573 nsec per iter, 621K per sec
+    rand_insert: 100M with 16 threads in 31 sec, 4849 nsec per iter, 3.15M per sec
+    rand_lookup: 100M with 1 threads in 89 sec, 851 nsec per iter, 1.12M per sec
+    rand_lookup: 100M with 16 threads in 6 sec, 973 nsec per iter, 15.7M per sec
diff --git a/Debugging.mdwn b/Debugging.mdwn
new file mode 100644
index 0000000..6dd9054
--- /dev/null
+++ b/Debugging.mdwn
@@ -0,0 +1,98 @@
+# An overview of bcachefs debugging facilities
+
+Everything about the internal operation of the system should be easily visible
+at runtime, via sysfs, debugfs or tracepoints. If you notice something that
+isn't sufficiently visible, please file a bug.
+
+If something goes wonky or is behaving unexpectedly, there should be enough
+information readily and easily available at runtime to understand what bcachefs
+is doing and why.
+
+Also, when an error occurs, the error message should print out _all_ the
+relevant information we have; it should print out enough information for the
+issue to be debugged, without hunting for more.
+
+And if something goes really wrong and fsck isn't able to recover, there should
+be tooling for working with the developers to get that fixed, too.
+
+## Runtime facilities
+
+For inspection of a running bcachefs filesystem, including questions like "what
+is my filesystem doing and why?", we have:
+
+ - sysfs: `/sys/fs/bcachefs/<uuid>/`
+
+   Here we've got basic information about the filesystem and member devices.
+   There's also an `options` directory which allows filesystem options to be
+   set and queried at runtime, a `time_stats` directory with statistics on
+   various events we track latency for, and an `internal` directory with
+   additional debug info.
+
+ - debugfs: `/sys/kernel/debug/bcachefs/<uuid>/`
+
+   Debugfs also shows the full contents of every btree - all metadata is a key
+   in a btree, so this means all filesystem metadata is inspectable here.
+   There are additional per-btree files that show other useful btree
+   information (how full btree nodes are, bkey packing statistics, etc.).
+
+ - tracepoints and counters
+
+   In addition to the usual tracepoints, we keep persistent counters for every
+   tracepoint event, so that it's possible to see if slowpath events have been
+   occurring without tracing having been previously enabled.
+
+   `/sys/fs/bcachefs/<uuid>/counters` shows, for every event, the number of
+   events since filesystem creation, and since mount.
+
+## Hints on where to get started
+
+Is something spinning? Does the system appear to be trying to get work done,
+without getting anything done?
+
+Check `top`: this shows CPU usage by thread - is something spinning?
+
+Check `perf top`: this shows CPU usage, broken out by function/module - what
+code is spinning?
+
+Check `perf top -e bcachefs:*`: this shows counters for all bcachefs events -
+are we hitting a rare or slowpath event?
+
+Is everything stuck?
+
+Check `btree_transactions` in debugfs -
+`/sys/kernel/debug/bcachefs/<uuid>/btree_transactions`; other files there may
+also be relevant.
+
+Is something stuck?
+
+Check sysfs `dev-0/alloc_debug`: this shows various internal allocator state -
+perhaps the allocator is stuck?
+
+Something funny with rebalance/background data tasks?
+
+Check sysfs `internal/rebalance_work` and `internal/moving_ctxts`.
+
+All of this stuff could use reorganizing and expanding, of course.
+
+## Offline filesystem inspection
+
+The `bcachefs list` subcommand lists the contents of the btrees - extents,
+inodes, dirents, and more.
+
+The `bcachefs list_journal` subcommand lists the contents of the journal. This
+can be used to discover what operation caused an error reported by e.g. fsck,
+by searching for the transaction that last updated the key(s) in question.
+
+### Unrepairable filesystem debugging
+
+If there's an issue that fsck can't fix, use the `bcachefs dump` subcommand,
+and then [[magic wormhole|https://github.com/magic-wormhole/magic-wormhole]],
+to send your filesystem metadata to the developers.
+
+## For the developer
+
+Internally, bcachefs uses `printbufs` for formatting text in a generic and
+structured way, and we try to write `to_text()` functions for as many types as
+possible.
+
+This makes it much easier to write good error messages and to add new debug
+tools to sysfs/debugfs: when `to_text()` functions already exist for all the
+relevant types, most of the work is already done.
+
+Try to keep up with and extend this approach when working with the code.
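To illustrate the pattern, here is a minimal userspace sketch - not the kernel's
actual printbuf API; the `printbuf` struct and all names here are simplified
stand-ins. A `to_text()` function is just a formatter targeting a generic text
buffer, so one formatter can back an error message, a sysfs file, or debugfs:

    #include <stdarg.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the kernel printbuf: a bounded text buffer */
    struct printbuf {
        char   buf[256];
        size_t pos;
    };

    static void prt_printf(struct printbuf *out, const char *fmt, ...)
    {
        va_list args;

        va_start(args, fmt);
        if (out->pos < sizeof(out->buf))
            out->pos += vsnprintf(out->buf + out->pos,
                                  sizeof(out->buf) - out->pos, fmt, args);
        va_end(args);
    }

    /* An example type, with its to_text() function: */
    struct bucket_stats {
        unsigned long buckets_free;
        unsigned long buckets_dirty;
    };

    static void bucket_stats_to_text(struct printbuf *out,
                                     const struct bucket_stats *s)
    {
        prt_printf(out, "free:  %lu\n", s->buckets_free);
        prt_printf(out, "dirty: %lu\n", s->buckets_dirty);
    }

    int main(void)
    {
        struct bucket_stats s = { .buckets_free = 100, .buckets_dirty = 20 };
        struct printbuf out = { .pos = 0 };

        /* The same formatter serves error messages, sysfs and debugfs: */
        bucket_stats_to_text(&out, &s);
        fputs(out.buf, stdout);
        return 0;
    }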
diff --git a/Roadmap.mdwn b/Roadmap.mdwn
index 8106c9b..1e926e9 100644
--- a/Roadmap.mdwn
+++ b/Roadmap.mdwn
@@ -1,232 +1,82 @@
-# bcachefs status, roadmap:
+# Roadmap for future work:
 
 [[!toc levels=3 startlevel=2 ]]
 
-## Stabilization - status, work to be done
-
-### Current stability/robustness
-
-We recently had a spate of corruptions that users were hitting - at the end of
-it, some users had to wait for new repair code to be written, but (as far as I
-know) no filesystems or even significant amounts of data were lost - our check
-and repair code is getting to be really solid. The bugs themselves seem to have
-been resolved, except for (it appears) a bug in the journalling code that was
-causing journal entries to get lost - that one hasn't been reported in a while,
-but we'll still hunt for it some more.
-
-Users have been pushing it pretty hard.
-
-### Tooling
-
-The bcachefs kernel code is also included as a library in bcachefs-tools - 90%
-of the kernel code builds and runs in userspace. This is really useful - it
-makes new tooling easy to develop, and it also means we can test almost all of
-the codebase in userspace, with userspace debugging tools (ASAN and UBSAN in
-userspace catch more bugs than the kernel equivalents, and valgrind also catches
-things ASAN doesn't).
-
-We've got a tool for dumping filesystem metadata as qcow2 images - our normal
-workflow when a user has something go wrong with their filesystem is for them to
-dump the metadata; then, if necessary, I can write new repair code and test it
-on the dumped image before touching their filesystem.
-
-A fuse port has been started, but isn't usable yet. This is something we really
-want to have - if a user or customer is hitting a bug that we can't reproduce
-locally, switching to the fuse version to debug it in userspace is going to save
-our bacon someday.
-
-We also have tooling for examining filesystem metadata - there's userspace
-tooling for examining an offline filesystem, and metadata can also be viewed in
-debugfs on a mounted filesystem.
-
-One thing that's missing in the userspace bcachefs tool is access to what's
-published via sysfs when running in the kernel - this is a really important
-debugging tool. A good project for someone new to the code might be to embed an
-http server and implement a sysfs shim.
-
-### Feature stability
-
-Erasure coding isn't quite there yet - there are still oopses being reported,
-and existing tests aren't finding the bugs.
-
-Everything else is considered supported. There's still a bug with the refcounts
-in the reflink code that xfstests sporadically hits, but that's the current area
-of focus and should be closed out soon.
-
-### Test infrastructure
-
-We have [[ktest|https://evilpiepirate.org/git/ktest.git]], which is a major asset.
-It turns virtual machine testing into a simple commandline tool - all test
-output is on standard output, and ctrl-c kills the VM and releases all resources.
-It's designed for both interactive development (it tries to be as quick as
-possible from kernel build to when the tests start running) and automated
-testing (test pass/failure is reported as a return code, and it implements a
-watchdog in case tests hang).
-
-It provides easy access to ssh, kgdb, qemu's gdb interface, and more. In the
-past we've had it working with the kernel's gcov support for code coverage;
-getting that going again is high on the todo list.
-
-### Testing:
-
-All tests need to be passing. Tests right now are divided between xfstests and
-ktest - ktest is used as a wrapper for running xfstests in a virtual machine,
-and it also has a good number of bcachefs-specific tests.
-
-We may want to move the bcachefs tests from ktest to xfstests - having all our
-tests in the same place would be more convenient, and would thus help ensure
-that they're getting run.
-
-We're down to ~15 xfstests tests that aren't passing - all are triaged, none
-are particularly concerning. We need to get all of them passing and then start
-leaving test runs going 24/7 - when the test dashboard is all green, that makes
-it much easier to notice and jump on tests that fail even sporadically.
-
-The ktest tests haven't been getting run as much and are in a messier state - we
-need to get all of them passing and ensure that they're being run continuously
-(possibly moving them to xfstests).
-
-We don't yet have an automated mechanism for running xfstests with the full
-matrix of configurable options - we want to be testing with checksumming both on
-and off, data compression on and off, encryption on and off, small btree nodes
-(to stress greater tree heights), and more.
-
-### Test coverage
-
-We need to get code coverage analysis going again - this will definitely
-highlight tests that need to be written, and we also need to add error injection
-testing. bcache/bcachefs used to have error injection tests, but they used a
-nonstandard error injection framework - some of the error injection points still
-exist, though.
-
-## Performance:
-
-### Core btree:
-
-The core btree code is heavily optimized, and I don't expect much in the way of
-significant gains without novel data structures. Btree performance is
-particularly important in bcachefs because we do not shard btrees on a per-inode
-basis like most filesystems; instead we have one global extents btree, one
-global dirents btree, et cetera. Historically, this came about because bcachefs
-started as a block cache, where we needed a single index that would be fast
-enough to index an entire SSD worth of blocks, and for persistent data
-structures btrees have a number of advantages over hash tables (they can do
-extents, and when access patterns have locality they preserve that locality).
-Using singleton btrees is a major simplification when writing a filesystem, but
-it does mean that everything is leaning rather heavily on the performance and
-scalability of the btree implementation.
-
-Bcachefs b-trees are a hybrid compacting data structure - they share some
-aspects with COW btrees, but btree nodes internally are log structured. This
-means that btree nodes are big (default 256k), and have a very high fanout. We
-keep multiple sorted sets of keys within a btree node (up to 3),
-compacting/sorting when the last set, the one that updates go to, becomes too
-large.
-
-Lookups within a btree node use eytzinger layout search trees with summary keys
-that fit in 4 bytes. Eytzinger layout means descendants of any given node are
-adjacent in memory, meaning they can be effectively prefetched, and 4 byte nodes
-mean 16 of them fit on a cacheline, which means we can prefetch 4 levels ahead -
-this means we can pipeline effectively enough to get good lookup performance
-even when our memory accesses are not cached.
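To make the eytzinger idea concrete, here is a minimal, self-contained sketch of
the technique - generic C, not the actual bcachefs implementation, which layers
bkey packing and summary keys on top:

    #include <stddef.h>

    /*
     * Lookup in an eytzinger (BFS-order) array: node i's children live at
     * 2i and 2i+1, so all descendants of a node are contiguous in memory
     * and can be prefetched several levels ahead of the search.
     *
     * tree[] is 1-indexed (tree[0] unused); returns the index of the first
     * element >= search, or 0 if every element is smaller.
     */
    static size_t eytzinger1_lower_bound(const unsigned *tree, size_t nr,
                                         unsigned search)
    {
        size_t i = 1;

        while (i <= nr) {
            /* 16 4-byte nodes per cacheline -> prefetch 4 levels ahead
             * (a real implementation would bound this to the array): */
            __builtin_prefetch(&tree[i * 16]);
            i = 2 * i + (tree[i] < search);
        }

        /*
         * The low bits of i record our left/right turns; shifting off
         * the trailing ones, plus one zero, recovers the answer:
         */
        return i >> __builtin_ffsll(~(unsigned long long) i);
    }

The array is just the sorted keys permuted into BFS order; because a node's
descendants are contiguous, a single prefetch covers the next levels of the
search, which is what keeps the pipeline full on uncached lookups.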
-
-On my Ryzen 3900X, single threaded lookups across 16M keys go at 1.3M per
-second, and scale almost linearly - with 12 threads it's about 12M/second. We
-can do 16M random inserts at 670k per second single threaded; with 12 threads
-we achieve 2.4M random inserts per second.
-
-Btree locking is very well optimized. We have external btree iterators which
-make it easy to aggressively drop and retake locks, giving the filesystem as a
-whole very good latency.
-
-Minor scalability todo item: btree node splits/merges still take intent locks
-all the way up to the root; they should be checking how far up they need to go.
-This hasn't shown up yet because bcachefs btree nodes are big and splits/merges
-are infrequent.
-
-### IO path:
-
-IO paths: the read path looks to be in excellent shape - I see around 350k iops
-doing async 4k random reads, single threaded, on a somewhat pokey Xeon server.
-We're nowhere near where we should be on 4k random writes - I've been getting in
-the neighborhood of 70k iops lately, single threaded, and it's unclear where the
-bottleneck is - we don't seem to be cpu bound. Considering the btree performance,
-this shouldn't be a core architectural issue - work needs to be done to identify
-where we're losing performance.
-
-Filesystem metadata operation performance (e.g. fsmark): on most benchmarks, we
-seem to be approaching or sometimes matching XFS on performance. There are
-exceptions: multithreaded rm -rf of empty files is currently roughly 4x slower
-than XFS - we bottleneck on journal reclaim, which is flushing inodes from the
-btree key cache back to the inodes btree.
-
-## Scalability:
-
-There are still some places where we rely on walking/scanning metadata too much.
-
-### Allocator:
-
-For allocation purposes, we divide up the device into buckets (default 512k,
-currently supporting up to 8M). There is an in-memory array that represents
-these buckets, and the allocator periodically scans this array to fill up
-freelists - this is currently a pain point, and eliminating this scanning is a
-high priority item.
-Secondarily, we want to get rid of this in-memory array - buckets are
-primarily represented in the alloc btree. Almost all of the work to get rid of
-this has been done; fixing the allocator to not rescan for empty buckets is the
-last major item.
-
-### Copygc:
-
-Because allocation is bucket based, we're subject to internal fragmentation that
-we need copygc to deal with, and when copygc needs to run it has to scan the
-bucket arrays to determine which buckets to evacuate, and then it has to scan
-the extents and reflink btrees for extents to be moved.
-
-This is something we'd like to improve, but it is not a major pain point -
-being a filesystem, we're better positioned to avoid mixing unrelated writes,
-which tends to be the main cause of write amplification in SSDs. Improving this
-will require adding a backpointer index, which will necessarily add overhead to
-the write path - thus, I intend to defer this until after we've diagnosed our
-current lack of performance in the write path and worked to make it as fast as
-we think it can go. We may also want backpointers to be an optional feature.
-
-### Rebalance:
+## Rebalance work index:
 
 Various filesystem features may want data to be moved around or rewritten in
 the background: e.g. using a device as a writeback cache, or using the
 `background_compression` option - excluding copygc, these are all handled by
 the rebalance thread. Rebalance currently has to scan the entire extents and
 reflink
-btrees to find work items - and this is more of a pain point than copygc, as
-this is a normal filesystem feature. We should add a secondary index for
-extents that rebalance will need to move or rewrite - this will be a "pay if
-you're using it" feature, so the performance impact should be less of a
-concern than with copygc.
-
-Some thought and attention should be given to what else we might want to use
-rebalance for in the future.
-
-There are also current bug reports of rebalance reading data and not doing any
-work; while work is being done on this subsystem we should think about improving
-visibility as to what it's doing.
-
-### Disk space accounting:
-
-Currently, disk space accounting is primarily persisted in the journal (on clean
-shutdown, it's also stored in the superblock) - and each disk space counter is
-added to every journal entry. As the number of devices in a filesystem
-increases, this overhead grows, but where this will really become an issue is
-adding per-snapshot disk space accounting. We really need these counters to be
-stored in btree keys, but this will be a nontrivial change, as we use a
-complicated system of percpu counters for disk space accounting, with multiple
-sets of counters for each open journal entry - rolling them up just prior to
-each journal write. My plan is to extend the btree key cache so that it can
-support a similar mechanism.
-
-Also, we want disk space accounting to be broken out by compression type, for
-compressed data, and compressed/uncompressed totals to both be kept - right now
-there's no good way for users to find out the compression ratio they're getting.
-
-### Number of devices in a filesystem:
+btrees to find work items: in the near future, we'll be adding a
+`rebalance_work` index, which will just be a bitset btree indicating which
+extents need to have work done on them.
+
+The btree write buffer, which e.g. the backpointers code heavily relies upon,
+gives us the ability to maintain this index with minimal runtime overhead.
+
+We also have to start tracking the pending work in the extent itself, with a
+new extent field (e.g. compress with a given compression type, for the
+`background_compression` option, or move to a different target, for the
+`background_target` option).
+
+Most of the design and implementation for this is complete - there is still
+some work related to the triggers code remaining.
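A sketch of how such a bitset index could be maintained - hypothetical types and
function names, not the actual implementation: a hook on the extent update path
queues a cheap, write-buffered update whenever an extent's current state doesn't
match its background options, so rebalance never scans the extents btree to
find work:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical, simplified types - for illustration only: */
    struct bpos { uint64_t inode, offset; };

    struct extent {
        struct bpos pos;
        uint8_t     compression_type; /* how this extent is compressed now */
        uint8_t     target;           /* where this extent lives now */
    };

    struct bch_opts {
        uint8_t background_compression;
        uint8_t background_target;
    };

    /* Stand-ins for write-buffered updates to the rebalance_work btree: */
    static void rebalance_work_set(struct bpos pos)   { (void) pos; /* queue bit set */ }
    static void rebalance_work_clear(struct bpos pos) { (void) pos; /* queue deletion */ }

    static bool extent_needs_rebalance(const struct extent *e,
                                       const struct bch_opts *o)
    {
        return (o->background_compression &&
                e->compression_type != o->background_compression) ||
               (o->background_target && e->target != o->background_target);
    }

    /* Run from the extent update path, trigger-style, so the index never
     * has to be rebuilt by scanning: */
    static void update_rebalance_work(const struct extent *e,
                                      const struct bch_opts *o)
    {
        if (extent_needs_rebalance(e, o))
            rebalance_work_set(e->pos);
        else
            rebalance_work_clear(e->pos);
    }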
+
+## Compression
+
+We'd like to add multithreaded compression, for when a single thread is doing
+writes at a high enough rate to be CPU bound.
+
+We'd also like to add support for xz compression: this will be useful when
+doing a "squashfs-like" mode. The blocker is that the kernel only has support
+for xz decompression, not compression - hopefully we can get this added.
+
+## Scrub
+
+We're still missing scrub support. Scrub's job will be to walk all data in the
+filesystem and verify checksums, recovering bad data from a good copy if one
+exists and notifying the user if data is unrecoverable.
+
+Since we now have backpointers, we have the option of doing scrub in keyspace
+order or LBA order - LBA order will be much better for rotational devices.
+
+## Configurationless tiered storage
+
+We're starting to work on benchmarking individual devices at format time, and
+storing their performance characteristics in the superblock.
+
+This will allow for a number of useful things - including better tracking of
+drive health over time - but the real prize will be tiering that doesn't
+require explicit configuration: foreground writes will preferentially go to
+the fastest device(s), and in the background data will be moved from fast
+devices to slow devices, taking into account how hot or cold that data is.
+
+Even better, this will finally give us a way to manage more than two tiers, or
+performance classes, of devices.
+
+## Disk space accounting:
+
+Currently, disk space accounting is primarily persisted in the journal (on
+clean shutdown, it's also stored in the superblock), and each disk space
+counter is added to every journal entry.
+
+So currently there's a real limit on how many counters we can maintain, and
+we'll need to do something else before adding e.g. per-snapshot disk space
+accounting.
+
+We'll need to make disk space accounting keys in a btree, like all other
+metadata. Since disk space accounting was implemented we've gained the btree
+write buffer code, so this is now possible: on transaction commit, we will do
+write buffer updates with deltas for the counters being modified, and then the
+btree write buffer flush code will handle summing up the deltas and flushing
+the updates to the underlying btree.
+
+We also want to add more counters to the inode. Right now in the inode we only
+track sectors used as fallocate would see them - we should add counters that
+take into account replication and compression.
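A simplified userspace sketch of the delta approach - hypothetical types; the
real design sits on percpu counters and the btree write buffer. Commits append
cheap (counter, delta) entries, and the flush path sorts and sums them so each
counter costs one btree update per flush, no matter how many transactions
touched it:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct accounting_delta {
        uint64_t counter; /* which counter, e.g. (device, data type) */
        int64_t  d;       /* signed change from one transaction commit */
    };

    static int delta_cmp(const void *l, const void *r)
    {
        const struct accounting_delta *a = l, *b = r;

        return a->counter < b->counter ? -1 : a->counter > b->counter;
    }

    /*
     * Flush: sort buffered deltas so duplicates are adjacent, sum each
     * run, and emit one update per counter to the accounting btree:
     */
    static void flush_accounting(struct accounting_delta *buf, size_t nr,
                                 void (*apply)(uint64_t counter, int64_t sum))
    {
        qsort(buf, nr, sizeof(*buf), delta_cmp);

        for (size_t i = 0; i < nr; ) {
            uint64_t counter = buf[i].counter;
            int64_t sum = 0;

            while (i < nr && buf[i].counter == counter)
                sum += buf[i++].d;

            apply(counter, sum); /* one btree update per counter */
        }
    }

    static void apply_print(uint64_t counter, int64_t sum)
    {
        printf("counter %llu += %lld\n",
               (unsigned long long) counter, (long long) sum);
    }

    int main(void)
    {
        struct accounting_delta buf[] = {
            { .counter = 1, .d =  8 },
            { .counter = 2, .d = 16 },
            { .counter = 1, .d = -8 },
        };

        flush_accounting(buf, 3, apply_print);
        return 0;
    }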
+
+## Number of devices in a filesystem:
 
 Currently, we're limited to a maximum of 64 devices in a filesystem. It would be
 desirable and not much work to lift that limit - it's not encoded in any on disk
@@ -265,98 +115,97 @@ which means that the failure of _any_ three devices would lead to loss of data.
 
 A feature that addresses these issues still needs to be designed.
 
-### Fsck
+### Support for sub-block size writes
 
-Fsck performance is good, but for supporting huge filesystems we are going to
-need online fsck.
+Devices with larger than 4k blocksize are coming, and this is going to result
+in more wasted disk space due to internal fragmentation.
 
-Fsck in bcachefs is divided up into two main components: there is `btree_gc`,
-which walks the btree and regenerates allocation information (and also checks
-btree topology), and there is fsck proper, which checks higher level filesystem
-invariants (e.g. extents/dirents/xattrs must belong to an inode, and filesystem
-connectivity).
+Support for writes at smaller than blocksize granularity would be a useful
+addition. The read side is easy: reads will require bouncing, which we already
+support. Writes are trickier - there are a few options:
 
-The `btree_gc` subsystem is so named because originally (when bcachefs was
-bcache) it not only did not have persistent allocation information, it also
-didn't keep bucket sector counts up to date when adding/removing extents from
-the btree - it relied entirely on periodically rescanning the btree.
+ * Buffer up writes when we don't have full blocks to write? Highly
+   problematic, not going to do this.
+ * Read modify write? Not an option for raw flash, would prefer it to not be
+   our only option
+ * Do data journalling when we don't have a full block to write? Possible
+   solution, we want data journalling anyways
 
-This capability was kept for a very long time, partly as an aid to testing and
-debugging while up-to-date, then persistent, then transactionally persistent
-allocation information was being developed. The capability to run `btree_gc`
-at runtime has been disabled for roughly the past year or so - it was a bit too
-much to deal with while getting everything else stabilized - but the code has
-been retained, and there haven't been any real architectural changes that would
-preclude making it a supported feature again - this is something that we'd like
-to do, after bcachefs is upstreamed and when more manpower is available.
+### Journal on NVRAM devices
 
-Additionally, fsck itself is part of the kernel bcachefs codebase - the
-userspace fsck tool uses a shim layer to run the kernel bcachefs codebase in
-userspace, and fsck uses the normal btree iterators interface that the rest of
-the filesystem uses. The in-kernel fsck code can be run at mount time by
-mounting with the `-o fsck` option - this is essentially what the userspace
-fsck tool does, just running the same code in userspace.
+NVRAM devices are currently a specialized thing, but they are always promising
+to become more widely available. We could make use of them for the journal;
+they'd let us cheaply add full data journalling (useful for the previous
+feature) and greatly reduce fsync overhead.
 
-This means that there's nothing preventing us from running the current fsck
-implementation at runtime, while the filesystem is in use, and there's actually
-not much work that needs to be done for this to work properly and reliably: we
-need to add locking for the parts of fsck that are checking nonlocal invariants
-and can't rely solely on btree iterators for locking.
+### Online fsck
 
-Since backpointers from inodes to dirents were recently added, the checking of
-directory connectivity is about to become drastically simpler and shouldn't need
-any external data structures or locking with the rest of the filesystem. The
-only fsck passes that appear to require new locking are:
+For supporting huge filesystems, and competing with XFS, we will eventually
+need online fsck.
 
-* Counting of extents/dirents to validate `i_sectors` and subdirectory counts.
+This will be easier in bcachefs than it was in XFS, since our fsck
+implementation is part of the main filesystem codebase and uses the normal
+btree transaction API that's available at runtime.
 
-* Checking of `i_nlink` for inodes with hardlinks - but since inode backpointers
-  have now been added, this pass now only needs to consider inodes that actually
-  have hardlinks.
+That means that the current fsck implementation will eventually also be the
+online fsck implementation: the process will mostly be a matter of auditing the
+existing fsck codebase for locking issues, and adding additional locking w.r.t.
+the rest of the filesystem where needed.
 
-So that's cool.
+## ZNS, SMR device support:
 
-### Deleted inodes after unclean shutdown
+[[ZNS|https://zonedstorage.io/docs/introduction/zns]] is a technology for
+cutting out the normal flash translation layer (FTL), and instead exposing an
+interface closer to the raw flash.
 
-Currently, we have to scan the inodes btree after an unclean shutdown to find
-unlinked inodes that need to be deleted, and on large filesystems this is the
-slowest part of the mount process (after an unclean shutdown).
+SSDs have large erase units, much larger than the write blocksize, so the FTL
+has to implement a mapping layer as well as garbage collection to expose a
+normal, random write block device interface. Since a copy-on-write filesystem
+needs both of these anyways, adding ZNS support to bcachefs will eliminate a
+significant amount of overhead and improve performance, especially tail latency.
 
-We need to add a hidden directory for unlinked inodes.
+[[SMR|https://en.wikipedia.org/wiki/Shingled_magnetic_recording]] is another
+technology for rotational hard drives, completely different from ZNS in
+implementation but quite similar in how it is exposed and used. SMR enables
+higher density, and supporting it directly in the filesystem eliminates much of
+the performance penalty.
 
-### Snapshots:
+bcachefs allocation is bucket based: we allocate an entire bucket, write to it
+once, and then never rewrite it until the entire bucket is empty and ready to
+be reused. This maps nicely to both ZNS and SMR zones, which makes supporting
+both of these directly particularly attractive.
 
-I expect bcachefs to scale well into the millions of snapshots. There will need
-to be some additional work to make this happen, but it shouldn't be much.
+## Squashfs mode
 
-* We have a function, `bch2_snapshot_is_ancestor()`, that needs to be fast -
-  it's called by the btree iterator code (in `BTREE_ITER_FILTER_SNAPSHOTS` mode)
-  for every key it looks at. The current implementation is linear in the depth
-  of the snapshot tree. We'll probably need to improve this - perhaps by
-  caching, for each snapshot ID, a bit vector of ancestor IDs (see the sketch
-  below).
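A sketch of that bit-vector idea - hypothetical types, the real function being
`bch2_snapshot_is_ancestor()`: the current approach walks parent pointers, while
the cached version answers with a single bit test, trading memory for a bitmap
per snapshot ID:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical in-memory snapshot table - for illustration only: */
    struct snapshot_t {
        uint32_t  parent;    /* 0 at the root of the snapshot tree */
        uint64_t *ancestors; /* cached bitmap: bit n set if n is an ancestor */
    };

    /* Today's approach: linear in the depth of the snapshot tree: */
    static bool is_ancestor_walk(const struct snapshot_t *tbl,
                                 uint32_t id, uint32_t ancestor)
    {
        while (id && id != ancestor)
            id = tbl[id].parent;
        return id == ancestor;
    }

    /* With a cached bit vector of ancestor IDs: one bit test per query: */
    static bool is_ancestor_cached(const struct snapshot_t *tbl,
                                   uint32_t id, uint32_t ancestor)
    {
        return tbl[id].ancestors[ancestor / 64] & (1ULL << (ancestor % 64));
    }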
+Right now we have dedicated filesystems, like
+[[squashfs|https://en.wikipedia.org/wiki/SquashFS]], for building and reading
+from a highly compressed but read-only filesystem.
 
-* Too many overwrites in different snapshots at the same part of the keyspace
-  will slow down lookups - we haven't seen this come up yet, but it certainly
-  will at some point.
+bcachefs metadata is highly space efficient (e.g. [[packed
+bkeys|https://bcachefs.org/Architecture/#Bkey_packing]]), and bcachefs could
+probably serve as a direct replacement with a relatively small amount of work
+to generate a filesystem image in the most compact way. The same filesystem
+image could later be mounted read-write, if given more space.
 
-  When this happens, it won't be for most files in the filesystem, in anything
-  like typical usage - most files are created, written to once, and then
-  overwritten with a rename; it will be for a subset of files that are long
-  lived and continuously being overwritten - i.e. databases.
+Specifying a directory tree to include when creating a new filesystem is
+already a requested feature - other filesystems have this as the `-c` option
+to mkfs.
 
-  To deal with this we'll need to implement per-inode counters so that we can
-  detect this situation - and then when we detect too many overwrites, we can
-  allocate a new inode number internally, and move all keys that aren't visible
-  in the main subvolume to that inode number.
+## Container filesystem mode
 
-## Features:
+Container filesystems are another special purpose type of filesystem we could
+potentially fold into bcachefs - e.g.
+[[puzzlefs|https://github.com/project-machine/puzzlefs]].
 
-### SMR device support:
+These have a number of features, e.g. verity support for cryptographically
+verifying a filesystem image (which we can do via our existing cryptographic
+MAC support) - but the interesting feature here is support for referring to
+data (extents, entire files) in a separate location - in this case, the host
+filesystem.
 
-This is something I'd like to see added - bcachefs buckets map nicely to SMR
-zones, we just need support for larger buckets (the on disk format now supports
-them, we'll need to add an alternate in-memory representation for large
-buckets), and plumbing to read the zone pointers and use them in the allocator.
+### Cloud storage management
 
-We already have copygc, so everything should just work.
+The previous feature is potentially more widely useful: allowing inodes and
+extents to refer to arbitrary callbacks for fetching/storing data would be the
+basis for managing data in e.g. cloud storage, and in combination with local
+storage in a seamless tiered system.
diff --git a/Todo.mdwn b/Wishlist.mdwn
index 06fbdb5..44a48a0 100644
--- a/Todo.mdwn
+++ b/Wishlist.mdwn
@@ -1,72 +1,4 @@
-# TODO
-
-## Kernel developers
-
-### P0 features
-
- * Disk space accounting rework: now that we've got the btree write buffer, we
-   can store disk space accounting in btree keys (instead of the current hacky
-   mechanism where it's only stored in the journal).
-
-   This will make it possible to store many more counters - meaning we'll be
-   able to store per-snapshot-id accounting.
-
-   We also need to add separate counters for compressed/uncompressed size,
-   broken out by compression type.
-
-   It'd also be really nice to add more counters to the inode for
-   compressed/uncompressed size. Right now the inode just has `bi_sectors`,
-   which is the on-disk size for fallocate purposes, i.e. discounting
-   replication. We could add something like
-
-       bi_sectors_ondisk_compressed
-       bi_sectors_ondisk_uncompressed
-
-### P1 features
-
-### P2 features
-
- * When we're using compression, we end up wasting a fair amount of space on
-   internal fragmentation, because compressed extents get rounded up to the
-   filesystem block size when they're written - usually 4k. It'd be really nice
-   if we could pack them in more efficiently - probably at 512 byte sector
-   granularity.
-
-   On the read side this is no big deal to support - we have to bounce
-   compressed extents anyways. The write side is the annoying part. The options
-   are:
-   * Buffer up writes when we don't have full blocks to write? Highly
-     problematic, not going to do this.
-   * Read modify write? Not an option for raw flash, would prefer it to not be
-     our only option
-   * Do data journalling when we don't have a full block to write? Possible
-     solution, we want data journalling anyways
-
- * Full data journalling - we're definitely going to want this for when the
-   journal is on an NVRAM device (we also need to implement external journalling
-   (easy), and direct journal on NVRAM support (what's involved here?)).
-
-   Would be good to get a simple implementation done and tested so we know what
-   the on disk format is going to be.
-
-## Developers
-
- * End user documentation needs a lot of work - complete man pages, etc.
-
- * bcachefs-tools needs some fleshing out in the --help department
-
-## Users
-
- * Benchmark bcachefs performance on different configurations:
-   (Note that these tests will have to be repeated in the future.)
-   * SSD
-   * HDD
-   * SSD + HDD
-
- * Update the website / documentation
-   * Ask questions (so we can update the documentation / website).
-
-## Wishlist
+# Wishlist items
 
 * "Seed devices", hard-readonly devices that are CoWed from on write (btrfs has
   this; useful for base devices for virtualization, among other things).
diff --git a/index.mdwn b/index.mdwn
--- a/index.mdwn
+++ b/index.mdwn
@@ -1,4 +1,3 @@
-
 # "The COW filesystem for Linux that won't eat your data".
 
 Bcachefs is an advanced new filesystem for Linux, with an emphasis on
@@ -17,8 +16,28 @@ from a modern filesystem.
 * Scalable - has been tested to 100+ TB, expected to scale far higher (testers
   wanted!)
 * High performance, low tail latency
 * Already working and stable, with a small community of users
+* [[Use Cases|UseCases]]
+
+## Documentation
+
+* [[Getting Started|GettingStarted]]
+* [[User manual|bcachefs-principles-of-operation.pdf]]
+* [[FAQ]]
+
+## Debugging tools
 
-## Philosophy
+bcachefs has extensive debugging tools and facilities for inspecting the state
+of the system while running: [[Debugging]].
+
+## Development tools
+
+bcachefs development is done with
+[[ktest|https://evilpiepirate.org/git/ktest.git]], which is used for both
+interactive and automated testing, with a large test suite.
+
+[[Test dashboard|https://evilpiepirate.org/~testdashboard/ci]]
+
+## Philosophy, vision
 
 We prioritize robustness and reliability over features: we make every effort to
 ensure you won't lose data. It's building on top of a codebase with a pedigree
@@ -28,17 +47,76 @@ keeping things stable, and aggressively fixing design issues as they are found;
 the bcachefs codebase is considerably more robust and mature than upstream
 bcache.
 
-## Documentation
+The long term goal of bcachefs is to produce a truly general purpose
+filesystem:
+ - scalable and reliable for the high end
+ - simple and easy to use
+ - an extensible and modular platform for new feature development, based on a
+   core that is a general purpose database, including potentially distributed
+   storage
 
-* [[Getting Started|GettingStarted]]
-* [[User manual|bcachefs-principles-of-operation.pdf]]
-* [[FAQ]]
+## Some technical high points
+
+Filesystems have conventionally been implemented with a great deal of special
+purpose, ad-hoc data structures; a filesystem-as-a-database is a rarer beast:
+
+### btree: high performance, low latency
+
+The core of bcachefs is a high performance, low latency b+ tree.
+[[Wikipedia|https://en.wikipedia.org/wiki/Bcachefs]] covers some of the
+highlights - large, log structured btree nodes, which enable some novel
+performance optimizations. As a result, the bcachefs b+ tree is one of the
+fastest production ordered key value stores around:
+[[benchmarks|BtreePerformance]].
+
+Tail latency has also historically been a difficult area for filesystems, due
+largely to locking and metadata write ordering dependencies. The database
+approach allows bcachefs to shine here as well: it gives us a unified way to
+handle locking for all on disk state, and to introduce patterns and techniques
+for avoiding the aforementioned dependencies - we can easily avoid holding
+btree locks while doing blocking operations - and as a result benchmarks show
+write performance to be more consistent than even XFS.
+
+#### Sophisticated transaction model
+
+The main interface between the database layer and the filesystem layer
+provides:
+ - Transactions: updates are queued up, and are visible to code running within
+   the transaction, but not the rest of the system until a successful
+   transaction commit
+ - Deadlock avoidance: high level filesystem code need not concern itself with
+   lock ordering
+ - Sophisticated iterators
+ - Memoized btree lookups, for efficient transaction restart handling, as well
+   as greatly simplifying high level filesystem code, which need not pass
+   iterators around to avoid unnecessary lookups
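As a sketch of the shape this gives high level filesystem code - all names here
are illustrative stand-ins, not the real bcachefs API: operations run inside a
retry loop, and a would-be deadlock surfaces as a restart rather than blocking:

    /* All names here are hypothetical, for illustration only. */
    enum trans_ret { TRANS_OK, TRANS_RESTART, TRANS_ERR };

    struct trans { int dummy; /* iterators, queued updates, held locks... */ };

    /* Reset queued updates and drop locks, keeping memoized lookups: */
    static void trans_begin(struct trans *t) { (void) t; }

    /* All or nothing: either every queued update becomes visible, or none: */
    static enum trans_ret trans_commit(struct trans *t)
    {
        (void) t;
        return TRANS_OK;
    }

    /* A filesystem operation: lookups plus queued updates, no locking logic: */
    static enum trans_ret create_dirent_and_inode(struct trans *t)
    {
        (void) t;
        return TRANS_OK;
    }

    int do_update(struct trans *t)
    {
        enum trans_ret ret;

        do {
            /*
             * On a would-be deadlock, the btree layer returns
             * TRANS_RESTART instead of blocking; we retry from the
             * top, and memoized lookups make the retry cheap:
             */
            trans_begin(t);
            ret = create_dirent_and_inode(t);
            if (ret == TRANS_OK)
                ret = trans_commit(t);
        } while (ret == TRANS_RESTART);

        return ret == TRANS_OK ? 0 : -1;
    }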
+
+#### Triggers
+
+Like other database systems, bcachefs-the-database provides triggers: hooks run
+when keys enter or leave the btree - this is used for e.g. disk space
+accounting.
+
+Coupled with the btree write buffer code, this gets us highly efficient
+backpointers (for copygc), and, in the future, an efficient way to maintain an
+index-by-hash for data deduplication.
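A hedged illustration of the trigger concept, with hypothetical, simplified
types: a trigger sees the old and new versions of a key on every update, so
maintaining derived state - here a disk space counter - takes no explicit
action from filesystem code:

    #include <stdint.h>

    /* Hypothetical simplified key: just a size, in sectors: */
    struct key { uint64_t sectors; };

    /*
     * A trigger runs on every btree update with the old and new versions
     * of a key (NULL old = insert, NULL new = delete), so derived state
     * stays consistent no matter which code path did the update:
     */
    static void extent_trigger(const struct key *old, const struct key *new,
                               int64_t *sectors_used)
    {
        *sectors_used += (int64_t) (new ? new->sectors : 0) -
                         (int64_t) (old ? old->sectors : 0);
    }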
+
+### Unified codebase
+
+The entire bcachefs codebase can be built and used either inside the kernel or
+in userspace - notably, fsck is not a from-scratch implementation, it's just a
+small module in the larger bcachefs codebase.
+
+### Rust
+
+We've got some initial work done on transitioning to Rust, with plans for much
+more: here's an example of walking the btree, from Rust:
+[[cmd_list|https://evilpiepirate.org/git/bcachefs-tools.git/tree/rust-src/src/cmd_list.rs]]
 
 ## Contact and support
 
 Developing a filesystem is also not cheap, quick, or easy; we need funding!
 Please chip in on [[Patreon|https://www.patreon.com/bcachefs]]
 
+We're also now offering contracts for support and feature development -
+[[email|kent.overstreet@gmail.com]] for more info. Check the
+[[roadmap|Roadmap]] for ideas on things you might like to support.
+
 Join us in the bcache [[IRC|Irc]] channel, we have a small group of bcachefs
 users and testers there: #bcache on OFTC (irc.oftc.net).
diff --git a/sidebar.mdwn b/sidebar.mdwn
index fe67d26..e94c7fe 100644
--- a/sidebar.mdwn
+++ b/sidebar.mdwn
@@ -9,7 +9,7 @@
   * [[Allocator]]
   * [[Fsck]]
   * [[Roadmap]]
-  * [[Todo]]
+  * [[Wishlist]]
   * [[TestServerSetup]]
   * [[StablePages]]
   * [[October 2022 talk for Redhat|bcachefs_talk_2022_10.mpv]]