# Bcachefs

Bcachefs is an advanced new filesystem for Linux, with an emphasis on
reliability and robustness: it's the COW filesystem that won't lose your data.

It has a long list of features, completed or in progress:

* Copy on write (COW) - like zfs or btrfs
* Full data and metadata checksumming
* Multiple devices, including replication and other types of RAID
* Caching
* Compression
* Encryption
* Snapshots
* Scalable - has been tested to 50+ TB, will eventually scale far higher
* Already working and stable, with a small community of users

We prioritize robustness and reliability over features and hype: we make every
effort to ensure you won't lose data. Bcachefs builds on a codebase with a
pedigree: bcache already has a reasonably good track record for reliability
(particularly considering how young upstream bcache is, in terms of engineer
man-years). Starting from there, bcachefs development has prioritized
incremental development, keeping things stable, and aggressively fixing
design issues as they are found; the bcachefs codebase is considerably more
robust and mature than upstream bcache.

Developing a filesystem is also not cheap or quick or easy; we need funding!
Please chip in on [[Patreon|https://www.patreon.com/bcachefs]] - the Patreon
page also has more information on the motivation for bcachefs and the state of
Linux filesystems, as well as some bcachefs status updates and information on
development.

If you don't want to use Patreon, I'm also happy to take donations via PayPal:
kent.overstreet@gmail.com.

Join us in the bcache IRC channel; we have a small group of bcachefs users and
testers there: #bcache on OFTC (irc.oftc.net).

## Getting started

Bcachefs is not yet upstream - you'll have to build a kernel to use it. 

First, check out the bcachefs kernel and tools repositories:

    git clone https://evilpiepirate.org/git/bcachefs.git
    git clone https://evilpiepirate.org/git/bcachefs-tools.git

Build and install as usual - make sure you enable `CONFIG_BCACHE_FS`. Then, to
format and mount a single device with the default options, run:

    bcachefs format /dev/sda1
    mount -t bcachefs /dev/sda1 /mnt

See `bcachefs format --help` for more options.
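
Before mounting, it can be useful to confirm that the running kernel actually
has bcachefs registered. The following is a minimal sketch, not part of the
bcachefs tools: the helper name `fs_supported` is made up for illustration, and
it assumes a Linux system where `/proc/filesystems` lists the registered
filesystem types (if bcachefs was built as a module, it only shows up there
once the module is loaded).

```shell
# fs_supported NAME [FILE]
# Hypothetical helper: succeeds if NAME appears as a filesystem type in FILE
# (default: /proc/filesystems). The type name is the last field on each line;
# lines for virtual filesystems carry a leading "nodev" field.
fs_supported() {
    awk -v fs="$1" '$NF == fs { found = 1 } END { exit !found }' \
        "${2:-/proc/filesystems}"
}

if fs_supported bcachefs; then
    echo "bcachefs is available in the running kernel"
else
    echo "bcachefs not found - check that you booted the kernel you built"
fi
```

The check only reads kernel state; it does not format or mount anything.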

## Documentation

End user documentation is currently fairly minimal; this would be a very helpful
area for anyone who wishes to contribute - I would like the bcache man page in
the bcachefs-tools repository to be rewritten and expanded.

## Status

Bcachefs can currently be considered beta quality. It has a small pool of
outside users and has been stable for quite some time now; there's no reason
to expect issues as long as you stick to the currently supported feature set.
However, given that it's still under active development, backups are a good idea.
It's been passing all xfstests for well over a year.

Performance is generally quite good - typically faster than btrfs, and not far
behind xfs/ext4. On metadata intensive benchmarks, it's often considerably
faster than xfs/ext4/btrfs.

Normal POSIX filesystem functionality is all finished - if you're using bcachefs
as a replacement for ext4 on a desktop, you shouldn't find anything missing. For
servers, NFS export support is still missing (but coming soon) and we don't yet
support quotas (probably further off).

Up until bcachefs goes upstream I reserve the right to change the on disk format
if necessary, but I'm not expecting any more incompatible disk format changes.

### Feature status

 - Full data checksumming

   Fully supported and enabled by default. We do need to implement scrubbing,
   once we've got replication and can take advantage of it.

 - Compression

   Not _quite_ finished - it's safe to enable, but there's some work left
   related to copy GC before we can enable free space accounting based on
   compressed size: right now, enabling compression won't actually let you store
   any more data in your filesystem than if the data were uncompressed.

 - Tiering/writeback caching

   Bcachefs allows you to assign devices to different tiers - the faster tier
   will effectively be used as a writeback cache for the slower tier, and
   metadata will be pinned in the faster tier.

   Basic tiering functionality works, but it's not (yet) as configurable as
   bcache's caching (e.g. you can't specify writethrough caching).

 - Replication

   All the core functionality is complete, and it's getting close to usable: you
   can create a multi device filesystem with replication, and then while the
   filesystem is in use take one device offline without any loss of
   availability.

 - [[Encryption]]

   Whole filesystem AEAD style encryption (with ChaCha20 and Poly1305) is done
   and merged. I would suggest not relying on it for anything critical until the
   code has seen more outside review, though.

 - Snapshots

   Snapshot implementation has been started, but snapshots are by far the most
   complex of the remaining features to implement - it's going to be quite
   a while before I can dedicate enough time to finishing them, but I'm very much
   looking forward to showing off what it'll be able to do.

### Known issues/caveats

 - Mount time

   We currently walk all metadata at mount time (multiple times, in fact) - on
   flash this shouldn't even be noticeable unless your filesystem is very large,
   but on rotating disks expect mount times to be slow.

   This will be addressed in the future - mount times will likely be the next
   big push after the next big batch of on disk format changes.

## Todo list

### Current priorities:

 * Replication 

 * Compression is almost done: it's quite thoroughly tested; the only remaining
   issue is a problem with copygc fragmenting existing compressed extents that
   only breaks accounting.

 * NFS export support is almost done: implementing i_generation correctly
   required some new transaction machinery, but that's mostly done. What's left
   is implementing a new kind of reservation of journal space for the new, long
   running transactions.

### Other wishlist items:

 * When we're using compression, we end up wasting a fair amount of space on
   internal fragmentation because compressed extents get rounded up to the
   filesystem block size when they're written - usually 4k. It'd be really nice
   if we could pack them in more efficiently - probably 512 byte sector
   granularity.

   On the read side this is no big deal to support - we have to bounce
   compressed extents anyways. The write side is the annoying part. The options
   are:
    * Buffer up writes when we don't have full blocks to write? Highly
      problematic, not going to do this.
    * Read modify write? Not an option for raw flash, and we would prefer it
      not to be our only option.
    * Do data journalling when we don't have a full block to write? Possible
      solution, since we want data journalling anyways.

 * Inline extents - good for space efficiency for both small files, and
   compression when extents happen to compress particularly well.

 * Full data journalling - we're definitely going to want this for when the
   journal is on an NVRAM device (also need to implement external journalling
   (easy), and direct journal on NVRAM support (what's involved here?)).

   Would be good to get a simple implementation done and tested so we know what
   the on disk format is going to be.
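
To put a number on the internal fragmentation mentioned in the compression
wishlist item above, here is a small worked example. It is only an
illustration: the 5000-byte compressed extent is a made-up value, and
`round_up` is a hypothetical helper, not anything from the bcachefs code. It
compares rounding up to a 4k block with rounding up to a 512 byte sector.

```shell
# round_up SIZE GRANULARITY
# Prints the smallest multiple of GRANULARITY that is >= SIZE.
round_up() {
    echo $(( ($1 + $2 - 1) / $2 * $2 ))
}

# Hypothetical compressed extent size in bytes (illustrative only).
compressed=5000

for gran in 4096 512; do
    on_disk=$(round_up "$compressed" "$gran")
    echo "granularity $gran: $on_disk bytes on disk, $(( on_disk - compressed )) wasted"
done
```

For this example extent, 4k granularity occupies 8192 bytes (3192 wasted),
while 512 byte granularity occupies 5120 bytes (120 wasted).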