
# TODO

## Kernel developers

### Current priorities

 * Replication

 * Compression is almost done: it's quite thoroughly tested, and the only
   remaining issue is a problem with copygc fragmenting existing compressed
   extents, which only breaks accounting.

 * NFS export support is almost done: implementing i_generation correctly
   required some new transaction machinery, but that's mostly done. What's left
   is implementing a new kind of reservation of journal space for the new, long
   running transactions.

 * Allocation information (currently just bucket generation numbers & priority
   numbers, for LRU caching) needs to be moved into a btree, and we need to
   start persisting actual allocation information so we don't have to walk all
   extents at mount time.

   Just moving the existing prios/gens to a btree will be a significant
   improvement. Besides getting us incrementally closer to persisting full
   allocation information, it replaces a rather hacky mechanism dating from the
   early days of bcache, whose somewhat fragile design was recently the source
   of an annoying bug. It will also be a performance improvement, since it gets
   rid of the last source of forced journal flushes.

### Other

 * When we're using compression, we end up wasting a fair amount of space on
   internal fragmentation because compressed extents get rounded up to the
   filesystem block size when they're written - usually 4k. It'd be really nice
   if we could pack them in more efficiently - probably 512 byte sector
   granularity.

   On the read side this is no big deal to support - we have to bounce
   compressed extents anyways. The write side is the annoying part. The options
   are:
    * Buffer up writes when we don't have full blocks to write? Highly
      problematic, not going to do this.
    * Read-modify-write? Not an option for raw flash, and we would prefer it
      not be our only option.
    * Do data journalling when we don't have a full block to write? Possible
      solution, and we want data journalling anyways.

 * Inline extents - good for space efficiency, both for small files and for
   extents that happen to compress particularly well.

 * Full data journalling - we're definitely going to want this for when the
   journal is on an NVRAM device (also need to implement external journalling
   (easy), and direct journal on NVRAM support (what's involved here?)).

   Would be good to get a simple implementation done and tested so we know what
   the on disk format is going to be.
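
To make the internal fragmentation cost from the compression item above concrete, here is a minimal arithmetic sketch. The extent sizes are made-up illustrative values, not measurements, and `round_up` / `wasted_bytes` are hypothetical helper names; the sketch only compares rounding compressed extents up to 4k blocks against rounding them up to 512 byte sectors.

```python
def round_up(size: int, granularity: int) -> int:
    """Round size up to the next multiple of granularity."""
    return -(-size // granularity) * granularity


def wasted_bytes(compressed_sizes, granularity: int) -> int:
    """Total padding added when every extent is rounded up to granularity."""
    return sum(round_up(s, granularity) - s for s in compressed_sizes)


# Hypothetical compressed extent sizes in bytes (illustrative only).
sizes = [1300, 2700, 5000, 9100]

waste_4k = wasted_bytes(sizes, 4096)   # padding with 4k block granularity
waste_512 = wasted_bytes(sizes, 512)   # padding with 512 byte sector granularity

print(f"4k blocks:    {waste_4k} bytes of padding")
print(f"512b sectors: {waste_512} bytes of padding")
```

For these example sizes the padding drops from 10572 bytes to 844 bytes; real savings depend entirely on the actual distribution of compressed extent sizes.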

### Low priority

 * NOCOW

### Optional
(Will not be implemented by the lead developer.)

 * It is possible to add bcache functionality to bcachefs.
   If someone wishes to implement this, it would be an acceptable addition.
   However, using bcachefs as a filesystem should still be the better option.

## Developers

 * End user documentation needs a lot of work - complete man pages, etc.

 * bcachefs-tools needs some fleshing out in the --help department
 * Write a tool to benchmark tail-latency.
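
As a starting point for the tail-latency tool, here is a rough sketch of one possible approach: time small synchronous writes and report high percentiles. Everything here is an assumption for illustration (the function names, the 4k write size, the use of `fsync` per write, and the nearest-rank percentile); a real tool would want configurable workloads and far more careful measurement.

```python
import math
import os
import tempfile
import time


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sequence of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_write_latencies(directory, count=200, size=4096):
    """Time `count` write+fsync pairs of `size` bytes to a temp file in `directory`."""
    payload = b"\0" * size
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write to stable storage before stopping the clock
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies


if __name__ == "__main__":
    # Point this at a directory on the filesystem under test; the system
    # temp directory is only a placeholder.
    samples = measure_write_latencies(tempfile.gettempdir(), count=50)
    for pct in (50, 99, 100):
        print(f"p{pct}: {percentile(samples, pct) * 1000:.3f} ms")
```

Reporting the median alongside p99 and the maximum makes it easier to see whether the tail is an outlier or part of a broader slowdown.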

## Users

 * Benchmark bcachefs performance on different configurations:
   (Note that these tests will have to be repeated in the future.)
   * SSD
   * HDD
   * SSD + HDD
 
 * Update the website / Documentation
 * Ask questions (so that we can update the documentation / website).

## Wishlist

 * "Seed devices", hard-readonly devices that are CoWed from on write (btrfs
   has this; useful for base devices for virtualization, among other things).
 * Nonce-misuse-resistant authenticated encryption, such as [AES-SIV][AESSIV]
   or HS1-SIV (Closes potential hole regarding nonce reuse and "external"
   snapshots, as might happen to VMs or systems with externally-managed storage
   like iSCSI).
 * Some form of "secure delete" functionality. (However, see [this LWN article][secdel]
   regarding implementation strategies and pitfalls).
 * A simplified userspace API with no hierarchy, only blobs identified by
   unique integer keys (eternaleye thinks this might be useful for
   object-capability systems, such as [Robigalia][robigalia]).
 * An API like the above, but supporting multiple streams per blob, possibly
   with string identifiers (needs further examination, intent is to match the
   [needs of CephFS for OSD backends][cephback]).
 * More advanced caching algorithms; one potentially-relevant paper is
   [Pannier: A Container-based Flash Cache for Compound Objects][pannier].
 * "Asymmetrical" compression algorithms, that support only decompression (XZ
   is a nice candidate here, and would be a very good match for some seed
   device use cases).
 * RAID-6 with parity 3 or greater - could potentially use Andrea Mazzoleni's
   [technique][triple-parity] for generating Cauchy matrices compatible with
   Linux's current RAID-5 and RAID-6 formats, providing a clean upgrade path.
 * "Inline" forward error correction, possibly using a fountain code like
   [RaptorQ][RFC6330].
 * Support Trusted/Encrypted kernel keyring keys, in order to take advantage
   of TPMs.
 * Support for multiple key slots.
   * LUKS2 has a new [keyslot system][LUKS2-keyslots] that better supports
     two-factor auth and other external keying mechanisms.
 * Ponder the ramifications of (and safe defaults for) compression in the
   presence of encryption.
 * Swap file support.
 * Support (a subset of?) the Ext[234] attributes denoting special behaviors:
   * `+c` for compressed files
   * `+C` for disabling copy-on-write
   * `+e` for extent-based storage (always set? btrfs doesn't set it...)
   * `+E` for _displaying_ that a file is encrypted
   * `+N` for _displaying_ that a file's contents are inlined into the inode
   * `+s` for files that should be securely erased on delete
   * `+u` for files that should permit being "undeleted"
   * Others seem less relevant, but may be worth investigating.
 * "Remote Subvolumes", subvolumes that reside on a subset of the filesystem's
   physical devices and can be deactivated so that the physical devices can be
   detached (if all subvolumes on them are remote and deactivated) without
   unmounting the entire filesystem.
 * Per-file allocation policies - highly useful for VM disks, potential zvol
   killer.
 * Conversion of multi-device filesystems (potentially relevant for btrfs)
   * May be possible to use the new `getfsmap` ioctl; it seems to provide
     device information, and it looks like it iterates the whole FS (perfect
     for doing a whole-FS conversion, so it may even simplify single-device
     cases).
 * A fully-explicit mount syntax that does no scanning at all, possibly with
   a purely-userspace mount.bcachefs helper for scanning that then invokes
   the explicit form. Allows eliminating scanning/registration from the kernel.
   * Strawman: mount -t bcachefs bcachefs -o dev=...,dev=... /mount/point
 * Support the [`i_version` inode attribute][iversion], useful for NFSv4,
   backup tools, indexers, and other things that want to have a counter
   that updates whenever something changes about a file.
 * Support DAX, to allow using NVM devices directly and reduce page-cache
   pressure
   * Caveat: May need careful thinking if the kernel assumes it can write
     directly if DAX is supported; getting checksums invalidated would suck.
 * Support converting filesystems not just to bcachefs volumes, but to image
   files _on_ bcachefs volumes (and once reflinks are in, to both at once)
   * Notably, if conversion made an image as the first step, this could make
     the subsequent extraction of files to bcachefs files a _completely
     userspace process_, which merely reflinks ranges of content out. This
     would also avoid the risks facing the current system, where the converter
     notes where the data lives in the "inner" bcachefs image, and then the
     content is moved out from under it (such as by GC, or btrfs rebalancing).
 * Support "adopting" devices, by importing their content _into_ an existing
   filesystem using the existing extents on the old device. In theory, could
   supplant "adding a cache" in a minimally-invasive way.
 * Add a bcachefs command to export a path as a block device (satisfy some
   bcache use cases that bcachefs is clumsy for today, and in combination
   with the above two items may make migration from other filesystems easier)
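
To illustrate the fully-explicit mount syntax item above, here is a small sketch of what the userspace side might look like. It only builds and parses the strawman `-o dev=...,dev=...` option string; that syntax is a proposal from this list, not an existing interface, and the function names and device paths are hypothetical. The scanning step that would discover the devices is deliberately left out.

```python
def build_mount_command(devices, mountpoint):
    """Build an argv list for the strawman explicit mount syntax."""
    if not devices:
        raise ValueError("at least one device is required")
    for dev in devices:
        if "," in dev:
            # A comma would be ambiguous inside a comma-separated option list.
            raise ValueError(f"device path contains a comma: {dev!r}")
    options = ",".join(f"dev={dev}" for dev in devices)
    return ["mount", "-t", "bcachefs", "bcachefs", "-o", options, mountpoint]


def parse_dev_options(option_string):
    """Extract the device paths from a comma-separated mount option string."""
    devices = []
    for item in option_string.split(","):
        key, _, value = item.partition("=")
        if key == "dev" and value:
            devices.append(value)
    return devices


# Hypothetical device paths, for illustration only.
argv = build_mount_command(["/dev/sda1", "/dev/sdb1"], "/mnt")
print(" ".join(argv))
print(parse_dev_options("dev=/dev/sda1,dev=/dev/sdb1,ro"))
```

A helper along these lines would do the scanning itself and then pass the explicit device list down, which is what would let the kernel drop its own scanning and registration code.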

[AESSIV]: https://tools.ietf.org/html/rfc5297
[secdel]: https://lwn.net/Articles/462437/
[robigalia]: https://robigalia.org
[cephback]: http://bryanapperson.com/blog/ceph-osd-performance/
[pannier]: https://pdfs.semanticscholar.org/fa5f/3aa6de62e126e6fe2986c70a34e4d678860b.pdf
[triple-parity]: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg28964.html
[RFC6330]: https://tools.ietf.org/html/rfc6330
[LUKS2-keyslots]: https://fosdem.org/2018/schedule/event/cryptsetup/attachments/slides/2506/export/events/attachments/cryptsetup/slides/2506/fosdem18_cryptsetup_aead.pdf#page=24
[iversion]: https://jtlayton.wordpress.com/2016/12/16/the-inode-i_version-counter-in-linux/