Snapshots & subvolumes:
=======================

The short version: bcachefs snapshots are a different approach - we're not
using COW btrees. Instead, they're based on extending the keys of filesystem
items (inodes, dirents, xattrs, extents) with a version number - a snapshot
ID - in the low bits.

Snapshot IDs form a tree, which will be recorded in the snapshots btree. The
root snapshot ID is `U32_MAX`, and we allocate new snapshot IDs growing down,
so that the parent of a given snapshot ID is always larger than that ID.

To create a new writeable snapshot, we allocate two new snapshot IDs. We
update the existing subvolume with one of the new snapshot IDs, and assign the
other snapshot ID to the new snapshot.

When we do a lookup for a filesystem item, we have to check if the snapshot ID
of the key we found is an ancestor of the snapshot ID we're searching for, and
filter out items that aren't (a sketch of this check follows the Snapshots
section below).

Subvolumes:
===========

Subvolumes are needed for two reasons:

* They're the mechanism for accessing snapshots.

* They're also a way of isolating different parts of the filesystem hierarchy
  from snapshots, or taking snapshots that aren't global. I.e. you'd create a
  subvolume for your database so that filesystem snapshots don't COW it, or
  create subvolumes for each home directory so that users can snapshot their
  own home directories.

The functionality and userspace interface for snapshots and subvolumes are
roughly modelled after btrfs, but simplified.

Subvolumes in bcachefs are just fancy directories. Internally, we introduce a
new dirent type that can point to subvolumes instead of inodes, and the
subvolume has a pointer to the root inode. Subvolumes get their own table
(btree), and subvolume keys have fields for the root inode number and the
snapshot ID.

Subvolumes have no name outside of the filesystem hierarchy. This means that,
in order to enumerate and list subvolumes, we need to be able to reconstruct
their paths.

To reconstruct paths, we're adding inode backpointers - two new inode fields,
one for the inode number of the directory an inode is in, and one for the
dirent offset. We're only adding fields for a single backpointer, i.e. we're
not handling hardlinks yet - instead, we set an inode flag indicating the
backpointer fields are untrusted whenever we create a hardlink. If we ever do
want to add multiple backpointers for files with hardlinks, we'll need to add
another btree where each backpointer gets its own key; this should be good
enough for now.

Subvolume root inodes have two more fields: one for the subvolume ID, and
another for the parent subvolume ID. The subvolume ID field is only set for
subvolume roots, because otherwise taking a snapshot would require updating
every inode in that subvolume. With these fields and inode backpointers, we'll
be able to reconstruct a path to any directory, or to any file that hasn't
been hardlinked.

Snapshots:
==========

We're also adding another table (btree) for snapshot keys. Snapshot keys form
a tree where each node is just a u32 snapshot ID. The btree iterator code that
filters by snapshot ID assumes that parent IDs are always larger than child
IDs, so the root starts at `U32_MAX`. And there will be multiple trees -
creating a new empty subvolume allocates a new snapshot ID that has no parent
node.

Any filesystem operation that's within a subvolume starts by looking up the
key for that subvolume to get the current snapshot ID, which is then used for
both lookups and updates.
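To make the pieces introduced so far concrete, here's a minimal sketch of the
values that subvolume and snapshot keys might carry. This is illustrative
only - the field names and layout are assumptions, not the actual on-disk
format:

```
#include <stdint.h>

/* Assumed layout, for illustration - not the real on-disk format: */

/* Value in the subvolumes btree, keyed by subvolume ID: */
struct subvolume_val {
	uint64_t root_inode;	/* root inode number */
	uint32_t snapshot_id;	/* current snapshot ID of this subvolume */
};

/* Value in the snapshots btree, keyed by snapshot ID: */
struct snapshot_val {
	uint32_t parent;	/* parent snapshot ID; 0 at the root of a tree */
};
```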
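And here's a minimal sketch of the ancestor check described under "Snapshots &
subvolumes" above, relying on the invariant that a parent snapshot ID is
always larger than its children. `snapshot_parent()` is a hypothetical helper
standing in for a lookup in the snapshots btree:

```
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns @id's parent snapshot ID (e.g. via a
 * snapshots btree lookup), or 0 if @id is the root of its tree. */
uint32_t snapshot_parent(uint32_t id);

/*
 * Is @ancestor an ancestor of @id (or @id itself)?  Since a parent ID
 * is always larger than its children, we can walk up from @id and
 * stop as soon as we reach or pass @ancestor:
 */
static bool snapshot_is_ancestor(uint32_t id, uint32_t ancestor)
{
	while (id && id < ancestor)
		id = snapshot_parent(id);

	return id == ancestor;
}
```

A lookup searching in snapshot `s` would then keep a key with snapshot ID `k`
only if `snapshot_is_ancestor(s, k)` returns true.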
Since, as mentioned previously, only subvolume root inodes have the subvolume
field in the btree, we have to remember what subvolume we're in in the
in-memory `bch_inode_info`.

The btree iterator code is getting two new flags - `BTREE_ITER_ALL_SNAPSHOTS`
and `BTREE_ITER_FILTER_SNAPSHOTS` - that control how iteration handles the
snapshot field of the key. One of these flags should be specified for any
iterator over a btree that uses the snapshot field.

`BTREE_ITER_ALL_SNAPSHOTS` means don't handle the snapshot field specially: it
returns every key, and advancing/rewinding the iterator position
increments/decrements the snapshot field.

`BTREE_ITER_FILTER_SNAPSHOTS` means incrementing/decrementing the iterator
position does not include the snapshot field - there's only one iterator
position for each inode number:offset - and we return the key that matches the
specified snapshot ID, or, if there's no exact match, the key from the nearest
ancestor snapshot.

The update path, `bch2_trans_commit()`, now has more work to do:

* When deleting, we have to check if there's another key at the same position
  in an ancestor snapshot that would become visible - if so, we need to insert
  a whiteout instead.

* When the key being overwritten is in an ancestor snapshot ID (i.e. it will
  no longer be visible in the current snapshot), we should also check whether
  it's visible in any other snapshots, and if not, delete it.

Extents turn out not to require much special handling: extent handling has
already been moved out of the core btree code, and there's now a pass at the
top of `bch2_trans_commit()` for extents that walks the list of updates,
compares them to what's being overwritten, and, for extents being partially
overwritten, generates more updates for the existing extents. This all does
basically what we want for snapshots and won't have to change much.

There are implications for splitting and merging of existing extents, though.
If we have to split an existing extent (i.e. when copygc is moving around
existing data and the write path had to fragment it), then for each child
snapshot ID of the extent's snapshot ID in which the extent is not visible, we
have to emit a whiteout for the new fragment. Conversely, existing extents may
not be merged if one of them is visible in a child snapshot and the other is
not.

Snapshot deletion:
==================

In the current design, deleting a snapshot requires walking every btree that
has snapshots (extents, inodes, dirents and xattrs) to find and delete keys
with the given snapshot ID. It would be nice to improve this.

Permissions:
============

Creating a new empty subvolume can be done by untrusted users anywhere they
could call mkdir().

Creating a snapshot will also be an untrusted operation - the only additional
requirement being that untrusted users must own the root of the subvolume
being snapshotted.

Fsck:
=====

In order to have acceptable performance, the fsck code is going to have to be
reworked to use `BTREE_ITER_ALL_SNAPSHOTS` and check keys in all snapshots at
once. This work hasn't been started yet.
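As a rough sketch of what that reworked pass might look like, here's a single
walk over the extents btree that sees every snapshot's keys at once. The
iterator calls are paraphrased from the general shape of the bcachefs btree
API - the exact names and signatures here are assumptions, not the real
interface:

```
/*
 * Hedged sketch: one fsck pass over the extents btree across all
 * snapshots.  Function names/signatures are assumptions modelled on
 * the general shape of the bcachefs btree iterator interface.
 */
static int check_extents(struct bch_fs *c)
{
	struct btree_trans trans;
	struct btree_iter iter;
	struct bkey_s_c k;
	int ret;

	bch2_trans_init(&trans, c, 0, 0);

	/*
	 * With BTREE_ITER_ALL_SNAPSHOTS nothing is filtered out: keys
	 * come back in inode:offset:snapshot order, so every snapshot's
	 * version of a given extent arrives consecutively and can be
	 * checked with per-position state, in a single pass:
	 */
	for_each_btree_key(&trans, iter, BTREE_ID_extents, POS_MIN,
			   BTREE_ITER_ALL_SNAPSHOTS, k, ret) {
		/* ... check k, accumulating state per inode:offset ... */
	}
	bch2_trans_iter_exit(&trans, &iter);

	return bch2_trans_exit(&trans) ?: ret;
}
```

Walking each btree once, instead of once per snapshot, is what would make
fsck's performance acceptable here.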