authorKent Overstreet <kent.overstreet@gmail.com>2021-05-23 02:47:13 -0400
committerKent Overstreet <kent.overstreet@gmail.com>2021-05-23 02:47:13 -0400
commitd1386efe8b64c2fb1e6a269824ead10f01f785f3 (patch)
tree05dc385148c870a7445ac5ec0da108fb463253e8
parent49eaf3799da30167e9d6f46afd1e4540692bae87 (diff)
fix typo
-rw-r--r--  Allocator.mdwn  80
-rw-r--r--  Fsck.mdwn       75
-rw-r--r--  Roadmap.mdwn     2
-rw-r--r--  Snapshots.mdwn  42
4 files changed, 127 insertions, 72 deletions
diff --git a/Allocator.mdwn b/Allocator.mdwn
index 9bad84a..2c15a68 100644
--- a/Allocator.mdwn
+++ b/Allocator.mdwn
@@ -45,11 +45,77 @@ available buckets. It's too easy to end up with bugs where this scanning is
happening repeatedly (we have one now...), and it's a scalability issue and we
shouldn't be doing it at all.
-Also, we would really like to get rid of the in memory array of buckets, this is
-another scalability issue. The in memory struct bucket now mostly just mirrors
-what's in the btree, in `KEY_TYPE_alloc` keys. The one exception is
-`journal_seq`, which is the journal sequence number of the last btree update
-that touched that bucket. It's needed for knowing whether we need to flush the
-journal before allocating and writing to that bucket - perhaps it should just be
-added to `struct bch_alloc` and stored in the keys in the btree.
+## In memory bucket array - kill
+We need to get rid of the in memory array of buckets - it's another scalability
+issue, and at this point it's an anachronism since now it mostly just mirrors
+what's in the btree, in `KEY_TYPE_alloc` keys. Exceptions:
+
+* `journal_seq`, which is the journal sequence number of the last btree update
+ that touched that bucket. It's needed for knowing whether we need to flush the
+ journal before allocating and writing to that bucket - we need to create
+ another data structure for buckets that can't be used until the journal is
+ flushed.
+
+* `owned_by_allocator`, which is true if an `open_bucket` referencing that
+ bucket exists - `open_buckets` represent buckets currently being allocated
+ from; they're reference counted to prevent buckets from being double
+ allocated (writes take a reference when they allocate space, and release
+ that ref after they create the extent that points to that space). See the
+ sketch after this list.
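+
+As a sketch of what would remain in memory once the mirror is gone
+(hypothetical names and layout, not the actual bcachefs definitions):
+
+    #include <linux/atomic.h>
+    #include <linux/types.h>
+
+    /* Per-bucket state that can't live in KEY_TYPE_alloc keys: */
+    struct bucket_live_state {
+        u64      journal_seq;   /* last btree update touching this bucket */
+        atomic_t open_count;    /* open_buckets currently referencing it */
+    };
+
+    /* A bucket may be allocated and written only once the journal has
+     * flushed past journal_seq and no open_bucket still references it: */
+    static bool bucket_reusable(struct bucket_live_state *b, u64 flushed_seq)
+    {
+        return b->journal_seq <= flushed_seq &&
+               !atomic_read(&b->open_count);
+    }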
+
+Currently, the main obstacle to getting rid of the bucket array is that we need
+to start the allocator threads to run journal replay, but we can't use btree
+iterators until journal replay has at least finished replaying updates to
+interior btree nodes.
+
+We have code in recovery.c for iterators that iterate over the btree with keys
+from the journal overlaid; it may be time to move this support into regular
+btree iterators in `btree_iter.c`.
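+
+Conceptually that overlay is just a two-way merge of sorted key streams, with
+the journal's key winning when both streams have a key at the same position.
+A minimal standalone sketch (types and names invented for illustration, not
+the recovery.c API):
+
+    #include <stddef.h>
+    #include <stdint.h>
+
+    struct key    { uint64_t pos; /* ... */ };
+    struct stream { const struct key *keys; size_t nr, i; };
+
+    static const struct key *peek(struct stream *s)
+    {
+        return s->i < s->nr ? &s->keys[s->i] : NULL;
+    }
+
+    /* Next key in combined order; journal overrides btree on a tie: */
+    static const struct key *overlay_next(struct stream *btree,
+                                          struct stream *journal)
+    {
+        const struct key *b = peek(btree), *j = peek(journal);
+
+        if (j && (!b || j->pos <= b->pos)) {
+            if (b && b->pos == j->pos)
+                btree->i++;     /* btree key overridden, skip it */
+            journal->i++;
+            return j;
+        }
+        if (b)
+            btree->i++;
+        return b;
+    }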
+
+## WHAT WE NEED IN THE REWRITE:
+
+### Data structures:
+
+* buckets containing cached data that can be used if they are invalidated
+
+* buckets that are waiting on journal commit until they can be used
+
+* buckets ready to be discarded
+
+* buckets that are ready to be used now
+
+Buckets that are ready to be used now should be stored in sorted order.
+
+Buckets that are waiting on journal commit should be indexed by what journal
+seq needs to be flushed.
+
+Buckets that contain cached data should be stored on a heap.
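+
+A sketch of how those four sets might hang together (names invented, kernel
+style types assumed; the containers are placeholders for whatever we settle
+on):
+
+    #include <linux/list.h>
+    #include <linux/rbtree.h>
+    #include <linux/types.h>
+
+    struct bucket;
+
+    struct bucket_sets {
+        /* ready to use now - kept in sorted order: */
+        struct rb_root   free;
+        /* awaiting journal commit - indexed by journal seq to flush: */
+        struct rb_root   need_journal_commit;
+        /* awaiting discard: */
+        struct list_head need_discard;
+        /* containing cached data - binary heap, ordered by eviction
+         * priority (e.g. oldest_gen): */
+        struct bucket    **cached;
+        size_t           cached_nr;
+    };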
+
+### TASK LIST:
+
+* Move incrementing of bucket gens to happen when a bucket becomes empty (and
+ doesn't have an `open_bucket` pointing to it) - also need to add code to
+ `open_bucket_put()` to increment bucket gen if necessary (see the sketch
+ after this list)
+
+* At the same time, add bucket to list of buckets awaiting journal commit
+
+* Add a flags field to `bch_alloc_v2` and a flag indicating bucket needs to be
+ discarded
+
+* Add a flag indicating whether a bucket had cached data in it - if not, we
+ don't need to track `oldest_gen`
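+
+A sketch of the first two tasks combined (struct layout and helper names
+invented; not the current `open_bucket_put()`):
+
+    #include <linux/atomic.h>
+    #include <linux/types.h>
+
+    struct bucket      { u8 gen; /* ... */ };
+    struct open_bucket { atomic_t ref; struct bucket *bucket; };
+
+    bool bucket_is_empty(struct bucket *);
+    void bucket_await_journal_commit(struct bucket *);
+
+    /* On dropping the last reference to a now-empty bucket: bump the gen
+     * so stale pointers into the bucket become invalid, then hold the
+     * bucket back until that gen update has been journalled: */
+    static void open_bucket_put(struct open_bucket *ob)
+    {
+        struct bucket *b = ob->bucket;
+
+        if (atomic_dec_and_test(&ob->ref) && bucket_is_empty(b)) {
+            b->gen++;
+            bucket_await_journal_commit(b);
+        }
+    }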
+
+
+### STATE TRANSITIONS:
+
+Cached data bucket -> empty uncommitted bucket
+Dirty data bucket -> empty uncommitted bucket
+
+In either case, the empty bucket first goes on the list of buckets awaiting
+journal commit.
+
+After journal commit:
+
+empty uncommitted bucket -> empty undiscarded bucket
+
+After discard:
+
+empty undiscarded bucket -> empty bucket, ready to be allocated
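+
+Or, as a sketch of the full lifecycle (state names invented):
+
+    enum bucket_state {
+        BUCKET_cached,              /* holds cached data */
+        BUCKET_dirty,               /* holds dirty user data */
+        BUCKET_need_journal_commit, /* empty, awaiting journal flush */
+        BUCKET_need_discard,        /* journal flushed, awaiting discard */
+        BUCKET_free,                /* ready to be allocated */
+    };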
diff --git a/Fsck.mdwn b/Fsck.mdwn
index a6fc1d1..79c76e9 100644
--- a/Fsck.mdwn
+++ b/Fsck.mdwn
@@ -29,6 +29,7 @@ What it does have to check is:
* Check that extents do not exist past inode's `i_size`
* Count extents for each inode and check inode's `i_sectors`.
+
* Check that root directory exists
* Check that lost+found exists
@@ -48,25 +49,28 @@ What it does have to check is:
Changes for snapshots:
======================
-Subvolumes vs. snapshots:
+## References between keys:
+
+Filesystem code operates within a subvolume: at the start of a btree transaction
+it looks up the subvolume's snapshot ID, and passes that to the btree
+iterator code so that lookups always return the newest key visible to that
+snapshot ID.
-Filesystem code has to know what subvolume it is in, in order to fetch the
-current snapshot ID at the start of every btree transaction. Fsck code does not
-know or care what subvolume a given key belongs to - a key may belong to multiple
-subvolumes, if it's in a snapshot that's an interior node in the snapshots
-tree.
+The fsck code does not operate within a specific subvolume or snapshot ID - we
+have to iterate over and process keys in natural btree order, for performance.
+From fsck's point of view, any key that points to another key may now point to
+multiple versions of that key in different snapshots.
-Instead, when one key references another (e.g. a dirent that points to an
-inode), it does the lookup for the inode by searching for an inode with a
-snapshot ID equal to or an ancestor of the dirent key's snapshot ID.
-
-IOW, snaphsot IDs represent (branchable) "points in time", and fsck wants to
-check that every view of the filesystem, at every point in time, is consistent.
+For example, given a dirent that points to an inode, we search for the first
+inode with a snapshot ID equal to or an ancestor of the snapshot ID of the
+dirent - a normal `BTREE_ITER_FILTER_WHITEOUTS` lookup, but using the snapshot
+ID of the dirent, not the subvolume - this key should exist, if the dirent
+points to a valid inode. Additionally, the dirent also references inodes at that
+position in snapshot IDs that are descendants of the dirent's snapshot ID -
+provided a newer dirent in a snapshot ID equal to or an ancestor of the inode's
+snapshot ID doesn't exist.
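+
+A sketch of the ancestor test this lookup depends on (parent-pointer
+representation and names invented for illustration):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    /* Snapshot IDs are nonzero; 0 means "no parent": */
+    struct snapshot { uint32_t parent; };
+
+    /* True if `ancestor` is `id` itself or an ancestor of `id`: */
+    static bool snap_is_ancestor(const struct snapshot *tree,
+                                 uint32_t id, uint32_t ancestor)
+    {
+        while (id && id != ancestor)
+            id = tree[id].parent;
+        return id == ancestor;
+    }
+
+The dirent -> inode lookup then scans inodes at that inode number, newest
+snapshot first, returning the first one whose snapshot ID passes
+`snap_is_ancestor(tree, dirent_snapshot, inode_snapshot)`.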
-Checking the filesystem for consistency in this fashion is conceptually
-straightforward - every key has a snapshot ID, and we use that when looking up
-anything that it references. However, we'll have to be careful about how we
-repair inconsistencies when the snapshot ID is an interior node:
+## Repair:
If we repair an inconsistency by deleting a key, we need to ensure that key is
not referenced by child snapshots. For example, if we find an extent that does
@@ -119,42 +123,3 @@ call. When we're dealing with a file that has keys with many different snapshot IDs
we need to make sure we don't have any hidden O(n^2) algorithms, but if fsck
requires hundreds of megabytes of state to check these files with worst
case numbers of overwrites, that's not the end of the world.
-
-## Checking directory structure:
-
-The current approach of doing a DFS from the root directory still works with
-snapshots, but algorithmically is too slow. If we have many sparse snapshots,
-the DFS would recurse into each of those snapshots, walking directory structure
-it has already seen in the original subvolume - but those snapshots could have
-directory structure changes in one or two places, so it would have to. We need
-an algorithm that iterators over keys, not filesystem structure.
-
-Fortunately, since backpointers from inodes to dirents were recently added we
-can take a different approach. We walk the inodes btree, and for every directory
-inode we find we chase backpointers until we get to the root directory. Note
-that since previous passes have already checked backpointers, we're really just
-check for loops here. This will still check the same structure repeatedly, like
-the DFS approach, but here we can solve it by adding every inum:snapshot tuple
-to a hash table after we've verified it has a path to the root.
-
-## Checking inode link counts:
-
-This one will be tedious, but algorithmically doesn't change much. The existing
-code walks the dirents btree in natural key order, and then for each dirent
-increments a link count in a radix tree. After walking the dirents, we walk the
-inodes btree and compare link counts we calculated to what's in the inode.
-
-As before, we'll walk the dirents btree in natural key order. However, we can't
-accumulate link counts without also looking at what actually exists in the inodes
-btree: a dirent may point to multiple inodes with different snapshot IDs, and
-which of those we increment also depends on which snapshots that dirent is
-visible in (we'll already have a helper for that part, as discussed previously).
-
-This means that instead of using a radix tree, we'll have to start by walking
-the inodes btree, and for every inode we find creating a new in memory data
-structure (eytzinger layout search tree, perhaps) indexed by inode
-number:snapshot id and containing just the link count we'll be calculating.
-
-Fortunately hardlinks are usually rare in actual practice. We can make use of
-the fact that we've already checked inode backpointers and only include inodes
-that actually have hardlinks in this table.
diff --git a/Roadmap.mdwn b/Roadmap.mdwn
index 8cba90b..f00442d 100644
--- a/Roadmap.mdwn
+++ b/Roadmap.mdwn
@@ -111,7 +111,7 @@ index an entire SSD worth of blocks, and for persistent data structures btrees
have a number of advantages over hash tables (can do extents, and when access
patterns have locality they preserve that locality). Using singleton btrees is
a major simplification when writing a filesystem, but it does mean that
-everything is leaving rather heavily on the performance and scalability of the
+everything is leaning rather heavily on the performance and scalability of the
btree implementation.
Bcachefs b-trees are a hybrid compacting data structure - they share some
diff --git a/Snapshots.mdwn b/Snapshots.mdwn
index 6127ab4..bbd85be 100644
--- a/Snapshots.mdwn
+++ b/Snapshots.mdwn
@@ -28,10 +28,10 @@ Subvolumes are needed for two reasons:
* They're the mechanism for accessing snapshots
* Also, they're a way of isolating different parts of the filesystem hierarchy
- from snapshots, or taking snapshots that aren't global. I.e. you'd create a
- subvolume for your database so that filesystem snapshots don't COW it, or
- create subvolumes for each home directory so that users can snapshot their
- own home directories.
+ from snapshots, or taking snapshots that aren't global. I.e. you'd create a
+ subvolume for your database so that filesystem snapshots don't COW it, or
+ create subvolumes for each home directory so that users can snapshot their
+ own home directories.
The functionality and userspace interface for snapshots and subvolumes are
roughly modelled after btrfs, but simplified.
@@ -88,13 +88,13 @@ matches the specified snapshot ID, or the first ancestor if not found.
The update path, `bch2_trans_commit()` now has more work to do:
* When deleting, we have to check if there's another key at the same position
- in an ancestor snapshot that would become visible - if so, we need to insert
- a whiteout instead.
+ in an ancestor snapshot that would become visible - if so, we need to insert
+ a whiteout instead.
* When there was a key at the current position in an ancestor snapshot ID
- that's being overwritten (i.e. it will no longer be visible in the current
- snapshot) - we should also check if it's visible in any other snapshots, and
- if not, delete it.
+ that's being overwritten (i.e. it will no longer be visible in the current
+ snapshot) - we should also check if it's visible in any other snapshots, and
+ if not, delete it (see the sketch below).
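+
+As a sketch of the two checks (stub predicates stand in for the real
+visibility checks; this is not the actual `bch2_trans_commit()` code):
+
+    #include <stdbool.h>
+    #include <stdint.h>
+
+    struct bpos { uint64_t inode, offset; uint32_t snapshot; };
+
+    bool key_visible_in_ancestor(struct bpos);
+    bool key_visible_in_other_snapshot(struct bpos);
+    void insert_whiteout(struct bpos);
+    void delete_key(struct bpos);
+
+    /* Deletion: if a key at this position in an ancestor snapshot would
+     * become visible, hide it with a whiteout instead of just deleting: */
+    static void trans_delete(struct bpos pos)
+    {
+        if (key_visible_in_ancestor(pos))
+            insert_whiteout(pos);
+        else
+            delete_key(pos);
+    }
+
+    /* Overwriting a key from an ancestor snapshot: if no other snapshot
+     * still sees the old key, it can be deleted outright: */
+    static void trans_overwrite_old(struct bpos old)
+    {
+        if (!key_visible_in_other_snapshot(old))
+            delete_key(old);
+    }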
Extents turn out to not require much special handling: extent handling has
already been moved out of the core btree code, and there's now a pass at the top
@@ -194,6 +194,25 @@ accounting code uses an intricate system of percpu counters that are flushed
just prior to journal writes, but we should be able to extend the btree key
cache code to accommodate this.
+Right now, the number of inodes we count (as viewed with `df -i`) is just the
+number of keys in the inodes btree - i.e. the number of inodes across all
+snapshots.
+This is probably not what we want.
+
+Breaking out accounting by snapshot ID will be straightforward, but what the
+user really wants is usage by subvolume, and it's not clear how to do that. Many
+keys will have a snapshot ID that is for an interior node in the snapshots tree,
+a node that isn't pointed to by any subvolume but has multiple child nodes owned
+by multiple subvolumes.
+
+What the user wants to know is "how much disk space will be freed up if I
+delete this snapshot/subvolume" - in particular, how much disk space historical
+snapshots are taking up. This is nontrivial.
+
+Suppose we're taking periodic RO snapshots, and then the snapshot tree is just a
+linear chain of nodes:
+
+a -> b -> c -> d -> e
+
+The space freed by deleting any one of these snapshots is the space used by
+keys visible in that snapshot and in no other - which is not something our
+current accounting tracks.
+
Recursive snapshots:
====================
@@ -210,3 +229,8 @@ The fsck code is going to have to be reworked to use `BTREE_ITER_ALL_SNAPSHOTS`
and check keys in all snapshots all at once, in order to have acceptable
performance. This has not been started on yet, and is expected to be the
trickiest and most complicated part.
+
+Quotas:
+=======
+
+Todo