summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-06-22bcachefs: bch2_verify_accounting_clean()Kent Overstreet
Verify that the in-memory accounting verifies the on-disk accounting after a clean shutdown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Convert bch2_replicas_gc2() to new accountingKent Overstreet
bch2_replicas_gc2() is used for garbage collection superblock replicas entries that are empty - this converts it to the new accounting scheme. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Convert gc to new accountingKent Overstreet
Rewrite fsck/gc for the new accounting scheme. This adds a second set of in-memory accounting counters for gc to use; like with other parts of gc we run all trigger in TRIGGER_GC mode, then compare what we calculated to existing in-memory accounting at the end. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Kill replicas_journal_resKent Overstreet
More dead code deletion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Kill fs_usage_onlineKent Overstreet
More dead code deletion. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Kill bch2_fs_usage_to_text()Kent Overstreet
Dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Delete journal-buf-sharded old style accountingKent Overstreet
More deletion of dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Kill writing old accounting to journalKent Overstreet
More ripping out of the old disk space accounting. Note that the new disk space accounting is incompatible with the old, and writing out old style disk space accounting with the new code is infeasible. This means upgrading and downgrading past this version requires regenerating accounting. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: kill bch2_fs_usage_read()Kent Overstreet
With bch2_ioctl_fs_usage(), this is now dead code. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Convert bch2_ioctl_fs_usage() to new accountingKent Overstreet
This converts bch2_ioctl_fs_usage() to read from the new disk accounting, via bch2_fs_replicas_usage_read(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Kill bch2_fs_usage_initialize()Kent Overstreet
Deleting code for the old disk accounting scheme. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: dev_usage updated by new accountingKent Overstreet
Reading disk accounting now requires an eytzinger lookup (see: bch2_accounting_mem_read()), but the per-device counters are used frequently enough that we'd like to still be able to read them with just a percpu sum, as in the old code. This patch special cases the device counters; when we update in-memory accounting we also update the old style percpu counters if it's a deice counter update. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Coalesce accounting keys before journal replayKent Overstreet
This fixes a performance regression in journal replay; without colaescing accounting keys we have multiple keys at the same position, which means journal_keys_peek_upto() has to skip past many overwritten keys - turning journal replay into an O(n^2) algorithm. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Disk space accounting rewriteKent Overstreet
Main part of the disk accounting rewrite. This is a wholesale rewrite of the existing disk space accounting, which relies on percepu counters that are sharded by journal buffer, and rolled up and added to each journal write. With the new scheme, every set of counters is a distinct key in the accounting btree; this fixes scaling limitations of the old scheme, where counters took up space in each journal entry and required multiple percpu counters. Now, in memory accounting requires a single set of percpu counters - not multiple for each in flight journal buffer - and in the future we'll probably also have counters that don't use in memory percpu counters, they're not strictly required. An accounting update is now a normal btree update, using the btree write buffer path. At transaction commit time, we apply accounting updates to the in memory counters, which are percpu counters indexed in an eytzinger tree by the accounting key. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: btree write buffer knows how to accumulate bch_accounting keysKent Overstreet
Teach the btree write buffer how to accumulate accounting keys - instead of having the newer key overwrite the older key as we do with other updates, we need to add them together. Also, add a flag so that write buffer flush knows when journal replay is finished flushing accounting, and teach it to hold accounting keys until that flag is set. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Accumulate accounting keys in journal replayKent Overstreet
Until accounting keys hit the btree, they are deltas, not new versions of the existing key; this means we have to teach journal replay to accumulate them. Additionally, the journal doesn't track precisely which entries have been flushed to the btree; it only tracks a range of entries that may possibly still need to be flushed. That means we need to compare accounting keys against the version in the btree and only flush updates that are newer. There's another wrinkle with the write buffer: if the write buffer starts flushing accounting keys before journal replay has finished flushing accounting keys, journal replay will see the version number from the new updates and updates from the journal will be lost. To avoid this, journal replay has to flush accounting keys first, and we'll be adding a flag so that write buffer flush knows to hold accounting keys until then. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: KEY_TYPE_accountingKent Overstreet
New key type for the disk space accounting rewrite. - Holds a variable sized array of u64s (may be more than one for accounting e.g. compressed and uncompressed size, or buckets and sectors for a given data type) - Updates are deltas, not new versions of the key: this means updates to accounting can happen via the btree write buffer, which we'll be teaching to accumulate deltas. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: use new mount APIThomas Bertschinger
This updates bcachefs to use the new mount API: - Update the file_system_type to use the new init_fs_context() function. - Define the new fs_context_operations functions. - No longer register bch2_mount() and bch2_remount(); these are now called via the new fs_context functions. - Define a new helper type, bch2_opts_parse that includes a struct bch_opts and additionally a printbuf used to save options that can't be parsed until after the FS is opened. This enables us to parse as many options as possible prior to opening the filesystem while saving those options that need the open FS for later parsing. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Add error code to defer option parsingThomas Bertschinger
This introduces a new error code, option_needs_open_fs, which is used to indicate that an attempt was made to parse a mount option prior to opening a filesystem, when that mount option requires an open filesystem in order to be validated. Returning this error results in bch2_parse_one_mount_opt() saving that option for later parsing, after the filesystem is opened. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: add printbuf arg to bch2_parse_mount_opts()Thomas Bertschinger
Mount options that take the name of a device that may be part of a filesystem, for example "metadata_target", cannot be validated until after the filesystem has been opened. However, an attempt to parse those options may be made prior to the filesystem being opened. This change adds a printbuf parameter to bch2_parse_mount_opts() which will be used to save those mount options, when they are supplied prior to the FS being opened, so that they can be parsed later. This functionality is not currently needed, but will be used after bcachefs starts using the new mount API to parse mount options. This is because using the new mount API, we will process mount options prior to opening the FS, but the new API doesn't provide a convenient way to "replay" mount option parsing. So we save these options ourselves to accomplish this. This change also splits out the code to parse a single option into bch2_parse_one_mount_opt(), which will be useful when using the new mount API which deals with a single mount option at a time. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: metadata version bucket_stripe_sectorsKent Overstreet
New on disk format version for bch_alloc->stripe_sectors and BCH_DATA_unstriped - accounting for unstriped data in stripe buckets. Upgrade/downgrade requires regenerating alloc info - but only if erasure coding is in use. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: BCH_DATA_unstripedKent Overstreet
Add a new pseudo data type, to track buckets that are members of a stripe, but have unstriped data in them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: bch_alloc->stripe_sectorsKent Overstreet
Add a separate counter to bch_alloc_v4 for amount of striped data; this lets us separately track striped and unstriped data in a bucket, which lets us see when erasure coding has failed to update extents with stripe pointers, and also find buckets to continue updating if we crash mid way through creating a new stripe. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: check_key_has_inode()Kent Overstreet
Consolidate duplicated checks for extents/dirents/xattrs - these keys should all have a corresponding inode of the correct type. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: allow passing full device path for target optionsThomas Bertschinger
The output of mount options such as "metadata_target" in `/proc/mounts` uses the full path to the device. mount(8) from util-linux uses the output from `/proc/mounts` to pass existing mount options when performing a remount, so bcachefs should accept as input the same form that it prints as output. Without this change: $ mount -t bcachefs -o metadata_target=vdb /dev/vdb /mnt $ strace mount -o remount /mnt ... fsconfig(4, FSCONFIG_SET_STRING, "metadata_target", "/dev/vdb", 0) = -1 EINVAL (Invalid argument) ... Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: bch2_printbuf_strip_trailing_newline()Kent Overstreet
Add a new helper to fix inode_to_text() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: don't expose "read_only" as a mount optionThomas Bertschinger
When "read_only" is exposed as a mount option, it is redundant with the standard option "ro" and gives users multiple ways to specify that a bcachefs filesystem should be mounted read-only. This presents the risk of having inconsistent options specified. This can be seen when remounting a read-only filesystem in read-write mode, using mount(8) from util-linux. Because mount(8) parses the existing mount options from `/proc/mounts` and applies them when remounting, it can end up applying both "read_only" and "rw": $ mount img -o ro /mnt $ strace mount -o remount,rw /mnt ... fsconfig(4, FSCONFIG_SET_FLAG, "read_only", NULL, 0) = 0 fsconfig(4, FSCONFIG_SET_FLAG, "rw", NULL, 0) = 0 ... Making "read_only" no longer a mount option means this edge case cannot occur. Fixes: 62719cf33c3a ("bcachefs: Fix nochanges/read_only interaction") Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: make offline fsck set read_only fs flagThomas Bertschinger
A subsequent change will remove "read_only" as a mount option in favor of the standard option "ro", meaning the userspace fsck command cannot pass it to the fsck ioctl. Instead, in offline fsck, set "read_only" kernel-side without trying to parse it as a mount option. For compatibility with versions of the "bcachefs fsck" command that try to pass the "read_only" mount opt, remove it from the mount options string prior to parsing when it is present. Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: btree_ptr_sectors_written() now takes bkey_s_cKent Overstreet
this is for the userspace metadata dump tool Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Check for bsets past bch_btree_ptr_v2.sectors_writtenKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Use try_cmpxchg() family of functions instead of cmpxchg()Uros Bizjak
Use try_cmpxchg() family of functions instead of cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: add might_sleep() annotations for fsck_err()Kent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: fix missing includeKent Overstreet
fs-common.h needs dirent.h for enum bch_rename_mode Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Use filemap_read() to simplify the execution flowYouling Tang
Using filemap_read() can reduce unnecessary code execution for non IOCB_DIRECT paths. Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Align the display format of `btrees/inodes/keys`Youling Tang
Before patch: ``` #cat btrees/inodes/keys u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0: mode=40755 flags= (16300000) bi_size=0 ``` After patch: ``` #cat btrees/inodes/keys u64s 17 type inode_v3 0:4096:U32_MAX len 0 ver 0: mode=40755 flags=(16300000) bi_size=0 ``` Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Fix missing spaces in journal_entry_dev_usage_to_textYouling Tang
Fixed missing spaces displayed in journal_entry_dev_usage_to_text while adjusting the display format to improve readability. before: ``` # bcachefs list_journal -a -t alloc:1:0 /dev/sdb ... dev_usage: dev=0free: buckets=233180 sectors=0 fragmented=0sb: buckets=13 sectors=6152 fragmented=504journal: buckets=1847 sectors=945664 fragmented=0btree: buckets=20 sectors=10240 fragmented=0user: buckets=1419 sectors=726513 fragmented=15cached: buckets=0 sectors=0 fragmented=0parity: buckets=0 sectors=0 fragmented=0stripe: buckets=0 sectors=0 fragmented=0need_gc_gens: buckets=0 sectors=0 fragmented=0need_discard: buckets=1 sectors=0 fragmented=0 ``` after: ``` # bcachefs list_journal -a -t alloc:1:0 /dev/sdb ... dev_usage: dev=0 free: buckets=233180 sectors=0 fragmented=0 sb: buckets=13 sectors=6152 fragmented=504 journal: buckets=1847 sectors=945664 fragmented=0 btree: buckets=20 sectors=10240 fragmented=0 user: buckets=1419 sectors=726513 fragmented=15 cached: buckets=0 sectors=0 fragmented=0 parity: buckets=0 sectors=0 fragmented=0 stripe: buckets=0 sectors=0 fragmented=0 need_gc_gens: buckets=0 sectors=0 fragmented=0 need_discard: buckets=1 sectors=0 fragmented=0 ``` Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-22bcachefs: Fix freeing of error pointersKent Overstreet
This fixes incorrect/missign checking of strndup_user() returns. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21bcachefs: Move the ei_flags setting to after initializationbcachefs-2024-06-22Youling Tang
`inode->ei_flags` setting and cleaning should be done after initialization, otherwise the operation is invalid. Fixes: 9ca4853b98af ("bcachefs: Fix quota support for snapshots") Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21bcachefs: Fix a UAF after write_super()Kent Overstreet
write_super() may reallocate the superblock buffer - but bch_sb_field_ext was referencing it; don't use it after the write_super call. Reported-by: syzbot+8992fc10a192067b8d8a@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21bcachefs: Use bch2_print_string_as_lines for long errKent Overstreet
printk strings get truncated to 1024 bytes; if we have a long error message (journal debug info) we need to use a helper. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21bcachefs: Fix I_NEW warning in race path in bch2_inode_insert()Kent Overstreet
discard_new_inode() is the correct interface for tearing down an indoe that was fully created but not made visible to other threads, but it expects I_NEW to be set, which we don't use. Reported-by: https://github.com/koverstreet/bcachefs/issues/690 Fixes: bcachefs: Fix race path in bch2_inode_insert() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21bcachefs: Replace bare EEXIST with private error codesKent Overstreet
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21bcachefs: Fix missing alloc_data_type_set()Kent Overstreet
Incorrect bucket state transition in the discard path; when incrementing a bucket's generation number that had already been discarded, we were forgetting to check if it should be need_gc_gens, not free. This was caught by the .invalid checks in the transaction commit path, causing us to go emergency read only. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-21closures: Change BUG_ON() to WARN_ON()Kent Overstreet
If a BUG_ON() can be hit in the wild, it shouldn't be a BUG_ON() For reference, this has popped up once in the CI, and we'll need more info to debug it: 03240 ------------[ cut here ]------------ 03240 kernel BUG at lib/closure.c:21! 03240 kernel BUG at lib/closure.c:21! 03240 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP 03240 Modules linked in: 03240 CPU: 15 PID: 40534 Comm: kworker/u80:1 Not tainted 6.10.0-rc4-ktest-ga56da69799bd #25570 03240 Hardware name: linux,dummy-virt (DT) 03240 Workqueue: btree_update btree_interior_update_work 03240 pstate: 00001005 (nzcv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--) 03240 pc : closure_put+0x224/0x2a0 03240 lr : closure_put+0x24/0x2a0 03240 sp : ffff0000d12071c0 03240 x29: ffff0000d12071c0 x28: dfff800000000000 x27: ffff0000d1207360 03240 x26: 0000000000000040 x25: 0000000000000040 x24: 0000000000000040 03240 x23: ffff0000c1f20180 x22: 0000000000000000 x21: ffff0000c1f20168 03240 x20: 0000000040000000 x19: ffff0000c1f20140 x18: 0000000000000001 03240 x17: 0000000000003aa0 x16: 0000000000003ad0 x15: 1fffe0001c326974 03240 x14: 0000000000000a1e x13: 0000000000000000 x12: 1fffe000183e402d 03240 x11: ffff6000183e402d x10: dfff800000000000 x9 : ffff6000183e402e 03240 x8 : 0000000000000001 x7 : 00009fffe7c1bfd3 x6 : ffff0000c1f2016b 03240 x5 : ffff0000c1f20168 x4 : ffff6000183e402e x3 : ffff800081391954 03240 x2 : 0000000000000001 x1 : 0000000000000000 x0 : 00000000a8000000 03240 Call trace: 03240 closure_put+0x224/0x2a0 03240 bch2_check_for_deadlock+0x910/0x1028 03240 bch2_six_check_for_deadlock+0x1c/0x30 03240 six_lock_slowpath.isra.0+0x29c/0xed0 03240 six_lock_ip_waiter+0xa8/0xf8 03240 __bch2_btree_node_lock_write+0x14c/0x298 03240 bch2_trans_lock_write+0x6d4/0xb10 03240 __bch2_trans_commit+0x135c/0x5520 03240 btree_interior_update_work+0x1248/0x1c10 03240 process_scheduled_works+0x53c/0xd90 03240 worker_thread+0x370/0x8c8 03240 kthread+0x258/0x2e8 03240 ret_from_fork+0x10/0x20 03240 Code: aa1303e0 d63f0020 a94363f7 17ffff8c (d4210000) 03240 ---[ end trace 0000000000000000 ]--- 03240 Kernel panic - not syncing: Oops - BUG: Fatal exception 03240 SMP: stopping secondary CPUs 03241 SMP: failed to stop secondary CPUs 13,15 03241 Kernel Offset: disabled 03241 CPU features: 0x00,00000003,80000008,4240500b 03241 Memory Limit: none 03241 ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]--- 03246 ========= FAILED TIMEOUT copygc_torture_no_checksum in 7200s Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-20bcachefs: fix alignment of VMA for memory mapped files on THPYouling Tang
With CONFIG_READ_ONLY_THP_FOR_FS, the Linux kernel supports using THPs for read-only mmapped files, such as shared libraries. However, the kernel makes no attempt to actually align those mappings on 2MB boundaries, which makes it impossible to use those THPs most of the time. This issue applies to general file mapping THP as well as existing setups using CONFIG_READ_ONLY_THP_FOR_FS. This is easily fixed by using thp_get_unmapped_area for the unmapped_area function in bcachefs, which is what ext2, ext4, fuse, xfs and btrfs all use. Similar to commit b0c582233a85 ("btrfs: fix alignment of VMA for memory mapped files on THP"). Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-20bcachefs: Fix safe errors by defaultKent Overstreet
i.e. the start of automatic self healing: If errors=continue or fix_safe, we now automatically fix simple errors without user intervention. New error action option: fix_safe This replaces the existing errors=ro option, which gets a new slot, i.e. existing errors=ro users now get errors=fix_safe. This is currently only enabled for a limited set of errors - initially just disk accounting; errors we would never not want to fix, and we don't want to require user intervention (i.e. to make sure a bug report gets filed). Errors will still be counted in the superblock, so we (developers) will still know they've been occuring if a bug report gets filed (as bug reports typically include the errors superblock section). Eventually we'll be enabling this for a much wider set of errors, after we've done thorough error injection testing. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19bcachefs: Fix bch2_trans_put()Kent Overstreet
reference: https://github.com/koverstreet/bcachefs/issues/692 trans->ref is the reference used by the cycle detector, which walks btree_trans objects of other threads to walk the graph of held locks and issue wakeups when an abort is required. We have to wait for the ref to go to 1 before freeing trans->paths or clearing trans->locking_wait.task. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19bcachefs: set_worker_desc() for delete_dead_snapshotsKent Overstreet
this is long running - help users see what's going on Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19bcachefs: Fix bch2_sb_downgrade_update()Kent Overstreet
Missing enum conversion Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-06-19bcachefs: Handle cached data LRU wraparoundKent Overstreet
We only have 48 bits for the LRU time field, which is insufficient to prevent wraparound. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>