Kent Overstreet 4225294d16 replace try_lock() with lock_blocking() across UI thread

Add lock_blocking() to TrackedMutex: blocks current thread using
block_in_place + futures::executor::block_on, safe for sync contexts.

Replace all try_lock() calls with lock_blocking() in slash commands,
UI rendering, and status reads. Lock hold times are fast enough that
blocking briefly is fine, and this eliminates the spurious 'lock
unavailable' paths that were never actually hit.

Kept rx_mutex.try_lock() in mod.rs (std::sync::Mutex for stderr rx).

2026-04-25 15:35:14 -04:00

6.4 KiB

Raw Blame History

Bcachefs CI triage — 2026-04-20 autonomous session

Analysis of failures at f51f0a6b1a26 (BTREE_NODE_permanent). 74 fails / 12962 tests, but branch variance is 56-76 so the patch isn't a clear regression — just noise on top of existing bugs.

migrate_from_ext4 discard panic — root-cause hypothesis

Assertion (fs/bcachefs/alloc/discard.c:159):

Discarded bucket that is no longer BCH_DATA_need_discard!
bucket 0:36:0  data_type user  dirty_sectors 2016
need_discard 1  need_inc_gen 1
journal_seq_nonempty 95  journal_seq_empty 181

Your commit c84503104e6a (Apr 18) moved this check from recoverable (bch2_fs_emergency_read_only) to hard panic() and also moved bch2_bucket_is_open_safe() to AFTER locking the alloc key. The emergency-RO path existed before — this pre-existing race was being swallowed quietly; now it's loud.

Race mechanism (hypothesis):

bch2_discard_one_bucket reads alloc key, confirms data_type == need_discard
Calls discard_in_flight_add(check=false) to register in in_flight
bch2_trans_unlock(trans) — releases btree lock (line 313)
discard_submit(ca, bucket, fastpath) — physical bio dispatched, takes milliseconds
During bio flight: migrate tool writes an alloc key for bucket 36 with data_type=user (claiming it holds ext4 data). NEED_DISCARD=1 flag remains because migrate doesn't clear it.
Bio completes → discard_endio → discard_mark_free re-reads alloc key → sees data_type=user → panic

Why migrate bypasses the normal allocator gate:

bcachefs migrate is an in-place ext4→bcachefs conversion. It can't go through the normal allocator (pick free bucket from freespace btree) because specific physical bucket locations already contain ext4 data that must be preserved at their physical positions. migrate writes alloc keys directly for the buckets ext4 was using.

Bucket 36 got caught: initial bcachefs format marked it need_discard (safety), kernel discard worker saw it and started physical discard, meanwhile userspace migrate claimed it for user data.

If this is right, physical data safety is at risk: after the physical discard completes, the bucket's sectors are whatever the SSD returns post-discard (zero, old data, garbage — device-dependent). migrate set alloc keys pointing at "user data" in those sectors. The data migrate wanted to preserve may already be GONE at that point.

Candidate fixes (for Kent to evaluate):

Cleanest, but requires userspace change: bcachefs migrate should either (a) format the new bcachefs without marking buckets need_discard (the data isn't deallocated, it's being claimed) OR (b) wait for pending discards to drain before writing any alloc keys.
Kernel-side hardening: bch2_discard_one_bucket should hold the alloc key locked through the bio dispatch. Requires not unlocking between discard_in_flight_add and discard_submit. Will hurt concurrency but prevents the race.
Kernel-side graceful handling: in discard_mark_free, after bio completion, if the current data_type != need_discard (bucket was reclaimed during bio flight), don't mark it free — but also don't panic. Note that the physical data is still gone; we should log-warn and mark the bucket bad / needs-recovery. Not ideal but at least not a hard panic.
Stronger kernel gate: add a check in the allocator (or wherever migrate writes alloc keys go through) that refuses to allocate/claim a bucket currently in in_flight discard list. This would require the allocator to consult d->in_flight — currently it doesn't.

My recommendation: (1) is cleanest if migrate is doing something wrong. (2) hurts perf but is most defensive. (4) is the most principled kernel-side fix.

ec.device_remove_offline — partial analysis

The test checks ptr_to_removed_device fsck error count after device-remove. Expected 0, got 2. ptr_to_removed_device is flagged in fs/bcachefs/alloc/buckets.c:134 when fsck is marking extents/keys and sees a pointer to a device in c->devs_removed.d.

From the test log just before shutdown:

error retrying stripe: stripe_needs_block_evacuate
  u64s 23 type stripe 0:152:0 ...
      255:632832 gen 0#16       ← pointer to removed dev (id 255 = tombstone)
         vdf 4:308:0 gen 0#1536  ← actual block ptrs on surviving devs
         vdd 2:309:0 gen 0#2048
         vde 3:309:0 gen 0#2048
         vdc 1:309:0 gen 0#0

The stripe has 4 data blocks on vdf/vdd/vde/vdc (surviving devices) — those are fine. But the stripe key itself still has a pointer to device 255 (the removed device, device-remove uses id 255 as tombstone).

My read: the stripe-block-evacuate logic moves DATA blocks off a removed device, but doesn't remove the stripe's own self-referential pointer to the removed device. Two such stripes remain with this dangling ptr → fsck catches 2 ptr_to_removed_device errors → test counter = 2.

Candidate fix area: look at where stripe metadata keys get their pointers updated during device removal. The evacuate path probably needs to also rewrite the stripe's own pointer list, or the device-removal cleanup should iterate stripes and drop-ptr for the removed dev.

Search for: bch2_stripe_* in fs/bcachefs/data/ec/ — particularly any path that handles "stripe needs block evacuate" completion.

kill_btree_node — not dug into yet

fsck fixes errors first run, dry-run fsck (fsck -ny) reports errors still exist. Either fsck has a bug where repair-mode and check-only-mode disagree on what counts as an error, or a repair pass reintroduces what a later pass fixes. Needs more time than I have before compaction.

kill_btree_node — next to look at

Not-looking-at

generic/503 DIO lost wakeup — needs Kent's DIO code context
generic/585 rw-sem deadlock — needs runtime state
replicas_write_errors allocator hang — needs degraded-write accounting understanding
evacuate_errors data corruption — too deep
stress_ng KASAN in sysctl_sys_info_handler — upstream kernel bug, not bcachefs

Branch noise context

Failure counts across recent commits: 56, 61, 62, 64, 69, 74, 76. The f51f0a6 (permanent patch) sits at 74, within normal variance. No clear regression from the patch itself.

6.4 KiB Raw Blame History