Bcachefs CI triage — 2026-04-20 autonomous session
Analysis of failures at f51f0a6b1a26 (BTREE_NODE_permanent). 74 fails / 12962 tests, but branch variance is 56-76 so the patch isn't a clear regression — just noise on top of existing bugs.
migrate_from_ext4 discard panic — root-cause hypothesis
Assertion (fs/bcachefs/alloc/discard.c:159):
Discarded bucket that is no longer BCH_DATA_need_discard!
bucket 0:36:0 data_type user dirty_sectors 2016
need_discard 1 need_inc_gen 1
journal_seq_nonempty 95 journal_seq_empty 181
Your commit c84503104e6a (Apr 18) changed this check from a recoverable error (bch2_fs_emergency_read_only) to a hard panic(), and also moved bch2_bucket_is_open_safe() to after the alloc key is locked. So the race likely predates that commit: the emergency-RO path was swallowing it quietly, and now it's loud.
Race mechanism (hypothesis):
1. bch2_discard_one_bucket reads the alloc key and confirms data_type == need_discard.
2. It calls discard_in_flight_add(check=false) to register the bucket in in_flight.
3. bch2_trans_unlock(trans) releases the btree lock (line 313).
4. discard_submit(ca, bucket, fastpath) dispatches the physical bio, which takes milliseconds.
5. During bio flight, the migrate tool writes an alloc key for bucket 36 with data_type=user (claiming it holds ext4 data). The NEED_DISCARD=1 flag remains because migrate doesn't clear it.
6. The bio completes: discard_endio → discard_mark_free re-reads the alloc key, sees data_type=user, and panics.
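The interleaving above can be replayed deterministically in a toy model. This is only a sketch of the hypothesis: every sim_* name is hypothetical, nothing here is bcachefs code, and the single struct stands in for the alloc key of bucket 36.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical types, not the bcachefs structures. */
enum sim_data_type { SIM_FREE, SIM_NEED_DISCARD, SIM_USER };

struct sim_alloc_key {
	enum sim_data_type data_type;
	bool need_discard_flag;
	unsigned dirty_sectors;
};

/* Step 1: the discard worker checks the key while it still holds the lock. */
bool sim_discard_begin(const struct sim_alloc_key *k)
{
	return k->data_type == SIM_NEED_DISCARD;
}

/* Step 5: migrate claims the bucket during bio flight, leaving the flag set. */
void sim_migrate_claim(struct sim_alloc_key *k, unsigned sectors)
{
	k->data_type = SIM_USER;
	k->dirty_sectors = sectors;
	/* need_discard_flag is deliberately left untouched. */
}

/* Step 6: completion re-reads the key; false is the condition that panics. */
bool sim_discard_complete_ok(const struct sim_alloc_key *k)
{
	return k->data_type == SIM_NEED_DISCARD;
}

/* Replay the sequence; returns true if the completion check fails. */
bool sim_replay_race(bool migrate_interleaves)
{
	struct sim_alloc_key k = { SIM_NEED_DISCARD, true, 0 };

	if (!sim_discard_begin(&k))
		return false;
	/* Steps 2-4: lock dropped here, bio in flight. */
	if (migrate_interleaves)
		sim_migrate_claim(&k, 2016);
	return !sim_discard_complete_ok(&k);
}
```

sim_replay_race(true) reproduces the mismatch from the assertion output (data_type user, need_discard still 1); sim_replay_race(false) completes cleanly. The point of the model is that the only window is between the unlock and the completion re-read.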
Why migrate bypasses the normal allocator gate:
bcachefs migrate is an in-place ext4→bcachefs conversion. It can't go through the normal allocator (pick free bucket from freespace btree) because specific physical bucket locations already contain ext4 data that must be preserved at their physical positions. migrate writes alloc keys directly for the buckets ext4 was using.
Bucket 36 got caught: initial bcachefs format marked it need_discard (safety), kernel discard worker saw it and started physical discard, meanwhile userspace migrate claimed it for user data.
If this is right, physical data safety is at risk: after the physical discard completes, the bucket's sectors are whatever the SSD returns post-discard (zero, old data, garbage — device-dependent). migrate set alloc keys pointing at "user data" in those sectors. The data migrate wanted to preserve may already be GONE at that point.
Candidate fixes (for Kent to evaluate):
1. Cleanest, but requires userspace change: bcachefs migrate should either (a) format the new bcachefs without marking buckets need_discard (the data isn't deallocated, it's being claimed) OR (b) wait for pending discards to drain before writing any alloc keys.
2. Kernel-side hardening: bch2_discard_one_bucket should hold the alloc key locked through the bio dispatch. Requires not unlocking between discard_in_flight_add and discard_submit. Will hurt concurrency but prevents the race.
3. Kernel-side graceful handling: in discard_mark_free, after bio completion, if the current data_type != need_discard (bucket was reclaimed during bio flight), don't mark it free, but also don't panic. Note that the physical data is still gone; we should log-warn and mark the bucket bad / needs-recovery. Not ideal but at least not a hard panic.
4. Stronger kernel gate: add a check in the allocator (or wherever migrate's alloc key writes go through) that refuses to allocate/claim a bucket currently on the in_flight discard list. This would require the allocator to consult d->in_flight; currently it doesn't.
My recommendation: (1) is cleanest if migrate is doing something wrong. (2) hurts perf but is most defensive. (4) is the most principled kernel-side fix.
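A rough sketch of what options (3) and (4) amount to, again as a toy model. All fix_* names, the fixed-size array, and the return codes are assumptions for illustration; the real in_flight structure and the real claim path may look nothing like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIX_MAX_IN_FLIGHT 8

/* Toy model only: hypothetical types, not the bcachefs structures. */
enum fix_data_type { FIX_FREE, FIX_NEED_DISCARD, FIX_USER };
enum fix_endio_result { FIX_MARKED_FREE, FIX_NEEDS_RECOVERY };

struct fix_bucket {
	unsigned long nr;
	enum fix_data_type data_type;
};

struct fix_in_flight {
	unsigned long buckets[FIX_MAX_IN_FLIGHT];
	size_t nr;
};

/* Register a bucket whose discard bio is about to be submitted. */
bool fix_in_flight_add(struct fix_in_flight *f, unsigned long bucket)
{
	if (f->nr == FIX_MAX_IN_FLIGHT)
		return false;
	f->buckets[f->nr++] = bucket;
	return true;
}

bool fix_in_flight_contains(const struct fix_in_flight *f, unsigned long bucket)
{
	for (size_t i = 0; i < f->nr; i++)
		if (f->buckets[i] == bucket)
			return true;
	return false;
}

/* Option (4): refuse to claim a bucket whose discard is still in flight. */
bool fix_claim_bucket(const struct fix_in_flight *f, struct fix_bucket *b)
{
	if (fix_in_flight_contains(f, b->nr))
		return false;
	b->data_type = FIX_USER;
	return true;
}

/* Option (3): on completion, degrade to needs-recovery instead of panicking. */
enum fix_endio_result fix_discard_mark_free(struct fix_bucket *b)
{
	if (b->data_type != FIX_NEED_DISCARD)
		return FIX_NEEDS_RECOVERY;	/* data is gone, but no hard panic */
	b->data_type = FIX_FREE;
	return FIX_MARKED_FREE;
}
```

With the gate in place, a claim on bucket 36 while it is in flight is refused and the completion path marks it free as usual. Without the gate (a bucket that was never registered), the claim succeeds and the completion path returns the needs-recovery result rather than panicking.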
ec.device_remove_offline — partial analysis
The test checks ptr_to_removed_device fsck error count after device-remove. Expected 0, got 2. ptr_to_removed_device is flagged in fs/bcachefs/alloc/buckets.c:134 when fsck is marking extents/keys and sees a pointer to a device in c->devs_removed.d.
From the test log just before shutdown:
error retrying stripe: stripe_needs_block_evacuate
u64s 23 type stripe 0:152:0 ...
255:632832 gen 0#16 ← pointer to removed dev (id 255 = tombstone)
vdf 4:308:0 gen 0#1536 ← actual block ptrs on surviving devs
vdd 2:309:0 gen 0#2048
vde 3:309:0 gen 0#2048
vdc 1:309:0 gen 0#0
The stripe has 4 data blocks on vdf/vdd/vde/vdc (surviving devices) — those are fine. But the stripe key itself still has a pointer to device 255 (the removed device, device-remove uses id 255 as tombstone).
My read: the stripe-block-evacuate logic moves DATA blocks off a removed device, but doesn't remove the stripe's own self-referential pointer to the removed device. Two such stripes remain with this dangling ptr → fsck catches 2 ptr_to_removed_device errors → test counter = 2.
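To make the counting concrete, here is a toy version of that read. The ec_* names and the flat pointer array are hypothetical, and the device/offset values are simplified from the log above. One caveat: real stripe pointers may be positional (one per block), in which case compacting the list as ec_drop_removed_ptrs does would not be the right repair; it only illustrates the "drop-ptr for the removed dev" idea.

```c
#include <assert.h>
#include <stddef.h>

/* Per the notes: device id 255 is the tombstone left by device-remove. */
#define EC_TOMBSTONE_DEV 255u

/* Toy pointer: hypothetical, not the on-disk bcachefs pointer format. */
struct ec_ptr {
	unsigned dev;
	unsigned long long offset;
};

/* Count pointers in one stripe key that still reference the tombstone. */
size_t ec_count_removed_ptrs(const struct ec_ptr *ptrs, size_t nr)
{
	size_t n = 0;

	for (size_t i = 0; i < nr; i++)
		if (ptrs[i].dev == EC_TOMBSTONE_DEV)
			n++;
	return n;
}

/* Drop tombstone pointers in place; returns the new pointer count. */
size_t ec_drop_removed_ptrs(struct ec_ptr *ptrs, size_t nr)
{
	size_t out = 0;

	for (size_t i = 0; i < nr; i++)
		if (ptrs[i].dev != EC_TOMBSTONE_DEV)
			ptrs[out++] = ptrs[i];
	return out;
}
```

A stripe shaped like the one in the log (one pointer to dev 255 plus four on surviving devices) counts as one error, so two such stripes give the observed total of 2. After the drop pass the count is 0.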
Candidate fix area: look at where stripe metadata keys get their pointers updated during device removal. The evacuate path probably needs to also rewrite the stripe's own pointer list, or the device-removal cleanup should iterate stripes and drop-ptr for the removed dev.
Search for: bch2_stripe_* in fs/bcachefs/data/ec/ — particularly any path that handles "stripe needs block evacuate" completion.
kill_btree_node — not dug into yet
fsck fixes errors on the first run, but a dry-run fsck (fsck -ny) afterwards reports that errors still exist. Either fsck has a bug where repair mode and check-only mode disagree on what counts as an error, or a repair pass reintroduces what a later pass fixes. Needs more time than I have before compaction.
Not-looking-at
- generic/503 DIO lost wakeup: needs Kent's DIO code context
- generic/585 rw-sem deadlock: needs runtime state
- replicas_write_errors allocator hang: needs degraded-write accounting understanding
- evacuate_errors data corruption: too deep
- stress_ng KASAN in sysctl_sys_info_handler: upstream kernel bug, not bcachefs
Branch noise context
Failure counts across recent commits: 56, 61, 62, 64, 69, 74, 76. The f51f0a6 (permanent patch) sits at 74, within normal variance. No clear regression from the patch itself.
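As a quick sanity check on "within normal variance", the arithmetic can be written down. The ci_* helpers are illustrative only, and the sample is just the seven counts listed above (which include the 74 itself, so this is a rough check rather than a proper test). The mean is 66 and the sample variance is 53 (standard deviation about 7.3), so 74 sits 8 above the mean, roughly 1.1 standard deviations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arithmetic mean of n failure counts; requires n >= 1. */
double ci_mean(const int *v, size_t n)
{
	double sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Sample variance (n - 1 denominator); requires n >= 2. */
double ci_sample_variance(const int *v, size_t n)
{
	double m = ci_mean(v, n), acc = 0;

	for (size_t i = 0; i < n; i++)
		acc += (v[i] - m) * (v[i] - m);
	return acc / (double)(n - 1);
}

/*
 * True if x lies within k sample standard deviations of the mean.
 * Compares squared quantities so no sqrt() is needed.
 */
bool ci_within_k_sigma(const int *v, size_t n, int x, double k)
{
	double d = x - ci_mean(v, n);

	return d * d <= k * k * ci_sample_variance(v, n);
}
```

For the counts above, 74 is inside two standard deviations but outside one, which fits "noise, not a clear regression" without being strong evidence either way given only seven samples.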