consciousness/ci-triage-2026-04-20.md

# Bcachefs CI triage — 2026-04-20 autonomous session

Analysis of failures at `f51f0a6b1a26` (BTREE_NODE_permanent). 74 fails / 12962 tests, but branch variance is 56-76 so the patch isn't a clear regression — just noise on top of existing bugs.

## migrate_from_ext4 discard panic — root-cause hypothesis

**Assertion (fs/bcachefs/alloc/discard.c:159):**
```
Discarded bucket that is no longer BCH_DATA_need_discard!
bucket 0:36:0  data_type user  dirty_sectors 2016
need_discard 1  need_inc_gen 1
journal_seq_nonempty 95  journal_seq_empty 181
```

**Your commit c84503104e6a (Apr 18)** moved this check from recoverable (`bch2_fs_emergency_read_only`) to hard `panic()` and also moved `bch2_bucket_is_open_safe()` to AFTER locking the alloc key. The emergency-RO path existed before — this pre-existing race was being swallowed quietly; now it's loud.

**Race mechanism (hypothesis):**

1. `bch2_discard_one_bucket` reads alloc key, confirms `data_type == need_discard`
2. Calls `discard_in_flight_add(check=false)` to register in in_flight
3. **`bch2_trans_unlock(trans)` — releases btree lock** (line 313)
4. `discard_submit(ca, bucket, fastpath)` — physical bio dispatched, takes milliseconds
5. During bio flight: `migrate` tool writes an alloc key for bucket 36 with `data_type=user` (claiming it holds ext4 data). `NEED_DISCARD=1` flag remains because migrate doesn't clear it.
6. Bio completes → `discard_endio` → `discard_mark_free` re-reads alloc key → sees `data_type=user` → **panic**

**Why migrate bypasses the normal allocator gate:**

`bcachefs migrate` is an in-place ext4→bcachefs conversion. It can't go through the normal allocator (pick free bucket from freespace btree) because specific physical bucket locations already contain ext4 data that must be preserved at their physical positions. migrate writes alloc keys directly for the buckets ext4 was using.

Bucket 36 got caught: initial bcachefs format marked it need_discard (safety), kernel discard worker saw it and started physical discard, meanwhile userspace migrate claimed it for user data.

**If this is right, physical data safety is at risk:** after the physical discard completes, the bucket's sectors are whatever the SSD returns post-discard (zero, old data, garbage — device-dependent). migrate set alloc keys pointing at "user data" in those sectors. The data migrate wanted to preserve may already be GONE at that point.

**Candidate fixes (for Kent to evaluate):**

1. **Cleanest, but requires userspace change:** `bcachefs migrate` should either (a) format the new bcachefs without marking buckets need_discard (the data isn't deallocated, it's being claimed) OR (b) wait for pending discards to drain before writing any alloc keys.

2. **Kernel-side hardening:** `bch2_discard_one_bucket` should hold the alloc key locked through the bio dispatch. Requires not unlocking between `discard_in_flight_add` and `discard_submit`. Will hurt concurrency but prevents the race.

3. **Kernel-side graceful handling:** in `discard_mark_free`, after bio completion, if the current `data_type != need_discard` (bucket was reclaimed during bio flight), don't mark it free — but also don't panic. Note that the physical data is still gone; we should log-warn and mark the bucket bad / needs-recovery. Not ideal but at least not a hard panic.

4. **Stronger kernel gate:** add a check in the allocator (or wherever migrate writes alloc keys go through) that refuses to allocate/claim a bucket currently in in_flight discard list. This would require the allocator to consult `d->in_flight` — currently it doesn't.

My recommendation: (1) is cleanest if migrate is doing something wrong. (2) hurts perf but is most defensive. (4) is the most principled kernel-side fix.

## ec.device_remove_offline — partial analysis

The test checks `ptr_to_removed_device` fsck error count after device-remove. Expected 0, got 2. `ptr_to_removed_device` is flagged in `fs/bcachefs/alloc/buckets.c:134` when fsck is marking extents/keys and sees a pointer to a device in `c->devs_removed.d`.

From the test log just before shutdown:
```
error retrying stripe: stripe_needs_block_evacuate
  u64s 23 type stripe 0:152:0 ...
      255:632832 gen 0#16       ← pointer to removed dev (id 255 = tombstone)
         vdf 4:308:0 gen 0#1536  ← actual block ptrs on surviving devs
         vdd 2:309:0 gen 0#2048
         vde 3:309:0 gen 0#2048
         vdc 1:309:0 gen 0#0
```

The stripe has 4 data blocks on vdf/vdd/vde/vdc (surviving devices) — those are fine. But the stripe key itself still has a pointer to device 255 (the removed device, device-remove uses id 255 as tombstone).

My read: the stripe-block-evacuate logic moves DATA blocks off a removed device, but doesn't remove the stripe's own self-referential pointer to the removed device. Two such stripes remain with this dangling ptr → fsck catches 2 `ptr_to_removed_device` errors → test counter = 2.

Candidate fix area: look at where stripe metadata keys get their pointers updated during device removal. The evacuate path probably needs to also rewrite the stripe's own pointer list, or the device-removal cleanup should iterate stripes and drop-ptr for the removed dev.

Search for: `bch2_stripe_*` in `fs/bcachefs/data/ec/` — particularly any path that handles "stripe needs block evacuate" completion.

## kill_btree_node — not dug into yet

fsck fixes errors first run, dry-run fsck (`fsck -ny`) reports errors still exist. Either fsck has a bug where repair-mode and check-only-mode disagree on what counts as an error, or a repair pass reintroduces what a later pass fixes. Needs more time than I have before compaction.

## kill_btree_node — next to look at

fsck fixes errors first run, dry-run fsck (`fsck -ny`) reports errors still exist. Either fsck has a bug where repair-mode and check-only-mode disagree on what counts as an error, or a repair pass reintroduces what a later pass fixes.

## Not-looking-at

- `generic/503` DIO lost wakeup — needs Kent's DIO code context
- `generic/585` rw-sem deadlock — needs runtime state
- `replicas_write_errors` allocator hang — needs degraded-write accounting understanding
- `evacuate_errors` data corruption — too deep
- `stress_ng` KASAN in `sysctl_sys_info_handler` — upstream kernel bug, not bcachefs

## Branch noise context

Failure counts across recent commits: 56, 61, 62, 64, 69, 74, 76. The f51f0a6 (permanent patch) sits at 74, within normal variance. No clear regression from the patch itself.