consciousness/research/issue-1107-analysis.md

# Issue #1107 Analysis: kernel BUG at key_cache.c:475

## Summary
`BUG_ON` fires in `btree_key_cache_flush_pos()` during a degraded mount with 8 disks, while recovery is flushing the key cache.

## Timeline from dmesg
1. Unclean shutdown recovery begins
2. "journal bucket seqs not monotonic" on 5 devices
3. 22M journal keys replayed (29M read, 22M after compaction)
4. `check_allocations` finds buckets "missing in alloc btree"
5. Goes read-write
6. Repeated EC stripe read errors flood the log (`__ec_stripe_create: error reading stripe`)
7. **"btree node header doesn't match ptr: btree=alloc level=0"** - 9 times
8. BUG_ON at key_cache.c:475

## The Bug Location
```c
// key_cache.c:472-475
struct bkey_s_c btree_k = bkey_try(bch2_btree_iter_peek_slot(&b_iter));

/* Check that we're not violating cache coherency rules: */
BUG_ON(bkey_deleted(btree_k.k));
```

## What's Happening
`btree_key_cache_flush_pos()` flushes dirty key cache entries to the btree:
1. Creates two iterators: `b_iter` (btree), `c_iter` (key cache)
2. `b_iter.flags &= ~BTREE_ITER_with_key_cache` - bypass key cache for btree lookup
3. Looks up same position in btree with `bch2_btree_iter_peek_slot(&b_iter)`
4. Asserts the btree key is not deleted (cache coherency check)

**The invariant:** If we have a dirty key cache entry for position X, the btree must have a non-deleted key at X.
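The invariant can be illustrated with a small userspace model. This is a sketch only: `toy_key`, `toy_cache_entry`, `toy_btree_peek_slot()` and `toy_cache_coherent()` are hypothetical stand-ins invented for illustration, not bcachefs types or functions, and the "btree" is just a flat array of slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a btree key; "deleted" models what bkey_deleted() reports. */
struct toy_key {
	bool deleted;
	int  value;
};

/* Toy stand-in for a key cache entry at position "pos". */
struct toy_cache_entry {
	size_t pos;
	bool   dirty;
	int    value;
};

/*
 * Models a slot lookup that bypasses the key cache: a key is always
 * returned for the slot, but it may be a deleted (empty) one.
 */
static struct toy_key toy_btree_peek_slot(const struct toy_key *btree, size_t pos)
{
	assert(btree != NULL);
	return btree[pos];
}

/*
 * The invariant from the text: a dirty cache entry at position X
 * implies a non-deleted btree key at X. Clean entries are not checked.
 */
static bool toy_cache_coherent(const struct toy_key *btree, size_t nr,
			       const struct toy_cache_entry *ck)
{
	if (!ck->dirty)
		return true;
	if (ck->pos >= nr)
		return false;
	return !toy_btree_peek_slot(btree, ck->pos).deleted;
}
```

In this model, the crash in #1107 corresponds to `toy_cache_coherent()` returning false for a dirty entry, because the slot read back from the (corrupted) btree looks deleted.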

## Root Cause
The "btree node header doesn't match ptr" errors mean we are reading from wrong or corrupted alloc btree nodes. The topology error is detected by `btree_check_header()` -> `btree_bad_header()` -> `bch2_fs_topology_error()`, but execution continues. When the key cache flush then looks up the position, the corrupted btree returns wrong data (a deleted key), and the coherency check fires.

## Why It's a Problem
- The topology error is logged but doesn't prevent further operations
- The subsequent BUG_ON doesn't know about the earlier corruption
- Result: kernel panic instead of graceful degradation
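The gap between the two events can be sketched in userspace C. Everything here is hypothetical and for illustration only: `toy_fs`, its `topology_errors` counter, `toy_topology_error()` and `toy_classify()` are invented names, and whether bcachefs should track or consult such state is exactly the open question. The sketch only shows that, if the earlier corruption were recorded, a later failed coherency check could be told apart from a genuine key cache bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filesystem state: how many topology errors were seen. */
struct toy_fs {
	unsigned topology_errors;
};

enum toy_verdict {
	TOY_COHERENT,			/* btree key present, nothing to report */
	TOY_KNOWN_CORRUPTION,		/* key missing, corruption already seen */
	TOY_UNEXPECTED_INCOHERENCY,	/* key missing, no earlier corruption */
};

/* Models the "log and continue" path: the error is recorded, nothing stops. */
static void toy_topology_error(struct toy_fs *c)
{
	assert(c != NULL);
	c->topology_errors++;
}

/* Classify a coherency check result using the recorded corruption state. */
static enum toy_verdict toy_classify(const struct toy_fs *c, bool btree_key_deleted)
{
	if (!btree_key_deleted)
		return TOY_COHERENT;

	return c->topology_errors
		? TOY_KNOWN_CORRUPTION
		: TOY_UNEXPECTED_INCOHERENCY;
}
```

With nine topology errors recorded, as in the dmesg timeline, a deleted key at flush time would be classified as known corruption rather than as an unexplained invariant violation.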

## Call Stack
```
btree_key_cache_flush_pos+0x643/0x650
bch2_btree_key_cache_journal_flush+0x147/0x2a0
journal_flush_pins+0x1f5/0x3d0
journal_flush_done+0x66/0x270
bch2_journal_flush_pins+0xbc/0xf0
__bch2_fs_recovery+0x8ae/0xcb0
bch2_fs_recovery+0x28/0xb0
__bch2_fs_start+0x32c/0x5b0
...
```

## Potential Fix Direction
Convert the `BUG_ON` to an error return. The caller already handles errors:
```c
// key_cache.c:557-560
ret = lockrestart_do(trans, btree_key_cache_flush_pos(...));
bch2_fs_fatal_err_on(ret &&
    !bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
    !bch2_journal_error(j), c,
    "flushing key cache: %s", bch2_err_str(ret));
```

So an error return would still cause a fatal error, but:
1. Controlled shutdown instead of kernel panic
2. Clearer error message
3. Filesystem goes to emergency read-only instead of crashing
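A minimal userspace sketch of that control flow, under stated assumptions: `toy_fs`, `toy_flush_pos()` and `toy_journal_flush()` are hypothetical stand-ins, `TOY_ERR_btree_key_cache_coherency` mirrors the error code name floated in the questions below rather than an existing one, and setting `emergency_ro` is only a rough model of what a fatal error does. It is not the actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical error code, named after the one proposed in this note. */
enum toy_err {
	TOY_OK = 0,
	TOY_ERR_btree_key_cache_coherency = -1,
};

/* Toy filesystem state: only tracks whether we went emergency read-only. */
struct toy_fs {
	bool emergency_ro;
};

/* Models the flush: the former BUG_ON condition becomes an error return. */
static int toy_flush_pos(bool btree_key_deleted)
{
	if (btree_key_deleted)
		return TOY_ERR_btree_key_cache_coherency;

	/* ...the real function would write the cached key to the btree here... */
	return TOY_OK;
}

/* Models the caller: any flush error is fatal, but handled, not a panic. */
static int toy_journal_flush(struct toy_fs *c, bool btree_key_deleted)
{
	int ret;

	assert(c != NULL);

	ret = toy_flush_pos(btree_key_deleted);
	if (ret) {
		fprintf(stderr, "flushing key cache: error %d\n", ret);
		c->emergency_ro = true;
	}
	return ret;
}
```

The point of the sketch is that the failing case returns to the caller with the machine still running, so the caller decides how to shut the filesystem down.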

## Questions for Kent
1. Is there a scenario where this BUG_ON could fire during normal operation (not corruption)?
2. Should we add a new error code like `BCH_ERR_btree_key_cache_coherency` or use an existing one?
3. Should the topology error detection prevent operations that depend on btree correctness?

## Related Issues
- #1108: Allocator stuck during journal replay (similar recovery scenario)
- #1105: Allocator stuck on asymmetric multi-device filesystem