# Issue #1107 Analysis: kernel BUG at key_cache.c:475
## Summary

A BUG_ON fires during a degraded mount of an 8-disk filesystem while recovery flushes the key cache.
## Timeline from dmesg

1. Unclean shutdown recovery begins
2. "journal bucket seqs not monotonic" on 5 devices
3. 22M journal keys replayed (29M read, 22M after compaction)
4. `check_allocations` finds buckets "missing in alloc btree"
5. Filesystem goes read-write
6. EC stripe read errors spam the log (`__ec_stripe_create: error reading stripe`)
7. **"btree node header doesn't match ptr: btree=alloc level=0"** logged 9 times
8. BUG_ON at key_cache.c:475
## The Bug Location

```c
// key_cache.c:472-475
struct bkey_s_c btree_k = bkey_try(bch2_btree_iter_peek_slot(&b_iter));

/* Check that we're not violating cache coherency rules: */
BUG_ON(bkey_deleted(btree_k.k));
```
## What's Happening

`btree_key_cache_flush_pos()` flushes dirty key cache entries to the btree:

1. Creates two iterators: `b_iter` (btree) and `c_iter` (key cache)
2. Clears `BTREE_ITER_with_key_cache` on `b_iter` so the btree lookup bypasses the key cache
3. Looks up the same position in the btree with `bch2_btree_iter_peek_slot(&b_iter)`
4. Asserts the btree key is not deleted (the cache coherency check)

**The invariant:** if there is a dirty key cache entry for position X, the btree must hold a non-deleted key at X.
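The check can be illustrated with a small userspace mock. Everything below (structs, helper names, the array-backed "btree") is a simplified stand-in for illustration, not real bcachefs code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for a bkey: position plus deleted flag. */
struct mock_bkey {
	unsigned long long pos;
	bool deleted;
};

/* Stand-in for bch2_btree_iter_peek_slot(): find the key at pos
 * in a flat array standing in for the btree. */
static const struct mock_bkey *btree_peek_slot(const struct mock_bkey *btree,
					       size_t n, unsigned long long pos)
{
	for (size_t i = 0; i < n; i++)
		if (btree[i].pos == pos)
			return &btree[i];
	return NULL;
}

/* The invariant from above: a dirty key cache entry at pos X implies
 * a non-deleted btree key at X.  Returns true when the invariant
 * holds, false in the case where the real BUG_ON would fire. */
static bool coherency_ok(const struct mock_bkey *btree, size_t n,
			 unsigned long long dirty_pos)
{
	const struct mock_bkey *k = btree_peek_slot(btree, n, dirty_pos);

	return k && !k->deleted;
}
```

In the crash scenario, the corrupt btree effectively hands back a deleted key for a position the key cache believes is live, so `coherency_ok` would be false.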
## Root Cause

The btree corruption ("btree node header doesn't match ptr") means reads are hitting wrong or corrupted btree nodes. The topology error is detected via `btree_check_header()` -> `btree_bad_header()` -> `bch2_fs_topology_error()`, but execution continues. When the key cache flush later looks up the position, the corrupted btree returns wrong data (a deleted key).
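The logged-but-not-enforced pattern can be sketched in a userspace mock (every name below is a simplified stand-in; the real detection and read paths are far more involved):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the filesystem: one flag standing in for the state
 * that bch2_fs_topology_error() records. */
struct mock_fs {
	bool topology_error;
};

/* Stand-in for btree_bad_header(): note the error and keep going. */
static void note_topology_error(struct mock_fs *fs)
{
	fs->topology_error = true;	/* logged, but nothing stops */
}

/* Current behavior: the read path never consults the flag, so data
 * from a corrupt node flows onward to callers like the flush. */
static int read_key_ungated(struct mock_fs *fs, int corrupt_value)
{
	(void)fs;
	return corrupt_value;		/* garbage propagates */
}

/* A gated variant (one possible direction, per question 3 below)
 * would refuse once topology is known-bad. */
static int read_key_gated(struct mock_fs *fs, int corrupt_value)
{
	if (fs->topology_error)
		return -1;		/* fail fast instead */
	return corrupt_value;
}
```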
## Why It's a Problem

- The topology error is logged but doesn't prevent further operations
- The subsequent BUG_ON has no knowledge of the earlier corruption
- Result: kernel panic instead of graceful degradation
## Call Stack

```
btree_key_cache_flush_pos+0x643/0x650
bch2_btree_key_cache_journal_flush+0x147/0x2a0
journal_flush_pins+0x1f5/0x3d0
journal_flush_done+0x66/0x270
bch2_journal_flush_pins+0xbc/0xf0
__bch2_fs_recovery+0x8ae/0xcb0
bch2_fs_recovery+0x28/0xb0
__bch2_fs_start+0x32c/0x5b0
...
```
## Potential Fix Direction

Convert the BUG_ON to an error return. The caller already handles errors:

```c
// key_cache.c:557-560
ret = lockrestart_do(trans, btree_key_cache_flush_pos(...));

bch2_fs_fatal_err_on(ret &&
		     !bch2_err_matches(ret, BCH_ERR_journal_reclaim_would_deadlock) &&
		     !bch2_journal_error(j), c,
		     "flushing key cache: %s", bch2_err_str(ret));
```
So an error return would still surface a fatal error, but:

1. Controlled shutdown instead of a kernel panic
2. A clearer error message
3. The filesystem goes emergency read-only instead of crashing
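A minimal sketch of the conversion, as a userspace mock with a hypothetical error code standing in for something like `BCH_ERR_btree_key_cache_coherency` (the actual error code choice is question 2 below):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical error code; a real patch would define a bch_errcode. */
#define ECOHERENCY 1001

/* Simplified stand-in for the bkey seen by the flush path. */
struct mock_bkey {
	bool deleted;
};

/* Before: BUG_ON(bkey_deleted(btree_k.k)) panics the kernel.
 * After: report the broken invariant as an error, which the existing
 * bch2_fs_fatal_err_on() in the caller turns into a controlled
 * fatal error and emergency read-only. */
static int flush_pos_check(const struct mock_bkey *btree_k)
{
	if (btree_k->deleted)
		return -ECOHERENCY;
	return 0;
}
```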
## Questions for Kent

1. Is there a scenario where this BUG_ON could fire during normal operation (not corruption)?
2. Should we add a new error code like `BCH_ERR_btree_key_cache_coherency` or use an existing one?
3. Should the topology error detection prevent operations that depend on btree correctness?
## Related Issues

- #1108: Allocator stuck during journal replay (similar recovery scenario)
- #1105: Allocator stuck on an asymmetric multi-device filesystem