# Discard Write Buffer Bug Investigation (2026-04-14)

## Symptom

Spurious "bucket incorrectly set in need_discard btree" errors during fsck:
the check code sees a need_discard key that should have been deleted.

## Key Data Points (from Kent's tracing)

- Write buffer flushed at seq 436
- need_discard DELETE was at seq 432
- After transaction restart, peek_slot STILL returns the old key

## Code Flow

### Check Code (alloc/check.c:167-179)

```c
bch2_btree_iter_set_pos(discard_iter,
	POS(a->v.journal_seq_empty, bucket_to_u64(alloc_k.k->p)));
k = bkey_try(bch2_btree_iter_peek_slot(discard_iter));

bool is_discarded = a->v.data_type == BCH_DATA_need_discard;
if (!!k.k->type != is_discarded) {
	try(bch2_btree_write_buffer_maybe_flush(trans, alloc_k, last_flushed));
	// After restart, should re-execute from function start with fresh data

	if (need_discard_or_freespace_err_on(...))
		// Log error and repair
}
```

### Trigger Code (alloc/background.c:1381-1386)

```c
if (statechange(a->data_type == BCH_DATA_need_discard) ||
    (old_a->data_type == BCH_DATA_need_discard &&
     old_a->journal_seq_empty != new_a->journal_seq_empty)) {
	try(bch2_bucket_do_discard_index(trans, old, old_a, false));   // DELETE
	try(bch2_bucket_do_discard_index(trans, new.s_c, new_a, true)); // SET (returns early if not need_discard)
}
```

## Ruled Out

1. **Iterator caching**: After `bch2_trans_begin`, paths are marked NEED_RELOCK;
   a subsequent peek_slot re-traverses and gets fresh data.

2. **Write buffer coalescing**: Keys at the same position are coalesced, with the
   later key winning. The DELETE at seq 432 would only be overwritten by a later
   SET at the same position.

3. **Position mismatch (simple case)**: The DELETE uses `old_a->journal_seq_empty`,
   the check uses the current `journal_seq_empty`. When transitioning out of
   need_discard without journal_seq_empty changing, these match.

4. **Journal fetch boundaries**: The flush at seq 436 uses `journal_cur_seq()` as
   max_seq, and iteration is `seq <= max_seq` (inclusive), so seq 432 is included.

5. **bch2_btree_bset_insert_key DELETE handling**: If the key exists, it's marked
   deleted; if it doesn't, the DELETE is a no-op. Neither explains seeing the key
   after a flush.

## Remaining Hypotheses

1. **Position mismatch (complex case)**: If journal_seq_empty changed between key
   creation and the DELETE, they'd be at different positions. The trigger handles
   this at lines 1382-1383, but there might be an edge case.

2. **Multiple keys**: Could there be multiple need_discard keys for the same bucket
   at different journal_seq_empty positions, with only some of them deleted?

3. **Write buffer key skipped**: Some condition in wb_flush_one could cause the key
   not to be applied to the btree.

4. **Btree node not visible**: Some caching or sequencing issue where the btree
   node modification isn't visible to the subsequent lookup.

## Recent Relevant Commit

```
fe43d8a0c1bb bcachefs: Reindex need_discard btree by journal seq
```

This changed the key format from `POS(dev_idx, bucket)` to
`POS(journal_seq_empty, bucket_to_u64(bucket))`, and is when the
write_buffer_maybe_flush call was added to the check code.

## Deeper Analysis (2026-04-14 continued)

### Write Buffer Flush Flow

1. `maybe_flush` calls `btree_write_buffer_flush_seq(trans, journal_cur_seq())`
2. This fetches keys from the journal up to max_seq via `fetch_wb_keys_from_journal`
3. Keys are sorted and deduplicated (later key wins), then flushed via `wb_flush_one`
4. Returns `transaction_restart_write_buffer_flush`
5. A second call with the same key returns 0 without flushing again

### Key Coalescing Logic (write_buffer.c:430-442)

When two keys at the same position are found during the sort:

- The earlier key (lower journal_seq) gets `journal_seq = 0` (skipped)
- The later key is kept and flushed
- The DELETE at seq 432 SHOULD overwrite a SET at an earlier seq

### DELETE Handling (commit.c:199-201)

```c
if (bkey_deleted(&insert->k) && !k)
	return false; // DELETE at empty position is no-op
```

A DELETE only removes an existing key; if the key doesn't exist in the btree,
the DELETE is a no-op.

### Still Unexplained

After flush+restart, `peek_slot` at `POS(journal_seq_empty, bucket)` still returns
the key. Either:

1. The DELETE was written to a different position than the lookup
2. The DELETE was skipped during the flush
3. A new SET was written after the DELETE
4. Something is preventing btree node modification visibility

### Current Debug Output

Kent added logging to show:

- The key value (`k`) when a mismatch is detected in check.c
- The journal seq and referring key (`alloc_k`) in maybe_flush

## Root Cause Identified (2026-04-14 evening)

Kent identified the actual root cause: **write buffer btrees have a synchronization
issue with journal replay**.

### The Problem

During journal replay, the fs is live, rw, and multithreaded. Other threads might
update a key that overwrites something journal replay hasn't replayed yet.

For **non-write-buffer btrees**, this is solved by marking the key in the journal
replay list as overwritten while holding the btree node write lock; the lock
provides the synchronization.

For **write buffer btrees**, there's no btree node lock at the right granularity:
the write buffer commit path doesn't hold a btree node lock.

### Why need_discard Can't Use the Previous Workaround

Previously: don't use the write buffer during journal replay, do normal btree
updates instead.

But `need_discard` MUST use the write buffer because:

1. Updates happen in the atomic trigger (holding a btree node write lock)
2. The journal seq isn't known until that point
3. A normal btree update can't be done while holding another node's write lock

### Fix Direction

The proper place for the check is transaction commit time, in
`bch2_drop_overwrites_from_journal()`.

We need better synchronization for `journal_key.overwritten` that doesn't rely on
the btree node lock. The challenge: new locks risk deadlocking against the
existing lock hierarchy.

Potential tool: `bch2_trans_mutex_lock()` integrates with transaction deadlock
detection and could protect the journal replay key list.

## Status

Root cause identified. Implementation of the fix pending.