Remove find_context_files — identity comes from memory nodes

Deleted the directory-walking CLAUDE.md/POC.md loader. Identity now
comes entirely from personality_nodes in the memory graph.

Simplified:
- assemble_context_message() takes just personality_nodes
- Removed config_file_count/memory_file_count tracking
- reload_for_model() → reload_context() (no longer model-specific)

Co-Authored-By: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

# Discard Write Buffer Bug Investigation (2026-04-14)
## Symptom
Spurious "bucket incorrectly set in need_discard btree" errors during fsck.
The check code sees a need_discard key that should have been deleted.
## Key Data Points (from Kent's tracing)
- Write buffer flushed at seq 436
- need_discard DELETE was at seq 432
- After transaction restart, peek_slot STILL returns the old key
## Code Flow
### Check Code (alloc/check.c:167-179)
```c
bch2_btree_iter_set_pos(discard_iter,
                        POS(a->v.journal_seq_empty, bucket_to_u64(alloc_k.k->p)));
k = bkey_try(bch2_btree_iter_peek_slot(discard_iter));

bool is_discarded = a->v.data_type == BCH_DATA_need_discard;
if (!!k.k->type != is_discarded) {
    try(bch2_btree_write_buffer_maybe_flush(trans, alloc_k, last_flushed));
    // After restart, should re-execute from function start with fresh data
    if (need_discard_or_freespace_err_on(...))
        // Log error and repair
}
```
### Trigger Code (alloc/background.c:1381-1386)
```c
if (statechange(a->data_type == BCH_DATA_need_discard) ||
    (old_a->data_type == BCH_DATA_need_discard &&
     old_a->journal_seq_empty != new_a->journal_seq_empty)) {
    try(bch2_bucket_do_discard_index(trans, old, old_a, false));     // DELETE
    try(bch2_bucket_do_discard_index(trans, new.s_c, new_a, true));  // SET (returns early if not need_discard)
}
```
## Ruled Out
1. **Iterator caching**: After `bch2_trans_begin`, paths are marked NEED_RELOCK,
so a subsequent peek_slot re-traverses and sees fresh data.
2. **Write buffer coalescing**: Keys at the same position are coalesced, with the later key winning.
The DELETE at seq 432 would only be overwritten by a later SET at the same position.
3. **Position mismatch (simple case)**: The DELETE uses `old_a->journal_seq_empty`, the check
uses the current `journal_seq_empty`. When transitioning out of need_discard
without `journal_seq_empty` changing, these match (see the sketch after this list).
4. **Journal fetch boundaries**: The flush at seq 436 uses `journal_cur_seq()` as max_seq, and
iteration is `seq <= max_seq` (inclusive), so seq 432 is included.
5. **bch2_btree_bset_insert_key DELETE handling**: If the key exists, it is marked deleted;
if it doesn't exist, the DELETE is a no-op. Neither explains seeing the key after a flush.
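To make the two position cases concrete, here is a small standalone sketch (not bcachefs code; the bucket packing in `bucket_u64()` is a simplifying assumption) of how need_discard keys are addressed by `(journal_seq_empty, bucket)` and why a DELETE built from a stale `journal_seq_empty` would miss a key indexed under the current value:
```c
/*
 * Standalone sketch, not bcachefs code: need_discard keys are addressed
 * by (journal_seq_empty, packed bucket), and a DELETE only removes a key
 * at exactly the position it names. The packing below is illustrative.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct pos { uint64_t inode, offset; };     /* models struct bpos */

/* hypothetical packing of (dev, bucket) into one u64 */
static uint64_t bucket_u64(uint64_t dev, uint64_t bucket)
{
    return (dev << 48) | bucket;
}

static struct pos need_discard_pos(uint64_t journal_seq_empty,
                                   uint64_t dev, uint64_t bucket)
{
    return (struct pos) { journal_seq_empty, bucket_u64(dev, bucket) };
}

static bool pos_eq(struct pos a, struct pos b)
{
    return a.inode == b.inode && a.offset == b.offset;
}

int main(void)
{
    uint64_t dev = 0, bucket = 1234;

    /* simple case: journal_seq_empty unchanged, the DELETE hits the key */
    struct pos key = need_discard_pos(100, dev, bucket);
    struct pos del = need_discard_pos(100, dev, bucket);
    printf("simple case:  delete hits key? %s\n", pos_eq(del, key) ? "yes" : "no");

    /* complex case: the key now lives at journal_seq_empty 107, but the
     * DELETE was built from the stale value 100, so it misses */
    struct pos key2 = need_discard_pos(107, dev, bucket);
    struct pos del2 = need_discard_pos(100, dev, bucket);
    printf("complex case: delete hits key? %s\n",
           pos_eq(del2, key2) ? "yes" : "no (stale key survives)");
    return 0;
}
```
The surviving "complex case" key is exactly what remaining hypothesis 1 below is probing.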
## Remaining Hypotheses
1. **Position mismatch (complex case)**: If journal_seq_empty changed between
key creation and the DELETE, they'd be at different positions. The trigger
handles this at lines 1382-1383, but there might be an edge case.
2. **Multiple keys**: Could there be multiple need_discard keys for the same bucket
at different journal_seq_empty positions, with only some being deleted?
3. **Write buffer key skipped**: Some condition in wb_flush_one causing the key
to not be applied to the btree.
4. **Btree node not visible**: Some caching or sequencing issue where the btree
node modification isn't visible to the subsequent lookup.
## Recent Relevant Commit
```
fe43d8a0c1bb bcachefs: Reindex need_discard btree by journal seq
```
Changed key format from `POS(dev_idx, bucket)` to `POS(journal_seq_empty, bucket_to_u64(bucket))`.
This is when the write_buffer_maybe_flush was added to the check code.
## Deeper Analysis (2026-04-14 continued)
### Write Buffer Flush Flow
1. `maybe_flush` calls `btree_write_buffer_flush_seq(trans, journal_cur_seq())`
2. This fetches keys from journal up to max_seq via `fetch_wb_keys_from_journal`
3. Keys are sorted, deduplicated (later key wins), then flushed via `wb_flush_one`
4. Returns `transaction_restart_write_buffer_flush`
5. A second call with the same referring key returns 0 without flushing again (this idiom is modeled in the sketch below)
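The flush-once-then-report pattern in steps 4-5 can be modeled with a short standalone sketch (not bcachefs code; the names and the restart constant are stand-ins for `bch2_btree_write_buffer_maybe_flush()` and `transaction_restart_write_buffer_flush`):
```c
/*
 * Standalone model, not bcachefs code: flush at most once per referring
 * key, signal a restart the first time, and return 0 on the retry so the
 * caller goes on to report/repair the inconsistency.
 */
#include <stdio.h>
#include <string.h>

#define RESTART_WRITE_BUFFER_FLUSH 1    /* stands in for the restart code */

struct key { char desc[32]; };          /* stands in for the referring bkey (alloc_k) */

static int maybe_flush(struct key referring, struct key *last_flushed)
{
    if (!strcmp(referring.desc, last_flushed->desc))
        return 0;                       /* already flushed for this key: no-op */

    *last_flushed = referring;          /* remember it, then flush and restart */
    printf("flushing write buffer up to current journal seq\n");
    return RESTART_WRITE_BUFFER_FLUSH;
}

int main(void)
{
    struct key alloc_k = { "alloc:bucket 1234" };
    struct key last_flushed = { "" };

    /* first mismatch: flush and restart the transaction */
    if (maybe_flush(alloc_k, &last_flushed) == RESTART_WRITE_BUFFER_FLUSH)
        printf("transaction restart, re-run the check\n");

    /* after the restart the mismatch persists: the second call is a no-op,
     * so the check proceeds to report the error */
    if (!maybe_flush(alloc_k, &last_flushed))
        printf("still mismatched after flush -> report error\n");
    return 0;
}
```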
### Key Coalescing Logic (write_buffer.c:430-442)
When two keys at the same position are found during the sort:
- The earlier key (lower journal_seq) gets `journal_seq = 0` and is skipped
- The later key is kept and flushed
- So the DELETE at seq 432 SHOULD overwrite a SET at an earlier seq (modeled in the sketch below)
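A standalone model of the fetch-and-coalesce step (not the write_buffer.c code; positions, seqs, and field names are illustrative), showing a DELETE at seq 432 winning over an earlier SET at the same position:
```c
/*
 * Standalone model, not bcachefs code: collect keys with
 * journal_seq <= max_seq, sort by (pos, seq), and when two keys share a
 * position zero the earlier one's journal_seq so only the later key
 * (here the DELETE at seq 432) is applied.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct wb_key {
    uint64_t pos;           /* stands in for the btree position */
    uint64_t journal_seq;   /* 0 means "skip during flush" */
    int      is_delete;     /* 1 = DELETE/whiteout, 0 = SET */
};

static int cmp_pos_then_seq(const void *_a, const void *_b)
{
    const struct wb_key *a = _a, *b = _b;

    if (a->pos != b->pos)
        return a->pos < b->pos ? -1 : 1;
    return a->journal_seq < b->journal_seq ? -1 :
           a->journal_seq > b->journal_seq ? 1 : 0;
}

int main(void)
{
    uint64_t max_seq = 436;             /* journal_cur_seq() at flush time */
    struct wb_key keys[] = {
        { .pos = 100, .journal_seq = 430, .is_delete = 0 },    /* SET */
        { .pos = 100, .journal_seq = 432, .is_delete = 1 },    /* DELETE */
        { .pos = 200, .journal_seq = 435, .is_delete = 0 },
    };
    size_t nr = sizeof(keys) / sizeof(keys[0]), i;

    qsort(keys, nr, sizeof(keys[0]), cmp_pos_then_seq);

    /* coalesce: the earlier key at the same position is marked skipped */
    for (i = 0; i + 1 < nr; i++)
        if (keys[i].pos == keys[i + 1].pos)
            keys[i].journal_seq = 0;

    for (i = 0; i < nr; i++) {
        if (!keys[i].journal_seq || keys[i].journal_seq > max_seq)
            continue;                   /* skipped, or newer than this flush */
        printf("flush pos %llu seq %llu %s\n",
               (unsigned long long) keys[i].pos,
               (unsigned long long) keys[i].journal_seq,
               keys[i].is_delete ? "DELETE" : "SET");
    }
    return 0;
}
```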
### DELETE Handling (commit.c:199-201)
```c
if (bkey_deleted(&insert->k) && !k)
    return false; // DELETE at empty position is no-op
```
DELETE only removes an existing key. If key doesn't exist in btree, DELETE is no-op.
### Still Unexplained
After flush+restart, `peek_slot` at `POS(journal_seq_empty, bucket)` still returns the key.
Either:
1. DELETE was written to different position than lookup
2. DELETE was skipped during flush
3. A new SET was written after the DELETE
4. Something preventing btree node modification visibility
### Current Debug Output
Kent added logging to show:
- Key value (`k`) when mismatch detected in check.c
- Journal seq and referring key (`alloc_k`) in maybe_flush
## Root Cause Identified (2026-04-14 evening)
Kent identified the actual root cause: **write buffer btrees have a synchronization
issue with journal replay**.
### The Problem
During journal replay, the fs is live, rw, and multithreaded. Other threads might
update a key that overwrites something journal replay hasn't replayed yet.
For **non-write-buffer btrees**, this is solved by marking the key in the journal
replay list as overwritten while holding the btree node write lock. The lock
provides synchronization.
For **write buffer btrees**, there's no btree node lock at the right granularity.
The write buffer commit path doesn't hold a btree node lock.
### Why need_discard Can't Use the Previous Workaround
Previously: don't use write buffer during journal replay, do normal btree updates.
But `need_discard` MUST use the write buffer because:
1. Updates happen in the atomic trigger (holding btree node write lock)
2. Journal seq isn't known until that point
3. Can't do a normal btree update while holding another node's write lock
### Fix Direction
The proper place for the check is transaction commit time, in
`bch2_drop_overwrites_from_journal()`.
Need better synchronization for `journal_key.overwritten` that doesn't rely on the
btree node lock. Challenge: new locks risk deadlock with existing lock hierarchy.
Potential tool: `bch2_trans_mutex_lock()` integrates with transaction deadlock
detection and could protect the journal replay key list (a rough model follows).
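As a rough model of that direction (not bcachefs code; a plain pthread mutex stands in for a lock that would have to integrate with transaction deadlock detection, e.g. via `bch2_trans_mutex_lock()`), the idea is to make "mark a journal-replay key overwritten" and journal replay's check of that flag atomic under one lock instead of relying on a btree node write lock:
```c
/*
 * Standalone model, not bcachefs code: a live-rw commit marks a
 * not-yet-replayed journal key as overwritten, and journal replay skips
 * keys so marked; both sides take the same lock, so the write buffer
 * commit path no longer needs a btree node lock for this.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct journal_key {
    unsigned long long pos;
    bool               overwritten;     /* a newer update supersedes this entry */
};

static pthread_mutex_t replay_lock = PTHREAD_MUTEX_INITIALIZER;
static struct journal_key replay_keys[] = {
    { .pos = 100, .overwritten = false },
    { .pos = 200, .overwritten = false },
};

/* commit path: a live update overwrites a key replay hasn't applied yet */
static void commit_overwrite(unsigned long long pos)
{
    pthread_mutex_lock(&replay_lock);
    for (unsigned i = 0; i < 2; i++)
        if (replay_keys[i].pos == pos)
            replay_keys[i].overwritten = true;
    pthread_mutex_unlock(&replay_lock);
}

/* journal replay: skip entries already superseded by live updates */
static void replay_one(unsigned i)
{
    pthread_mutex_lock(&replay_lock);
    if (!replay_keys[i].overwritten)
        printf("replay key at pos %llu\n", replay_keys[i].pos);
    else
        printf("skip overwritten key at pos %llu\n", replay_keys[i].pos);
    pthread_mutex_unlock(&replay_lock);
}

int main(void)
{
    commit_overwrite(100);      /* live update lands before replay gets there */
    replay_one(0);              /* skipped: would otherwise resurrect a stale key */
    replay_one(1);
    return 0;
}
```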
## Status
Root cause identified. Implementation of fix pending.