Discard Write Buffer Bug Investigation (2026-04-14)

Symptom

Spurious "bucket incorrectly set in need_discard btree" errors during fsck. The check code sees a need_discard key that should have been deleted.

Key Data Points (from Kent's tracing)

  • Write buffer flushed at seq 436
  • need_discard DELETE was at seq 432
  • After transaction restart, peek_slot STILL returns the old key

Code Flow

Check Code (alloc/check.c:167-179)

bch2_btree_iter_set_pos(discard_iter,
    POS(a->v.journal_seq_empty, bucket_to_u64(alloc_k.k->p)));
k = bkey_try(bch2_btree_iter_peek_slot(discard_iter));

bool is_discarded = a->v.data_type == BCH_DATA_need_discard;
if (!!k.k->type != is_discarded) {
    try(bch2_btree_write_buffer_maybe_flush(trans, alloc_k, last_flushed));
    // After restart, should re-execute from function start with fresh data
    
    if (need_discard_or_freespace_err_on(...))
        // Log error and repair
}

Trigger Code (alloc/background.c:1381-1386)

if (statechange(a->data_type == BCH_DATA_need_discard) ||
    (old_a->data_type == BCH_DATA_need_discard &&
     old_a->journal_seq_empty != new_a->journal_seq_empty)) {
    try(bch2_bucket_do_discard_index(trans, old, old_a, false));  // DELETE
    try(bch2_bucket_do_discard_index(trans, new.s_c, new_a, true)); // SET (returns early if not need_discard)
}

Ruled Out

  1. Iterator caching: After bch2_trans_begin, paths are marked NEED_RELOCK, subsequent peek_slot re-traverses and gets fresh data.

  2. Write buffer coalescing: Keys at the same position are coalesced, with the later key winning. A DELETE at seq 432 would only be overwritten by a later SET at the same position.

  3. Position mismatch (simple case): DELETE uses old_a->journal_seq_empty, check uses current journal_seq_empty. When transitioning out of need_discard without journal_seq_empty changing, these match.

  4. Journal fetch boundaries: Flush at seq 436 uses journal_cur_seq() as max_seq, iteration is seq <= max_seq (inclusive), so seq 432 is included.

  5. bch2_btree_bset_insert_key DELETE handling: If key exists, it's marked deleted. If key doesn't exist, DELETE is no-op. Neither explains seeing the key after flush.

Remaining Hypotheses

  1. Position mismatch (complex case): If journal_seq_empty changed between key creation and the DELETE, they'd be at different positions. The trigger handles this at lines 1382-1383, but there might be an edge case.

  2. Multiple keys: Could there be multiple need_discard keys for the same bucket at different journal_seq_empty positions, with only some being deleted?

  3. Write buffer key skipped: Some condition in wb_flush_one causing the key to not be applied to the btree.

  4. Btree node not visible: Some caching or sequencing issue where the btree node modification isn't visible to the subsequent lookup.

Recent Relevant Commit

fe43d8a0c1bb bcachefs: Reindex need_discard btree by journal seq

Changed the key format from POS(dev_idx, bucket) to POS(journal_seq_empty, bucket_to_u64(bucket)). This commit is also where bch2_btree_write_buffer_maybe_flush() was added to the check code.

Deeper Analysis (2026-04-14 continued)

Write Buffer Flush Flow

  1. maybe_flush calls btree_write_buffer_flush_seq(trans, journal_cur_seq())
  2. This fetches keys from journal up to max_seq via fetch_wb_keys_from_journal
  3. Keys are sorted, deduplicated (later key wins), then flushed via wb_flush_one
  4. Returns transaction_restart_write_buffer_flush
  5. Second call with same key returns 0 without flushing again

Key Coalescing Logic (write_buffer.c:430-442)

When two keys at same position found during sort:

  • Earlier key (lower journal_seq) gets journal_seq = 0 (skipped)
  • Later key is kept and flushed
  • DELETE at seq 432 SHOULD overwrite SET at earlier seq

DELETE Handling (commit.c:199-201)

if (bkey_deleted(&insert->k) && !k)
    return false;  // DELETE at empty position is no-op

DELETE only removes an existing key. If the key doesn't exist in the btree, the DELETE is a no-op.

Still Unexplained

After the flush and restart, peek_slot at POS(journal_seq_empty, bucket) still returns the key. One of the following must be happening:

  1. DELETE was written to different position than lookup
  2. DELETE was skipped during flush
  3. A new SET was written after the DELETE
  4. Something preventing btree node modification visibility

Current Debug Output

Kent added logging to show:

  • Key value (k) when mismatch detected in check.c
  • Journal seq and referring key (alloc_k) in maybe_flush

Root Cause Identified (2026-04-14 evening)

Kent identified the actual root cause: write buffer btrees have a synchronization issue with journal replay.

The Problem

During journal replay, the filesystem is live, read-write, and multithreaded. Another thread may update a key that journal replay hasn't replayed yet; without synchronization, replay can later clobber that newer value with the stale journal entry.

For non-write-buffer btrees, this is solved by marking the key in the journal replay list as overwritten while holding the btree node write lock. The lock provides synchronization.

For write buffer btrees, there's no btree node lock at the right granularity. The write buffer commit path doesn't hold a btree node lock.

Why need_discard Can't Use the Previous Workaround

Previously: don't use write buffer during journal replay, do normal btree updates.

But need_discard MUST use the write buffer because:

  1. Updates happen in the atomic trigger (holding btree node write lock)
  2. Journal seq isn't known until that point
  3. Can't do a normal btree update while holding another node's write lock

Fix Direction

The proper place for the check is transaction commit time, in bch2_drop_overwrites_from_journal().

Need better synchronization for journal_key.overwritten that doesn't rely on the btree node lock. Challenge: new locks risk deadlock with existing lock hierarchy.

Potential tool: bch2_trans_mutex_lock() integrates with transaction deadlock detection, could protect the journal replay key list.

Status

Root cause identified. Implementation of fix pending.