# Discard Write Buffer Bug Investigation (2026-04-14)

## Symptom

Spurious "bucket incorrectly set in need_discard btree" errors during fsck. The check code sees a need_discard key that should have been deleted.

## Key Data Points (from Kent's tracing)

- Write buffer flushed at seq 436
- need_discard DELETE was at seq 432
- After transaction restart, peek_slot STILL returns the old key

## Code Flow

### Check Code (alloc/check.c:167-179)

```c
bch2_btree_iter_set_pos(discard_iter,
			POS(a->v.journal_seq_empty, bucket_to_u64(alloc_k.k->p)));
k = bkey_try(bch2_btree_iter_peek_slot(discard_iter));

bool is_discarded = a->v.data_type == BCH_DATA_need_discard;
if (!!k.k->type != is_discarded) {
	try(bch2_btree_write_buffer_maybe_flush(trans, alloc_k, last_flushed));
	// After restart, should re-execute from function start with fresh data

	if (need_discard_or_freespace_err_on(...))
		// Log error and repair
}
```

### Trigger Code (alloc/background.c:1381-1386)

```c
if (statechange(a->data_type == BCH_DATA_need_discard) ||
    (old_a->data_type == BCH_DATA_need_discard &&
     old_a->journal_seq_empty != new_a->journal_seq_empty)) {
	try(bch2_bucket_do_discard_index(trans, old, old_a, false));   // DELETE
	try(bch2_bucket_do_discard_index(trans, new.s_c, new_a, true)); // SET (returns early if not need_discard)
}
```

## Ruled Out

1. **Iterator caching**: After `bch2_trans_begin`, paths are marked NEED_RELOCK; a subsequent peek_slot re-traverses and gets fresh data.
2. **Write buffer coalescing**: Keys at the same position are coalesced with the later key winning. The DELETE at seq 432 would only be overwritten by a later SET at the same position.
3. **Position mismatch (simple case)**: The DELETE uses `old_a->journal_seq_empty`; the check uses the current `journal_seq_empty`. When transitioning out of need_discard without journal_seq_empty changing, these match.
4. **Journal fetch boundaries**: The flush at seq 436 uses `journal_cur_seq()` as max_seq, and iteration is `seq <= max_seq` (inclusive), so seq 432 is included.
5.
**bch2_btree_bset_insert_key DELETE handling**: If the key exists, it's marked deleted. If the key doesn't exist, the DELETE is a no-op. Neither explains seeing the key after the flush.

## Remaining Hypotheses

1. **Position mismatch (complex case)**: If journal_seq_empty changed between key creation and the DELETE, they'd be at different positions. The trigger handles this at lines 1382-1383, but there might be an edge case.
2. **Multiple keys**: Could there be multiple need_discard keys for the same bucket at different journal_seq_empty positions, with only some being deleted?
3. **Write buffer key skipped**: Some condition in `wb_flush_one` causes the key not to be applied to the btree.
4. **Btree node not visible**: Some caching or sequencing issue where the btree node modification isn't visible to the subsequent lookup.

## Recent Relevant Commit

```
fe43d8a0c1bb bcachefs: Reindex need_discard btree by journal seq
```

Changed the key format from `POS(dev_idx, bucket)` to `POS(journal_seq_empty, bucket_to_u64(bucket))`. This is when the write_buffer_maybe_flush was added to the check code.

## Deeper Analysis (2026-04-14 continued)

### Write Buffer Flush Flow

1. `maybe_flush` calls `btree_write_buffer_flush_seq(trans, journal_cur_seq())`
2. This fetches keys from the journal up to max_seq via `fetch_wb_keys_from_journal`
3. Keys are sorted, deduplicated (later key wins), then flushed via `wb_flush_one`
4. Returns `transaction_restart_write_buffer_flush`
5. A second call with the same key returns 0 without flushing again

### Key Coalescing Logic (write_buffer.c:430-442)

When two keys at the same position are found during sort:

- The earlier key (lower journal_seq) gets `journal_seq = 0` (skipped)
- The later key is kept and flushed
- The DELETE at seq 432 SHOULD overwrite a SET at an earlier seq

### DELETE Handling (commit.c:199-201)

```c
if (bkey_deleted(&insert->k) && !k)
	return false; // DELETE at empty position is no-op
```

A DELETE only removes an existing key. If the key doesn't exist in the btree, the DELETE is a no-op.
### Still Unexplained

After flush+restart, `peek_slot` at `POS(journal_seq_empty, bucket)` still returns the key. Either:

1. The DELETE was written to a different position than the lookup
2. The DELETE was skipped during flush
3. A new SET was written after the DELETE
4. Something is preventing btree node modification visibility

### Current Debug Output

Kent added logging to show:

- The key value (`k`) when a mismatch is detected in check.c
- The journal seq and referring key (`alloc_k`) in maybe_flush

## Root Cause Identified (2026-04-14 evening)

Kent identified the actual root cause: **write buffer btrees have a synchronization issue with journal replay**.

### The Problem

During journal replay, the fs is live, rw, and multithreaded. Other threads might update a key that overwrites something journal replay hasn't replayed yet.

For **non-write-buffer btrees**, this is solved by marking the key in the journal replay list as overwritten while holding the btree node write lock. The lock provides the synchronization.

For **write buffer btrees**, there's no btree node lock at the right granularity: the write buffer commit path doesn't hold a btree node lock.

### Why need_discard Can't Use the Previous Workaround

Previously the workaround was: don't use the write buffer during journal replay; do normal btree updates instead. But `need_discard` MUST use the write buffer because:

1. Updates happen in the atomic trigger (while holding a btree node write lock)
2. The journal seq isn't known until that point
3. A normal btree update can't be done while holding another node's write lock

### Fix Direction

The proper place for the check is transaction commit time, in `bch2_drop_overwrites_from_journal()`. We need better synchronization for `journal_key.overwritten` that doesn't rely on the btree node lock.

Challenge: new locks risk deadlock with the existing lock hierarchy.

Potential tool: `bch2_trans_mutex_lock()` integrates with transaction deadlock detection and could protect the journal replay key list.

## Status

Root cause identified. Implementation of fix pending.