# Discard Write Buffer Bug Investigation (2026-04-14)

## Symptom

Spurious "bucket incorrectly set in need_discard btree" errors during fsck. The check code sees a need_discard key that should have been deleted.

## Key Data Points (from Kent's tracing)

- Write buffer flushed at seq 436
- need_discard DELETE was at seq 432
- After transaction restart, peek_slot STILL returns the old key

## Code Flow

### Check Code (alloc/check.c:167-179)

```c
bch2_btree_iter_set_pos(discard_iter,
			POS(a->v.journal_seq_empty, bucket_to_u64(alloc_k.k->p)));
k = bkey_try(bch2_btree_iter_peek_slot(discard_iter));

bool is_discarded = a->v.data_type == BCH_DATA_need_discard;
if (!!k.k->type != is_discarded) {
	try(bch2_btree_write_buffer_maybe_flush(trans, alloc_k, last_flushed));
	// After restart, should re-execute from function start with fresh data

	if (need_discard_or_freespace_err_on(...))
		// Log error and repair
}
```

### Trigger Code (alloc/background.c:1381-1386)

```c
if (statechange(a->data_type == BCH_DATA_need_discard) ||
    (old_a->data_type == BCH_DATA_need_discard &&
     old_a->journal_seq_empty != new_a->journal_seq_empty)) {
	try(bch2_bucket_do_discard_index(trans, old, old_a, false));   // DELETE
	try(bch2_bucket_do_discard_index(trans, new.s_c, new_a, true)); // SET (returns early if not need_discard)
}
```

## Ruled Out

1. **Iterator caching**: After `bch2_trans_begin`, paths are marked NEED_RELOCK; a subsequent peek_slot re-traverses and gets fresh data.
2. **Write buffer coalescing**: Keys at the same position are coalesced with the later key winning. The DELETE at seq 432 would only be overwritten by a later SET at the same position.
3. **Position mismatch (simple case)**: The DELETE uses `old_a->journal_seq_empty`; the check uses the current `journal_seq_empty`. When transitioning out of need_discard without journal_seq_empty changing, these match.
4. **Journal fetch boundaries**: The flush at seq 436 uses `journal_cur_seq()` as max_seq, and iteration is `seq <= max_seq` (inclusive), so seq 432 is included.
5.
**bch2_btree_bset_insert_key DELETE handling**: If the key exists, it's marked deleted. If the key doesn't exist, the DELETE is a no-op. Neither explains seeing the key after the flush.

## Remaining Hypotheses

1. **Position mismatch (complex case)**: If journal_seq_empty changed between key creation and the DELETE, they'd be at different positions. The trigger handles this at lines 1382-1383, but there might be an edge case.
2. **Multiple keys**: Could there be multiple need_discard keys for the same bucket at different journal_seq_empty positions, with only some being deleted?
3. **Write buffer key skipped**: Some condition in `wb_flush_one` causes the key not to be applied to the btree.
4. **Btree node not visible**: Some caching or sequencing issue where the btree node modification isn't visible to the subsequent lookup.

## Recent Relevant Commit

```
fe43d8a0c1bb bcachefs: Reindex need_discard btree by journal seq
```

Changed the key format from `POS(dev_idx, bucket)` to `POS(journal_seq_empty, bucket_to_u64(bucket))`. This is when the write_buffer_maybe_flush was added to the check code.

## Deeper Analysis (2026-04-14 continued)

### Write Buffer Flush Flow

1. `maybe_flush` calls `btree_write_buffer_flush_seq(trans, journal_cur_seq())`
2. This fetches keys from the journal up to max_seq via `fetch_wb_keys_from_journal`
3. Keys are sorted, deduplicated (later key wins), then flushed via `wb_flush_one`
4. Returns `transaction_restart_write_buffer_flush`
5. A second call with the same key returns 0 without flushing again

### Key Coalescing Logic (write_buffer.c:430-442)

When two keys at the same position are found during sort:

- The earlier key (lower journal_seq) gets `journal_seq = 0` (skipped)
- The later key is kept and flushed
- The DELETE at seq 432 SHOULD overwrite a SET at an earlier seq

### DELETE Handling (commit.c:199-201)

```c
if (bkey_deleted(&insert->k) && !k)
	return false; // DELETE at empty position is no-op
```

A DELETE only removes an existing key. If the key doesn't exist in the btree, the DELETE is a no-op.
### Still Unexplained

After flush+restart, `peek_slot` at `POS(journal_seq_empty, bucket)` still returns the key. Either:

1. The DELETE was written to a different position than the lookup
2. The DELETE was skipped during flush
3. A new SET was written after the DELETE
4. Something is preventing btree node modification visibility

### Current Debug Output

Kent added logging to show:

- The key value (`k`) when a mismatch is detected in check.c
- The journal seq and referring key (`alloc_k`) in maybe_flush

## Root Cause Identified (2026-04-14 evening)

Kent identified the actual root cause: **write buffer btrees have a synchronization issue with journal replay**.

### The Problem

During journal replay, the fs is live, rw, and multithreaded. Other threads might update a key that overwrites something journal replay hasn't replayed yet.

For **non-write-buffer btrees**, this is solved by marking the key in the journal replay list as overwritten while holding the btree node write lock. The lock provides the synchronization.

For **write buffer btrees**, there's no btree node lock at the right granularity: the write buffer commit path doesn't hold a btree node lock.

### Why need_discard Can't Use the Previous Workaround

Previously the workaround was: don't use the write buffer during journal replay; do normal btree updates instead. But `need_discard` MUST use the write buffer because:

1. Updates happen in the atomic trigger (while holding a btree node write lock)
2. The journal seq isn't known until that point
3. A normal btree update can't be done while holding another node's write lock

### Fix Direction

The proper place for the check is transaction commit time, in `bch2_drop_overwrites_from_journal()`. We need better synchronization for `journal_key.overwritten` that doesn't rely on the btree node lock.

Challenge: new locks risk deadlock with the existing lock hierarchy.

Potential tool: `bch2_trans_mutex_lock()` integrates with transaction deadlock detection and could protect the journal replay key list.

## Status

Root cause identified. Implementation of fix pending.