consciousness/research/issue-1108-analysis.md

Issue #1108 Analysis: Allocator stuck during journal replay

Summary

The allocator deadlocks during journal replay when the NVMe metadata devices have too few free buckets to satisfy the metadata_replicas=2 requirement.

The Problem

During journal replay, a btree node split requires allocation:

```
bch2_btree_update_start+0xc0d/0xcb0
bch2_btree_split_leaf+0x54/0x1c0
__bch2_trans_commit_error
bch2_journal_replay+0x2df/0x7d0
```

The allocator needs free buckets on two devices (for metadata_replicas=2), but:

  • Device vde: 1 free bucket, 9416 buckets in need_discard, btree reserve = 2
  • Device vdf: 5109 free buckets, 41681 in need_discard

The Infinite Wait Loop

In btree/interior.c:1347-1353:

```c
do {
    ret = bch2_btree_reserve_get(trans, as, nr_nodes, req);
    if (!bch2_err_matches(ret, BCH_ERR_operation_blocked))
        break;
    bch2_trans_unlock(trans);
    bch2_wait_on_allocator(c, req, ret, &cl);
} while (1);
```

And __bch2_wait_on_allocator (foreground.c:1781-1792):

```c
void __bch2_wait_on_allocator(struct bch_fs *c, struct alloc_request *req,
                              int err, struct closure *cl)
{
    unsigned t = allocator_wait_timeout(c);
    if (t && closure_sync_timeout(cl, t)) {
        c->allocator.last_stuck = jiffies;
        bch2_print_allocator_stuck(c, req, err);
    }
    closure_sync(cl);  // Waits forever
}
```

Why the sysfs change doesn't help

The alloc_request was created with metadata_replicas from c->opts:

```c
// interior.c:1309
READ_ONCE(c->opts.metadata_replicas)
```

Once it is waiting in closure_sync(), the request never re-reads the current options. Setting metadata_replicas=1 via sysfs therefore neither wakes nor modifies the already-waiting allocation.

Chicken-and-egg

  • metadata_replicas can't be set as mount option (error recommends sysfs)
  • sysfs requires mounted filesystem
  • filesystem can't mount because allocator is stuck

Potential Fixes

  1. Allow metadata_replicas as recovery mount option

    • Add to mount option parsing for emergency recovery scenarios
  2. Make stuck allocations restartable

    • When replica options change, wake waiting allocations to re-check
    • Store pointer to c->opts in alloc_request rather than snapshot value
  3. Process need_discard more aggressively

    • 9416 buckets stuck in need_discard on vde
    • If these were available, allocation would succeed
    • Discard processing during recovery should be prioritized
  4. Add timeout escape hatch

    • After N seconds stuck, check if options have changed
    • Or allow sysfs write to signal "abort current waiting allocations"
Related

  • The need_discard stuck buckets may be related to the discard bug in the work queue
  • #1107 also shows recovery issues with corrupted state
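Fixes 2 and 4 could be combined by restructuring the interior.c retry loop. The following is a pseudocode sketch only, mirroring the snippets above; bch2_wait_on_allocator_timeout is a hypothetical helper (returning true on timeout instead of blocking forever), not an existing kernel function:

```c
/* Pseudocode sketch, not kernel code: re-read the live option on every
 * iteration, and turn the allocator wait into a bounded wait so the
 * loop gets a chance to pick up a reduced replica count. */
do {
    req->nr_replicas = READ_ONCE(c->opts.metadata_replicas);
    ret = bch2_btree_reserve_get(trans, as, nr_nodes, req);
    if (!bch2_err_matches(ret, BCH_ERR_operation_blocked))
        break;
    bch2_trans_unlock(trans);
    if (bch2_wait_on_allocator_timeout(c, req, ret, &cl))
        continue;   /* timed out: loop and re-read options */
} while (1);
```

With this shape, the sysfs write in the chicken-and-egg scenario above would take effect on the next timeout instead of being ignored.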