# Issue #1108 Analysis: Allocator stuck during journal replay

## Summary

The allocator deadlocks during journal replay when the NVMe metadata devices have too few free buckets to satisfy the `metadata_replicas=2` requirement.

## The Problem

During journal replay, a btree node split requires an allocation:

```
bch2_btree_update_start+0xc0d/0xcb0
bch2_btree_split_leaf+0x54/0x1c0
__bch2_trans_commit_error
bch2_journal_replay+0x2df/0x7d0
```

The allocator needs free buckets on two devices (to satisfy `metadata_replicas=2`), but:

- Device vde: 1 free bucket, 9416 buckets in `need_discard`, btree reserve = 2
- Device vdf: 5109 free buckets, but 41681 in `need_discard`
## The Infinite Wait Loop

In `btree/interior.c:1347-1353`, the loop retries until `bch2_btree_reserve_get()` returns anything other than `BCH_ERR_operation_blocked`:

```c
do {
	ret = bch2_btree_reserve_get(trans, as, nr_nodes, req);
	if (!bch2_err_matches(ret, BCH_ERR_operation_blocked))
		break;

	bch2_trans_unlock(trans);
	bch2_wait_on_allocator(c, req, ret, &cl);
} while (1);
```

Each retry then blocks in `__bch2_wait_on_allocator()` (`foreground.c:1781-1792`):

```c
void __bch2_wait_on_allocator(struct bch_fs *c, struct alloc_request *req,
			      int err, struct closure *cl)
{
	unsigned t = allocator_wait_timeout(c);

	if (t && closure_sync_timeout(cl, t)) {
		c->allocator.last_stuck = jiffies;
		bch2_print_allocator_stuck(c, req, err);
	}

	closure_sync(cl);	/* waits forever */
}
```

The timeout only controls when the "stuck" warning is printed; the final `closure_sync()` still waits unconditionally.

## Why the sysfs change doesn't help

The `alloc_request` was created with `metadata_replicas` snapshotted from `c->opts`:

```c
// interior.c:1309
READ_ONCE(c->opts.metadata_replicas)
```

Once the task is blocked in `closure_sync()`, the request never re-reads the current options. Writing `metadata_replicas=1` via sysfs therefore neither wakes nor modifies the already-waiting allocation.

## Chicken-and-egg

- `metadata_replicas` can't be set as a mount option (the error message recommends sysfs instead)
- sysfs requires a mounted filesystem
- the filesystem can't finish mounting because the allocator is stuck
## Potential Fixes

1. **Allow `metadata_replicas` as a recovery mount option**
   - Add it to mount option parsing for emergency recovery scenarios

2. **Make stuck allocations restartable**
   - When replica options change, wake waiting allocations so they re-check
   - Store a pointer to `c->opts` in the `alloc_request` rather than a snapshotted value

3. **Process `need_discard` more aggressively**
   - 9416 buckets are stuck in `need_discard` on vde; if they were available, the allocation would succeed
   - Discard processing should be prioritized during recovery

4. **Add a timeout escape hatch**
   - After N seconds stuck, re-check whether the options have changed
   - Or allow a sysfs write to signal "abort current waiting allocations"
## Related

- The buckets stuck in `need_discard` may be related to the discard bug in the work queue
- #1107 also shows recovery issues with corrupted state