# Issue #1108 Analysis: Allocator stuck during journal replay

## Summary

The allocator deadlocks during journal replay when the NVMe metadata devices have too few free buckets to satisfy the `metadata_replicas=2` requirement.

## The Problem

During journal replay, a btree node split requires allocation:

```
bch2_btree_update_start+0xc0d/0xcb0
bch2_btree_split_leaf+0x54/0x1c0
__bch2_trans_commit_error
bch2_journal_replay+0x2df/0x7d0
```

The allocator needs free buckets on two devices (for `metadata_replicas=2`), but:

- Device vde: 1 free bucket, 9416 in `need_discard`, btree reserve = 2
- Device vdf: 5109 free but 41681 in `need_discard`

## The Infinite Wait Loop

In `btree/interior.c:1347-1353`:

```c
do {
	ret = bch2_btree_reserve_get(trans, as, nr_nodes, req);
	if (!bch2_err_matches(ret, BCH_ERR_operation_blocked))
		break;

	bch2_trans_unlock(trans);
	bch2_wait_on_allocator(c, req, ret, &cl);
} while (1);
```

And `__bch2_wait_on_allocator` (foreground.c:1781-1792):

```c
void __bch2_wait_on_allocator(struct bch_fs *c, struct alloc_request *req,
			      int err, struct closure *cl)
{
	unsigned t = allocator_wait_timeout(c);

	if (t && closure_sync_timeout(cl, t)) {
		c->allocator.last_stuck = jiffies;
		bch2_print_allocator_stuck(c, req, err);
	}

	closure_sync(cl); // Waits forever
}
```

## Why the sysfs change doesn't help

The `alloc_request` was created with `metadata_replicas` snapshotted from `c->opts`:

```c
// interior.c:1309
READ_ONCE(c->opts.metadata_replicas)
```

Once it is waiting in `closure_sync()`, the request never re-checks the current options. Changing `metadata_replicas=1` via sysfs neither wakes nor modifies the already-waiting allocation.

## Chicken-and-egg

- `metadata_replicas` can't be set as a mount option (the error message recommends sysfs)
- sysfs requires a mounted filesystem
- the filesystem can't mount because the allocator is stuck

## Potential Fixes

1. **Allow `metadata_replicas` as a recovery mount option**
   - Add it to mount option parsing for emergency recovery scenarios
2. **Make stuck allocations restartable**
   - When replica options change, wake waiting allocations so they re-check
   - Store a pointer to `c->opts` in the `alloc_request` rather than a snapshot value
3. **Process `need_discard` more aggressively**
   - 9416 buckets are stuck in `need_discard` on vde
   - If these were available, the allocation would succeed
   - Discard processing should be prioritized during recovery
4. **Add a timeout escape hatch**
   - After N seconds stuck, re-check whether options have changed
   - Or allow a sysfs write to signal "abort current waiting allocations"

## Related

- The buckets stuck in `need_discard` may be related to the discard bug in the work queue
- #1107 also shows recovery issues with corrupted state