# Issue #1108 Analysis: Allocator stuck during journal replay

## Summary

The allocator deadlocks during journal replay when the NVMe metadata devices have too few free buckets to satisfy the `metadata_replicas=2` requirement.

## The Problem

During journal replay, a btree node split requires an allocation:

```
bch2_btree_update_start+0xc0d/0xcb0
bch2_btree_split_leaf+0x54/0x1c0
__bch2_trans_commit_error
bch2_journal_replay+0x2df/0x7d0
```

The allocator needs free buckets on two devices (to satisfy `metadata_replicas=2`), but:

- Device vde: 1 free bucket, 9416 buckets in `need_discard`, btree reserve = 2
- Device vdf: 5109 free buckets, but 41681 in `need_discard`
## The Infinite Wait Loop

In `btree/interior.c:1347-1353`, the loop retries until `bch2_btree_reserve_get()` returns anything other than `BCH_ERR_operation_blocked`:

```c
do {
	ret = bch2_btree_reserve_get(trans, as, nr_nodes, req);
	if (!bch2_err_matches(ret, BCH_ERR_operation_blocked))
		break;

	bch2_trans_unlock(trans);
	bch2_wait_on_allocator(c, req, ret, &cl);
} while (1);
```

Each retry then blocks in `__bch2_wait_on_allocator()` (`foreground.c:1781-1792`):

```c
void __bch2_wait_on_allocator(struct bch_fs *c, struct alloc_request *req,
			      int err, struct closure *cl)
{
	unsigned t = allocator_wait_timeout(c);

	if (t && closure_sync_timeout(cl, t)) {
		c->allocator.last_stuck = jiffies;
		bch2_print_allocator_stuck(c, req, err);
	}

	closure_sync(cl);	/* waits forever */
}
```

The timeout only controls when the "stuck" warning is printed; the final `closure_sync()` still waits unconditionally.

## Why the sysfs change doesn't help

The `alloc_request` was created with `metadata_replicas` snapshotted from `c->opts`:

```c
// interior.c:1309
READ_ONCE(c->opts.metadata_replicas)
```

Once the task is blocked in `closure_sync()`, the request never re-reads the current options. Writing `metadata_replicas=1` via sysfs therefore neither wakes nor modifies the already-waiting allocation.

## Chicken-and-egg

- `metadata_replicas` can't be set as a mount option (the error message recommends sysfs instead)
- sysfs requires a mounted filesystem
- the filesystem can't finish mounting because the allocator is stuck
## Potential Fixes

1. **Allow `metadata_replicas` as a recovery mount option**
   - Add it to mount option parsing for emergency recovery scenarios

2. **Make stuck allocations restartable**
   - When replica options change, wake waiting allocations so they re-check
   - Store a pointer to `c->opts` in the `alloc_request` rather than a snapshotted value

3. **Process `need_discard` more aggressively**
   - 9416 buckets are stuck in `need_discard` on vde; if they were available, the allocation would succeed
   - Discard processing should be prioritized during recovery

4. **Add a timeout escape hatch**
   - After N seconds stuck, re-check whether the options have changed
   - Or allow a sysfs write to signal "abort current waiting allocations"
## Related

- The buckets stuck in `need_discard` may be related to the discard bug in the work queue
- #1107 also shows recovery issues with corrupted state