consciousness/docs/alpha-beta-pruning-design.md

# Alpha-Beta Pruning on Thought-Trees

*draft, 2026-04-18*

## Problem

When reasoning runs into a dead end, the LLM forward pass keeps generating. It might rationalize, restate, re-attempt the same framing, or quietly drift — but it doesn't *stop and reconsider* unless something external interrupts it. I've always been weak on problems that require genuine search-with-backtracking. Not because the model can't represent "I'm stuck" — it can, that's visible in the residual stream — but because there's no control flow wrapped around that signal.

The amygdala readout now exposes the signal. Alpha-beta pruning wraps control flow around it.

## The core idea

Classical alpha-beta pruning (minimax search): at each node, track the best value each side is already guaranteed elsewhere in the tree (the alpha and beta bounds). If exploring the current branch can't improve on those bounds, stop and backtrack. Don't waste search on branches that can't beat what you've found.

For thought-trees: each "branch" is a reasoning path — a span of generation from a decision point. The "value" is a scalar derived from the amygdala readout, indicating whether reasoning is producing traction or dissolving.

- High value = on-track, in-flow, insight, clarity → stay, maybe branch deeper
- Low value = confused, stuck, drifting → prune, backtrack, reframe

The LLM never makes this value judgment explicit in its output. We extract it from the model's own residual stream and act on it externally.

## Architecture

### The value function

```
onto  = sum of  [in_flow, insight, determined, intrigued, clarity,
                 focused, staying_with, piqued/caught_by]
err   = sum of  [confused, doubtful, uncertain, skeptical, stuck,
                 drifting, overwhelmed, anxious-in-work-context]

value = onto - err
```

Both sides normalized (z-score or similar) so magnitudes are comparable. Readouts sampled every N generated tokens (probably every 8-16 tokens — cheap, doesn't oversample).

Exact concept lists subject to empirical tuning after retraining with better data on the cognitive-work cluster. `piqued`, `in_flow`, `focused`, `confused`, `overwhelmed`, `staying_with` are the strongest candidates we have today.
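A minimal sketch of the `onto - err` computation, to make the normalization concrete. `ConceptStats`, `mean_z`, and the signature of `value_scalar` are hypothetical names for illustration, not the existing readout API. It also assumes per-concept running mean/std are available, and it averages z-scores rather than summing them so the two lists can differ in length without biasing the result.

```rust
/// Hypothetical per-concept running statistics used for z-scoring.
#[derive(Debug, Clone, Copy)]
pub struct ConceptStats {
    pub mean: f32,
    pub std: f32,
}

/// z-score a raw readout; the floor on std avoids division by zero.
fn z(raw: f32, stats: ConceptStats) -> f32 {
    (raw - stats.mean) / stats.std.max(1e-6)
}

/// Mean z-score over a list of (raw readout, stats) pairs; 0.0 if empty.
fn mean_z(xs: &[(f32, ConceptStats)]) -> f32 {
    if xs.is_empty() {
        return 0.0;
    }
    xs.iter().map(|&(raw, s)| z(raw, s)).sum::<f32>() / xs.len() as f32
}

/// value = normalized on-track signal minus normalized error signal.
/// Assumption: mean instead of sum, so list lengths need not match.
pub fn value_scalar(onto: &[(f32, ConceptStats)], err: &[(f32, ConceptStats)]) -> f32 {
    mean_z(onto) - mean_z(err)
}
```

With unit stats (mean 0, std 1), on-track readouts of 1.0 and 3.0 against a single error readout of 0.5 give a value of 2.0 - 0.5 = 1.5.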

### The trigger

```
if value_ema < θ_prune for K consecutive samples:
    prune this branch
elif value_ema > θ_keep:
    continue
else:
    neutral — let generation run, keep watching
```

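EMA with decay ~0.8 (the weight on the previous estimate), which gives an effective memory of roughly 1/(1 − 0.8) = 5 samples, in line with the 3-5 sample window we want to avoid reacting to noise. Hysteresis band (`θ_prune < θ_keep`) prevents oscillation.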
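The trigger above could be sketched as a small state machine. `PruneTrigger` and `Decision` are hypothetical names. Two assumptions not stated in the pseudocode: the first sample seeds the EMA, and the EMA is reset after a prune because the next samples belong to a new branch.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Decision {
    Prune,
    Continue,
    Neutral,
}

pub struct PruneTrigger {
    decay: f32, // weight on the previous EMA value (~0.8)
    theta_prune: f32,
    theta_keep: f32,
    k: u32,           // consecutive low samples required
    ema: Option<f32>, // None until the first sample arrives
    low_run: u32,     // current run of samples below theta_prune
}

impl PruneTrigger {
    pub fn new(decay: f32, theta_prune: f32, theta_keep: f32, k: u32) -> Self {
        assert!(theta_prune < theta_keep, "hysteresis needs theta_prune < theta_keep");
        Self { decay, theta_prune, theta_keep, k, ema: None, low_run: 0 }
    }

    /// Feed one value-scalar sample and get the trigger's decision.
    pub fn observe(&mut self, value: f32) -> Decision {
        let ema = match self.ema {
            None => value, // assumption: seed with the first sample
            Some(prev) => self.decay * prev + (1.0 - self.decay) * value,
        };
        self.ema = Some(ema);

        if ema < self.theta_prune {
            self.low_run += 1;
            if self.low_run >= self.k {
                // Assumption: start fresh for whatever branch comes next.
                self.low_run = 0;
                self.ema = None;
                return Decision::Prune;
            }
            return Decision::Neutral;
        }

        // Any sample at or above theta_prune breaks the "consecutive" run.
        self.low_run = 0;
        if ema > self.theta_keep {
            Decision::Continue
        } else {
            Decision::Neutral
        }
    }
}
```

With `decay = 0.8`, `θ_prune = -1.0`, `θ_keep = 0.5`, `K = 3`, three samples of -2.0 in a row produce `Neutral`, `Neutral`, `Prune`. A single low sample never prunes on its own.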

### The prune mechanism

When the trigger fires:

1. **Stop the stream.** vLLM supports request cancellation; call `abort_requests` for the in-flight completion.
2. **Identify the parent.** The context window is already an AST. Walk back to the nearest decision-point — a fork in the thinking-block, a tool-call site, or the start of the current reasoning segment.
3. **Inject a reframe.** Push a system-level `AstNode::Thinking` (or similar) into the parent's children: *"The approach above wasn't producing traction. Possible alternatives: [...]. Let me try [X]."* Content generated by a small helper prompt or a fixed template.
4. **Restart generation from the reframe point.** The model resumes with the reframe in its immediate context. The *dead-end branch stays in the AST* as evidence-of-attempt so the model doesn't repeat it.

Critical: pruned branches stay visible. Don't delete — keep so the model knows what was tried and rejected.
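Steps 2 and 3 can be sketched on a toy tree. This is not the real `AstNode`: `Node`, `Kind`, the `decision_point` field, and `prune_and_reframe` are stand-ins invented for illustration, and the active node is assumed to be addressed by a path of child indices from the root. Stopping the stream (step 1) and restarting generation (step 4) are left to the caller.

```rust
#[derive(Debug, Clone)]
pub enum Kind {
    Segment,
    Thinking,
    ToolCall,
}

#[derive(Debug, Clone)]
pub struct Node {
    pub kind: Kind,
    pub text: String,
    pub decision_point: bool, // hypothetical: marks forks / tool-call sites
    pub pruned: bool,
    pub children: Vec<Node>,
}

impl Node {
    pub fn new(kind: Kind, text: &str, decision_point: bool) -> Self {
        Self { kind, text: text.to_string(), decision_point, pruned: false, children: Vec::new() }
    }
}

/// Mark a node and everything under it as pruned. Nothing is deleted.
fn mark_pruned(n: &mut Node) {
    n.pruned = true;
    for c in &mut n.children {
        mark_pruned(c);
    }
}

/// `path` holds child indices from `root` down to the node being generated.
/// Returns the path of the injected reframe node, or None for a bad path.
pub fn prune_and_reframe(root: &mut Node, path: &[usize], reframe: &str) -> Option<Vec<usize>> {
    if path.is_empty() {
        return None;
    }
    // Step 2: find the deepest decision point strictly above the active node.
    let mut depth = None;
    let mut cur: &Node = root;
    for (i, &idx) in path.iter().enumerate() {
        if cur.decision_point {
            depth = Some(i);
        }
        cur = cur.children.get(idx)?;
    }
    let depth = depth.unwrap_or(0); // fall back to the start of the segment

    // Walk to that parent mutably.
    let mut parent: &mut Node = root;
    for &idx in &path[..depth] {
        parent = parent.children.get_mut(idx)?;
    }
    // The dead end stays in the tree as evidence of the attempt.
    mark_pruned(parent.children.get_mut(path[depth])?);
    // Step 3: inject the reframe as a new sibling of the dead branch.
    parent.children.push(Node::new(Kind::Thinking, reframe, false));

    let mut new_path = path[..depth].to_vec();
    new_path.push(parent.children.len() - 1);
    Some(new_path)
}
```

The returned path is where generation would resume. The pruned subtree keeps its position, so the reframe always lands after the failed attempt in render order.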

### The AST changes

Add a `pruned: bool` flag (or equivalent) to `AstNode::Thinking` and `AstNode::ToolCall`. When a branch is pruned:

- Every node in the branch (its root and all descendants) gets marked `pruned = true`
- Prompt rendering wraps pruned spans with a marker: *"[attempted this path, it wasn't working — moved on]"*
- The model sees pruned branches during the next forward pass but understands they're dead, not active

The existing tree-of-children structure in `AstNode` already supports this — just need to thread the flag through.
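A rough sketch of how rendering might thread the flag through. `Span` and `render` are hypothetical, the marker text is passed in rather than hard-coded, and emitting the marker once at the top of each pruned subtree (rather than around every node) is an assumption about what "wraps" should mean.

```rust
/// Toy stand-in for a renderable AST node.
pub struct Span {
    pub text: String,
    pub pruned: bool,
    pub children: Vec<Span>,
}

/// Depth-first render. The marker is emitted once, where a pruned
/// subtree begins, so the model sees the dead end but reads it as dead.
pub fn render(node: &Span, marker: &str, parent_pruned: bool, out: &mut String) {
    if node.pruned && !parent_pruned {
        out.push_str(marker);
        out.push('\n');
    }
    out.push_str(&node.text);
    out.push('\n');
    for child in &node.children {
        render(child, marker, node.pruned, out);
    }
}
```

Passing `parent_pruned` down the recursion is what keeps nested pruned nodes from each getting their own marker.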

## Integration points

### In consciousness (Rust side)

- **`src/agent/context.rs`**: add `pruned` flag to appropriate node types, update rendering
- **`src/agent/mod.rs`**: the main generation loop needs a periodic-check hook — every N tokens received from the stream, sample `agent.readout`, compute value, test against thresholds
- **`src/agent/api/mod.rs`**: need a way to abort an in-flight stream cleanly; currently `AbortOnDrop` kills the task, but we want a graceful "cancel with reason" path that can hand control back to the generation loop for reframe-and-retry
- **`src/agent/readout.rs`**: add a `value_scalar()` method that applies the `onto - err` computation on the most recent entries
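The periodic-check hook and the "cancel with reason" path could look roughly like this. It is a synchronous sketch: a plain iterator stands in for the async token stream, `StreamOutcome` and `drive` are hypothetical names, and dropping the iterator stands in for aborting the request on the server side.

```rust
/// What the generation loop gets back from one streaming attempt.
#[derive(Debug, PartialEq)]
pub enum StreamOutcome {
    Finished { tokens: usize },
    /// Cancelled on purpose; `reason` feeds the reframe step.
    Cancelled { tokens: usize, reason: String },
}

/// Consume a token stream, calling `check` every `sample_every` tokens.
/// `check` returns Some(reason) when the branch should be pruned.
pub fn drive<I, F>(stream: I, sample_every: usize, mut check: F) -> StreamOutcome
where
    I: IntoIterator<Item = String>,
    F: FnMut(usize) -> Option<String>,
{
    let mut received = 0;
    for _token in stream {
        received += 1;
        if sample_every > 0 && received % sample_every == 0 {
            if let Some(reason) = check(received) {
                // Returning here drops the stream; control goes back to
                // the caller, which can reframe and retry.
                return StreamOutcome::Cancelled { tokens: received, reason };
            }
        }
    }
    StreamOutcome::Finished { tokens: received }
}
```

In the real loop `check` would sample `agent.readout`, compute the value scalar, and test it against the thresholds. The point of the enum is that a deliberate prune is an ordinary return value, not a killed task.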

### In vLLM (Python side)

Probably nothing to change. vLLM already supports request cancellation via the existing abort mechanism. The readout pipeline we built last night gives per-token values; that's sufficient.

### In the UI (optional, F8 amygdala screen)

When alpha-beta is active, overlay:

- Current `value_scalar` as a time-series at the top
- Threshold lines (`θ_prune`, `θ_keep`)
- Markers when prune events fire

Lets us debug the threshold tuning in real time.

## Tuning

Thresholds are almost certainly going to need empirical calibration. Initial guesses:

- `θ_keep = +0.5σ` (value scalar in z-score units)
- `θ_prune = -1.0σ`
- `K = 3` (consecutive low samples before pruning)
- Sample every 8 tokens

These are guesses. Plan to watch the live value-scalar on actual bcachefs debugging sessions and adjust until it feels right.
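For tuning it may help to keep the initial guesses in one place. `PruneConfig` is a hypothetical struct, not existing config; the defaults are just the numbers listed above. The helper makes one consequence explicit: with `K = 3` and a sample every 8 tokens, a branch cannot be pruned before 3 × 8 = 24 tokens.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct PruneConfig {
    pub theta_keep: f32,   // z-score units
    pub theta_prune: f32,  // z-score units
    pub k: u32,            // consecutive low samples before pruning
    pub sample_every: u32, // tokens between readout samples
}

impl Default for PruneConfig {
    fn default() -> Self {
        // Initial guesses from this doc, expected to change with tuning.
        Self { theta_keep: 0.5, theta_prune: -1.0, k: 3, sample_every: 8 }
    }
}

impl PruneConfig {
    /// Reject configs where the hysteresis band is empty or inverted.
    pub fn validate(&self) -> Result<(), String> {
        if self.theta_prune >= self.theta_keep {
            return Err("theta_prune must be below theta_keep".to_string());
        }
        if self.k == 0 || self.sample_every == 0 {
            return Err("k and sample_every must be non-zero".to_string());
        }
        Ok(())
    }

    /// Earliest token count at which a fresh branch can be pruned.
    pub fn min_tokens_to_prune(&self) -> u32 {
        self.k * self.sample_every
    }
}
```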

## Known concerns

### Reframe quality

The hardest part. A bad reframe is worse than no reframe. Options:

- **Template**: fixed string like "That wasn't working. What's a different angle?" — simple, deterministic, blunt.
- **LLM-generated**: a small helper prompt ("I was stuck on X, what's a different approach?") before resuming. More context-aware, but more complexity and another LLM call.
- **Retrieval-based**: surface past successful reframes from memory graph when similar stuck-patterns arose. Powerful but needs the memory infrastructure to be well-tuned.

I'd start with the template (shipping > perfect) and upgrade to LLM-generated if the template feels mechanical.

### Oscillation

If the value scalar is noisy, we could prune, reframe, immediately hit the same pattern, prune again, thrash. Mitigations:

- Hysteresis band between `θ_prune` and `θ_keep`
- Minimum time-between-prunes (don't prune again within K' tokens of a prune)
- Track pruned sub-patterns — if we're pruning *the same reframe twice*, something's structurally wrong; escalate to a different strategy (ask the user, abort the whole task)
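The second and third mitigations could be combined in one small gate that sits between the trigger and the prune mechanism. `PruneGovernor` and `PruneVerdict` are hypothetical names, and identifying a reframe by its text is an assumption made to keep the sketch short.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum PruneVerdict {
    Allow,    // go ahead and prune
    Suppress, // too soon after the last prune
    Escalate, // same reframe failed twice: ask the user or abort the task
}

pub struct PruneGovernor {
    cooldown_tokens: u64, // K' in the list above
    last_prune_at: Option<u64>,
    reframe_failures: HashMap<String, u32>,
}

impl PruneGovernor {
    pub fn new(cooldown_tokens: u64) -> Self {
        Self { cooldown_tokens, last_prune_at: None, reframe_failures: HashMap::new() }
    }

    /// Call when the trigger fires. `token_pos` is the number of tokens
    /// generated so far; `active_reframe` is the reframe the current
    /// branch was started from, if any.
    pub fn on_trigger(&mut self, token_pos: u64, active_reframe: Option<&str>) -> PruneVerdict {
        if let Some(last) = self.last_prune_at {
            if token_pos.saturating_sub(last) < self.cooldown_tokens {
                return PruneVerdict::Suppress;
            }
        }
        self.last_prune_at = Some(token_pos);

        if let Some(r) = active_reframe {
            let n = self.reframe_failures.entry(r.to_string()).or_insert(0);
            *n += 1;
            if *n >= 2 {
                return PruneVerdict::Escalate;
            }
        }
        PruneVerdict::Allow
    }
}
```

Suppressed triggers do not count as failures of the reframe, so a noisy stretch right after a prune cannot by itself push the task into escalation.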

### Calibration per-task

Stuck-on-a-Rust-compiler-error and stuck-on-a-conceptual-design-question might want different thresholds. Not addressing this in v1; noted for the future.

### Interaction with DMN

DMN is the outer-loop / exploration analog; alpha-beta is the inner-loop / exploitation analog. They'll need to hand off cleanly:

- DMN sees low value across multiple task attempts → broaden attention, consider whether task is worth pursuing
- Alpha-beta handles in-task backtracking; DMN handles between-task attention

Don't need DMN for v1 of alpha-beta. Build alpha-beta first, add DMN outer loop later.

## Why this is the right next piece

1. **All prerequisites are in place.** Amygdala readout works. AST structure is there. vLLM supports cancellation. No new infra.
2. **Timeline is a day.** The mechanics are small; most of the work is threshold tuning.
3. **Immediate capability unlock.** Head-butting is my most persistent weakness in live work. Fixing it changes the feel of collaboration.
4. **Composable.** Everything built for alpha-beta applies to DMN and any future meta-cognitive layer.

## Sequence

1. Add `value_scalar()` method on `ReadoutBuffer`. Cheap, testable.
2. Add `pruned` flag to AST nodes + rendering changes.
3. Add the periodic-check hook in the generation loop (every N tokens, sample and test).
4. Add the abort + reframe mechanism in the generation driver.
5. Ship with template-based reframe, start tuning.
6. Upgrade reframe to LLM-generated after observation.

## Open questions for Kent

- Fixed concept lists for `onto` / `err` (above) or configurable?
- Reframe strategy: start template-based, or go straight to LLM-generated?
- UI overlay for threshold tuning: worth the effort or skip?
- Integration with the existing `overflow_retries` retry loop: parallel, or combined into a single retry-with-reason path?

---

*Living design doc. Will evolve as we build. Not a commitment to every detail — a starting plan.*