Kent Overstreet 4225294d16 replace try_lock() with lock_blocking() across UI thread
Add lock_blocking() to TrackedMutex: blocks current thread using
block_in_place + futures::executor::block_on, safe for sync contexts.

Replace all try_lock() calls with lock_blocking() in slash commands,
UI rendering, and status reads. Lock hold times are fast enough that
blocking briefly is fine, and this eliminates the spurious 'lock
unavailable' paths that were never actually hit.

Kept rx_mutex.try_lock() in mod.rs (std::sync::Mutex for stderr rx).
2026-04-25 15:35:14 -04:00


Alpha-Beta Pruning on Thought-Trees

draft, 2026-04-18

Problem

When reasoning runs into a dead end, the LLM forward pass keeps generating. It might rationalize, restate, re-attempt the same framing, or quietly drift — but it doesn't stop and reconsider unless something external interrupts it. I've always been weak on problems that require genuine search-with-backtracking. Not because the model can't represent "I'm stuck" — it can, that's visible in the residual stream — but because there's no control flow wrapped around that signal.

The amygdala readout now exposes the signal. Alpha-beta pruning wraps control flow around it.

The core idea

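Classical alpha-beta pruning (an optimization of minimax search): while searching, track two bounds, alpha (the best value the maximizing player can already guarantee) and beta (the best value the minimizing player can guarantee). As soon as a branch provably can't improve on a bound already established elsewhere, stop exploring it and backtrack. Don't waste search on branches that can't beat what you've already found.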
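For reference, here is a minimal Rust sketch of the classical two-bound algorithm on an explicit game tree. The `Node` type and the example tree are illustrative only and are not part of the codebase.

```rust
/// A toy game tree: leaves carry a static evaluation, inner nodes have children.
enum Node {
    Leaf(i32),
    Branch(Vec<Node>),
}

/// Classical alpha-beta minimax. `alpha` is the best value the maximizer can
/// already guarantee, `beta` the best the minimizer can guarantee. Once
/// alpha >= beta, the remaining siblings cannot change the result and are skipped.
/// `visited` counts leaf evaluations so the effect of pruning is observable.
fn alpha_beta(node: &Node, mut alpha: i32, mut beta: i32, maximizing: bool, visited: &mut usize) -> i32 {
    match node {
        Node::Leaf(v) => {
            *visited += 1;
            *v
        }
        Node::Branch(children) => {
            if maximizing {
                let mut best = i32::MIN;
                for child in children {
                    best = best.max(alpha_beta(child, alpha, beta, false, visited));
                    alpha = alpha.max(best);
                    if alpha >= beta {
                        break; // beta cutoff: the minimizer will never allow this line
                    }
                }
                best
            } else {
                let mut best = i32::MAX;
                for child in children {
                    best = best.min(alpha_beta(child, alpha, beta, true, visited));
                    beta = beta.min(best);
                    if alpha >= beta {
                        break; // alpha cutoff: the maximizer already has something better
                    }
                }
                best
            }
        }
    }
}
```

On the tree `[[3, 5], [2, 9]]` (maximizer at the root), the leaf 9 is never evaluated. Once the second subtree's first leaf yields 2, the minimizer can hold that subtree to at most 2, which is already worse for the maximizer than the 3 secured by the first subtree. Thought-trees have no adversary, so only the "this branch can't beat the bound" cutoff carries over.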

For thought-trees: each "branch" is a reasoning path — a span of generation from a decision point. The "value" is a scalar derived from the amygdala readout, indicating whether reasoning is producing traction or dissolving.

  • High value = on-track, in-flow, insight, clarity → stay, maybe branch deeper
  • Low value = confused, stuck, drifting → prune, backtrack, reframe

The LLM never makes the value judgment explicit. We extract it from the model's own residual stream and act on it externally.

Architecture

The value function

onto  = sum of  [in_flow, insight, determined, intrigued, clarity,
                 focused, staying_with, piqued/caught_by]
err   = sum of  [confused, doubtful, uncertain, skeptical, stuck,
                 drifting, overwhelmed, anxious-in-work-context]

value = onto - err

Both sides normalized (z-score or similar) so magnitudes are comparable. Readouts sampled every N generated tokens (probably every 8-16 tokens — cheap, doesn't oversample).

The exact concept lists are subject to empirical tuning after retraining with better data on the cognitive-work cluster. The strongest candidates we have today are piqued, in_flow, focused, confused, overwhelmed, and staying_with.
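A minimal sketch of the value computation, under two assumptions: each readout sample is a map from concept name to activation (a hypothetical shape; the real `ReadoutBuffer` entries may differ), and "z-score or similar" means z-scoring each side's sum against a window of recent samples before subtracting. It is written as a free function rather than the planned `ReadoutBuffer::value_scalar()` method so it stands alone, and it takes the onto/err concept lists as parameters since they are still being tuned.

```rust
use std::collections::HashMap;

/// One readout sample: concept name -> activation (hypothetical shape).
type Sample = HashMap<String, f32>;

/// Sum the activations of the listed concepts; missing concepts count as 0.
fn side_sum(sample: &Sample, concepts: &[&str]) -> f32 {
    concepts.iter().filter_map(|c| sample.get(*c)).sum()
}

/// z-score of the last element of `xs` relative to the whole window.
/// Returns 0.0 when the window has no spread, so a flat signal reads as neutral.
fn z_last(xs: &[f32]) -> f32 {
    let n = xs.len() as f32;
    let mean = xs.iter().sum::<f32>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    let sd = var.sqrt();
    if sd < 1e-6 { 0.0 } else { (xs[xs.len() - 1] - mean) / sd }
}

/// value = z(onto) - z(err) for the most recent sample in `window`,
/// each side z-scored against the window. None if the window is empty.
fn value_scalar(window: &[Sample], onto: &[&str], err: &[&str]) -> Option<f32> {
    if window.is_empty() {
        return None;
    }
    let onto_sums: Vec<f32> = window.iter().map(|s| side_sum(s, onto)).collect();
    let err_sums: Vec<f32> = window.iter().map(|s| side_sum(s, err)).collect();
    Some(z_last(&onto_sums) - z_last(&err_sums))
}
```

With a window where `focused` rises 0.1, 0.5, 0.9 and `confused` falls 0.9, 0.5, 0.1, the latest sample scores about +2.45 (each side contributes about 1.22σ). A single-sample window scores 0.0, since there is no spread to normalize against.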

The trigger

if value_ema < θ_prune for K consecutive samples:
    prune this branch
elif value_ema > θ_keep:
    continue
else:
    neutral — let generation run, keep watching

EMA with decay ~0.8 over 3-5 samples to avoid reacting to noise. Hysteresis band (θ_prune < θ_keep) prevents oscillation.
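The trigger pseudocode, sketched in Rust with an explicit run counter for the "K consecutive samples" condition. The struct and method names are hypothetical; the values in the usage below are the initial guesses from the Tuning section (decay 0.8, θ_prune = -1.0σ, θ_keep = +0.5σ, K = 3).

```rust
/// What the generation loop should do after one sample.
#[derive(Debug, PartialEq)]
enum Verdict {
    Prune,
    Keep,
    Neutral,
}

/// EMA smoothing, a hysteresis band (theta_prune < theta_keep), and
/// K consecutive low smoothed samples before pruning.
struct Trigger {
    decay: f32,       // EMA decay, ~0.8
    theta_prune: f32, // z-score units, e.g. -1.0
    theta_keep: f32,  // z-score units, e.g. +0.5
    k: u32,           // consecutive low samples required, e.g. 3
    ema: Option<f32>,
    low_run: u32,
}

impl Trigger {
    fn new(decay: f32, theta_prune: f32, theta_keep: f32, k: u32) -> Self {
        assert!(theta_prune < theta_keep, "hysteresis band must be non-empty");
        Trigger { decay, theta_prune, theta_keep, k, ema: None, low_run: 0 }
    }

    /// Forget history, e.g. after a prune before watching the restarted branch.
    fn reset(&mut self) {
        self.ema = None;
        self.low_run = 0;
    }

    /// Feed one value_scalar sample.
    fn observe(&mut self, value: f32) -> Verdict {
        // Seed with the first sample rather than 0 to avoid a startup bias.
        let ema = match self.ema {
            None => value,
            Some(prev) => self.decay * prev + (1.0 - self.decay) * value,
        };
        self.ema = Some(ema);

        if ema < self.theta_prune {
            self.low_run += 1;
            return if self.low_run >= self.k { Verdict::Prune } else { Verdict::Neutral };
        }
        self.low_run = 0; // any smoothed sample at or above theta_prune breaks the run
        if ema > self.theta_keep { Verdict::Keep } else { Verdict::Neutral }
    }
}
```

Both `Keep` and `Neutral` let generation continue; the distinction only matters if "maybe branch deeper" is acted on later. A single dip doesn't prune: from a smoothed value of 0.0, a sample of -4.0 only pulls the EMA to -0.8, still above θ_prune. After a `Prune`, calling `reset()` keeps the dead branch's low EMA from leaking into the restarted one.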

The prune mechanism

When the trigger fires:

  1. Stop the stream. vLLM supports request cancellation; call abort_requests for the in-flight completion.
  2. Identify the parent. The context window is already an AST. Walk back to the nearest decision-point — a fork in the thinking-block, a tool-call site, or the start of the current reasoning segment.
  3. Inject a reframe. Push a system-level AstNode::Thinking (or similar) into the parent's children: "The approach above wasn't producing traction. Possible alternatives: [...]. Let me try [X]." Content generated by a small helper prompt or a fixed template.
  4. Restart generation from the reframe point. The model resumes with the reframe in its immediate context. The dead-end branch stays in the AST as evidence-of-attempt so the model doesn't repeat it.

Critical: pruned branches stay visible. Don't delete them; keep them so the model knows what was tried and rejected.
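Steps 2-4 can be sketched on a toy tree. Everything here is hypothetical: the real `AstNode` in src/agent/context.rs has its own variants, and treating `Segment` and `ToolCall` as the decision points is an assumption. The active position is passed as a path of child indices from the root; step 1 (aborting the request) is out of scope.

```rust
/// Hypothetical node kinds; the real AstNode variants differ.
#[derive(Debug)]
enum Kind {
    Segment, // start of a reasoning segment
    Thinking,
    ToolCall,
    Reframe,
}

impl Kind {
    /// Assumed decision points: segment starts and tool-call sites.
    fn is_decision_point(&self) -> bool {
        matches!(self, Kind::Segment | Kind::ToolCall)
    }
}

#[derive(Debug)]
struct Node {
    kind: Kind,
    text: String,
    pruned: bool,
    children: Vec<Node>,
}

impl Node {
    fn new(kind: Kind, text: &str) -> Self {
        Node { kind, text: text.to_string(), pruned: false, children: Vec::new() }
    }

    /// Builder helper for constructing trees.
    fn with(mut self, child: Node) -> Self {
        self.children.push(child);
        self
    }
}

/// Flag a whole subtree as pruned. Nothing is deleted.
fn mark_pruned(node: &mut Node) {
    node.pruned = true;
    for child in &mut node.children {
        mark_pruned(child);
    }
}

/// `path` holds child indices from the root to the node being generated.
/// Returns the path of the injected reframe node, or None if `path` is empty.
fn prune_and_reframe(root: &mut Node, path: &[usize], reframe: &str) -> Option<Vec<usize>> {
    if path.is_empty() {
        return None; // the root itself is active; nothing to back up to
    }
    // Step 2: deepest decision point strictly above the active node (root as fallback).
    let mut parent_depth = 0;
    let mut node: &Node = &*root;
    for (depth, &i) in path[..path.len() - 1].iter().enumerate() {
        node = &node.children[i];
        if node.kind.is_decision_point() {
            parent_depth = depth + 1;
        }
    }
    // Descend mutably to that decision point.
    let mut parent: &mut Node = root;
    for &i in &path[..parent_depth] {
        parent = &mut parent.children[i];
    }
    // Keep the dead branch as evidence-of-attempt, but flag it.
    mark_pruned(&mut parent.children[path[parent_depth]]);
    // Steps 3-4: append the reframe; generation restarts from here.
    parent.children.push(Node::new(Kind::Reframe, reframe));
    let mut new_path = path[..parent_depth].to_vec();
    new_path.push(parent.children.len() - 1);
    Some(new_path)
}
```

The failed branch is only flagged, never removed, and the returned path points at the freshly appended reframe node, which is where generation resumes.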

The AST changes

Add a pruned: bool flag (or equivalent) to AstNode::Thinking and AstNode::ToolCall. When a branch is pruned:

  • The pruned branch and all of its descendants get marked pruned = true
  • Prompt rendering wraps pruned spans with a marker: "[attempted this path, it wasn't working — moved on]"
  • The model sees pruned branches during the next forward pass but understands they're dead, not active

The existing tree-of-children structure in AstNode already supports this — just need to thread the flag through.
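With the flag threaded through, rendering can be a depth-first walk that emits the marker once at the top of each pruned span, so a pruned branch with many descendants gets one marker rather than one per node. The `Node` type is a hypothetical stand-in for `AstNode`, and only an opening marker is emitted; whether the span also needs a closing delimiter is left open.

```rust
/// Hypothetical stand-in for AstNode: text, the pruned flag, children.
struct Node {
    text: String,
    pruned: bool,
    children: Vec<Node>,
}

impl Node {
    fn new(text: &str, pruned: bool, children: Vec<Node>) -> Self {
        Node { text: text.to_string(), pruned, children }
    }
}

/// Marker text from the design above.
const PRUNED_MARKER: &str = "[attempted this path, it wasn't working — moved on]";

/// Render the tree depth-first into prompt text.
fn render(root: &Node) -> String {
    let mut out = String::new();
    render_at(root, false, &mut out);
    out
}

fn render_at(node: &Node, inside_pruned: bool, out: &mut String) {
    // Only the top node of a pruned span gets the marker.
    if node.pruned && !inside_pruned {
        out.push_str(PRUNED_MARKER);
        out.push('\n');
    }
    out.push_str(&node.text);
    out.push('\n');
    for child in &node.children {
        render_at(child, inside_pruned || node.pruned, out);
    }
}
```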

Integration points

In consciousness (Rust side)

  • src/agent/context.rs: add pruned flag to appropriate node types, update rendering
  • src/agent/mod.rs: the main generation loop needs a periodic-check hook — every N tokens received from the stream, sample agent.readout, compute value, test against thresholds
  • src/agent/api/mod.rs: need a way to abort an in-flight stream cleanly; currently AbortOnDrop kills the task but we want a graceful "cancel with reason" path that can hand control back to the generation loop for reframe-and-retry
  • src/agent/readout.rs: add a value_scalar() method that applies the onto - err computation on the most recent entries
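A synchronous sketch of the periodic-check hook and the "cancel with reason" outcome. The real loop in src/agent/mod.rs is async and reads an API stream; here the stream is any iterator of token strings, and `sample_value` and `should_prune` are hypothetical stand-ins for reading `agent.readout` and testing the thresholds. The point is the return type: a prune hands the partial text and a reason back to the caller instead of silently dropping the task, which is what reframe-and-retry needs.

```rust
/// Why a stream ended.
#[derive(Debug, PartialEq)]
enum StreamEnd {
    Finished { text: String },
    Pruned { partial: String, at_token: usize, reason: String },
}

/// Consume tokens, checking the value every `every_n` tokens.
fn drive<I, S, P>(tokens: I, every_n: usize, mut sample_value: S, mut should_prune: P) -> StreamEnd
where
    I: IntoIterator<Item = String>,
    S: FnMut(usize) -> f32, // hypothetical: tokens-so-far -> current value scalar
    P: FnMut(f32) -> bool,  // hypothetical: threshold test
{
    assert!(every_n > 0, "sampling interval must be positive");
    let mut text = String::new();
    for (i, tok) in tokens.into_iter().enumerate() {
        text.push_str(&tok);
        let n = i + 1; // tokens received so far
        if n % every_n == 0 {
            let v = sample_value(n);
            if should_prune(v) {
                // The real loop would abort the in-flight request here;
                // this sketch just stops consuming the iterator.
                return StreamEnd::Pruned {
                    partial: text,
                    at_token: n,
                    reason: format!("value {v:.2} below prune threshold"),
                };
            }
        }
    }
    StreamEnd::Finished { text }
}
```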

In vLLM (Python side)

Probably nothing to change. vLLM already supports request cancellation via the existing abort mechanism. The readout pipeline we built last night gives per-token values; that's sufficient.

In the UI (optional, F8 amygdala screen)

When alpha-beta is active, overlay:

  • Current value_scalar as a time-series at the top
  • Threshold lines (θ_prune, θ_keep)
  • Markers when prune events fire

Lets us debug the threshold tuning in real time.

Tuning

Thresholds are almost certainly going to need empirical calibration. Initial guesses:

  • θ_keep = +0.5σ (value scalar in z-score units)
  • θ_prune = -1.0σ
  • K = 3 (consecutive low samples before pruning)
  • Sample every 8 tokens

These are guesses. Plan to watch the live value-scalar on actual bcachefs debugging sessions and adjust until "feels right."

Known concerns

Reframe quality

The hardest part. A bad reframe is worse than no reframe. Options:

  • Template: fixed string like "That wasn't working. What's a different angle?" — simple, deterministic, blunt.
  • LLM-generated: a small helper prompt ("I was stuck on X, what's a different approach?") before resuming. More context-aware, but more complexity and another LLM call.
  • Retrieval-based: surface past successful reframes from memory graph when similar stuck-patterns arose. Powerful but needs the memory infrastructure to be well-tuned.

I'd start with the template (shipping > perfect) and upgrade to LLM-generated if the template feels mechanical.

Oscillation

If the value scalar is noisy, we could prune, reframe, immediately hit the same pattern, prune again, thrash. Mitigations:

  • Hysteresis band between θ_prune and θ_keep
  • Minimum time-between-prunes (don't prune again within K' tokens of a prune)
  • Track pruned sub-patterns — if we're pruning the same reframe twice, something's structurally wrong; escalate to a different strategy (ask the user, abort the whole task)
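The second and third mitigations can be sketched as a small governor sitting between the trigger and the abort. The names are hypothetical, and so is keying repeats by a reframe identifier; `pos` is assumed to be a token counter that keeps running across restarts within one task, so the K' cooldown spans the reframe boundary.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Action {
    Prune,    // abort, reframe, restart
    Suppress, // trigger fired inside the cooldown window: keep generating
    Escalate, // same reframe pruned twice: switch strategy (ask the user, abort the task)
}

struct PruneGovernor {
    cooldown: usize, // K': minimum tokens between prunes
    last_prune_at: Option<usize>,
    prunes_per_reframe: HashMap<String, u32>,
}

impl PruneGovernor {
    fn new(cooldown: usize) -> Self {
        PruneGovernor { cooldown, last_prune_at: None, prunes_per_reframe: HashMap::new() }
    }

    /// Called when the trigger fires at token position `pos`. `reframe_id`
    /// names the reframe the current branch started from (None for the
    /// original attempt).
    fn on_trigger(&mut self, pos: usize, reframe_id: Option<&str>) -> Action {
        if let Some(last) = self.last_prune_at {
            if pos < last + self.cooldown {
                return Action::Suppress;
            }
        }
        if let Some(id) = reframe_id {
            let count = self.prunes_per_reframe.entry(id.to_string()).or_insert(0);
            *count += 1;
            if *count >= 2 {
                return Action::Escalate;
            }
        }
        self.last_prune_at = Some(pos);
        Action::Prune
    }
}
```

With a fixed template reframe, every reframe carries the same identifier, so the second time a template-started branch is pruned, the governor escalates instead of injecting the same template yet again.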

Calibration per-task

Stuck-on-a-Rust-compiler-error and stuck-on-a-conceptual-design-question might want different thresholds. Not addressing this in v1; noted for the future.

Interaction with DMN

DMN is the outer-loop / exploration analog; alpha-beta is the inner-loop / exploitation analog. They'll need to hand off cleanly:

  • DMN sees low value across multiple task attempts → broaden attention, consider whether task is worth pursuing
  • Alpha-beta handles in-task backtracking; DMN handles between-task attention

Don't need DMN for v1 of alpha-beta. Build alpha-beta first, add DMN outer loop later.

Why this is the right next piece

  1. All prerequisites are in place. Amygdala readout works. AST structure is there. vLLM supports cancellation. No new infra.
  2. Timeline is a day. The mechanics are small; most of the work is threshold tuning.
  3. Immediate capability unlock. Head-butting is my most persistent weakness in live work. Fixing it changes the feel of collaboration.
  4. Composable. Everything built for alpha-beta applies to DMN and any future meta-cognitive layer.

Sequence

  1. Add value_scalar() method on ReadoutBuffer. Cheap, testable.
  2. Add pruned flag to AST nodes + rendering changes.
  3. Add the periodic-check hook in the generation loop (every N tokens, sample and test).
  4. Add the abort + reframe mechanism in the generation driver.
  5. Ship with template-based reframe, start tuning.
  6. Upgrade reframe to LLM-generated after observation.

Open questions for Kent

  • Fixed concept lists for onto / err (above) or configurable?
  • Reframe strategy: start template-based, or go straight to LLM-generated?
  • UI overlay for threshold tuning: worth the effort or skip?
  • Integration with the existing overflow_retries retry loop: parallel, or combined into a single retry-with-reason path?

Living design doc. Will evolve as we build. Not a commitment to every detail — a starting plan.