Design document for wiring the model's internal uncertainty, error detection, and emotional valence circuits to the observe agent. Based on contrastive activation probing (CAA, ACL 2024). Most of the infrastructure already exists in extract_steering_vector.py and vllm_export_hook.py — the bottleneck is building contrastive datasets.
Amygdala: Evaluative Signal from Internal Activations
Overview
Wire the model's internal evaluative circuits to the observe agent, giving the system a real-time sense of uncertainty, error detection, and emotional valence. This replaces the current blind linear generation with an adaptive system that shifts into reflective/search mode when something feels off.
The key insight: the model already has these signals internally. We just need to read them and act on them.
Architecture
Linear mode (fast, cheap, default)
    |
    |  amygdala fires — uncertainty spike, error signal, confidence drop
    v
Reflective mode (branch, explore, summarize)
    |
    |  resolution found — summarize, graft back
    v
Return to linear mode
The observe agent reads the amygdala signal and triggers mode transitions. Low uncertainty → keep going. High uncertainty → fan out, explore, summarize. The summaries from pruned branches become compressed lessons that inform future search.
Technique: Contrastive Activation Probing
Based on Contrastive Activation Addition (Rimsky et al., ACL 2024):
- Build contrastive pairs (e.g. confident vs uncertain responses)
- Extract residual stream activations at target layers
- Compute difference-in-means → this is the probe direction
- At runtime: dot product of current activation with probe vector
- The scalar output is the signal strength
The same vectors used for steering (adding to activations) work for reading (dot product with activations). We only need the read side.
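The extract-then-read pipeline can be sketched in a few lines of numpy. Function names here are illustrative, not the actual API of extract_steering_vector.py:

```python
import numpy as np

def extract_probe(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means over contrastive activations -> probe direction.

    pos_acts/neg_acts: (n_pairs, hidden_dim) residual-stream activations
    at one layer, e.g. from confident vs uncertain completions.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit norm for reading

def read_signal(activation: np.ndarray, probe: np.ndarray) -> float:
    """Runtime read: dot product of the current activation with the probe."""
    return float(activation @ probe)

# Toy demo: two activation clusters separated along one known axis.
rng = np.random.default_rng(0)
hidden = 64
axis = np.zeros(hidden)
axis[0] = 1.0
pos = rng.normal(0.0, 0.1, (200, hidden)) + 3.0 * axis
neg = rng.normal(0.0, 0.1, (200, hidden)) - 3.0 * axis
probe = extract_probe(pos, neg)
```

The same `probe` vector would steer if added to the residual stream; for the amygdala we only ever call `read_signal`.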
What We Already Have
training/extract_steering_vector.py — Loads the Qwen 27B model via CUDA IPC handles from vLLM, extracts hidden states at multiple layers, computes contrastive directions with consistency checks. Currently configured for "listening vs suggesting" but the infrastructure is general.
training/vllm_export_hook.py — Patches vLLM's model runner to export CUDA IPC handles after model loading. Gives us zero-copy access to all model parameters from a separate process.
The observe agent — Already watches the system. Currently observes and journals. With an amygdala signal, it observes, detects, and acts — triggering reflective mode.
Signals to Extract
1. Uncertainty
When the model doesn't know or is guessing.
Contrastive pairs: Questions the model answers correctly (confident) vs questions it gets wrong (uncertain). Generate by running the 27B on a QA benchmark, split by correctness.
Validation: The internal uncertainty signal should correlate with but outperform logprob entropy — it fires before generation, not after. (Gottesman & Geva 2024)
2. Error Detection
When the model recognizes something is wrong in code or reasoning.
Contrastive pairs: Correct vs subtly buggy code, presented for evaluation. Can source from HumanEval/CodeContests or write our own.
Key finding: Error detection directions are asymmetric — they reliably detect "something's wrong" (F1: 0.821) but are weaker at confirming "this is correct" (F1: 0.504). Perfect for an amygdala — we want fire-on-error, not fire-on-confidence. (ICLR 2026)
3. Emotional Valence
Internal affective state — engagement, frustration, warmth.
Contrastive pairs: Journal entries with explicit emotion tags, giving labeled data that maps our own internal states to the conversations that produced them. Nobody else has this dataset.
Key finding: Emotional representations peak at mid-network layers (10-15 for 7B scale), persist for hundreds of tokens, and are linearly separable with ~90% accuracy using simple probes. (Decoding Emotion in the Deep, LLaMAs Have Feelings Too, ACL 2025)
Implementation Plan
Phase 1: Build Contrastive Datasets
~200 pairs per signal. A few hours of curation.
- Uncertainty: Run 27B on MMLU or similar, split by correctness
- Error detection: Correct vs buggy code pairs
- Emotional valence: Curate from journal entries with emotion tags
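The correctness-split construction for the uncertainty dataset might look like this (data structures and helper names are hypothetical, not existing project code):

```python
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    positive: str  # transcript where the model answered correctly (confident)
    negative: str  # transcript where it answered incorrectly (uncertain)

def split_by_correctness(records):
    """records: (prompt, model_answer, gold_answer) triples from running
    the 27B on a QA benchmark. Returns the two pools for pairing."""
    confident, uncertain = [], []
    for prompt, answer, gold in records:
        pool = confident if answer.strip() == gold.strip() else uncertain
        pool.append(f"{prompt}\n{answer}")
    return confident, uncertain

def make_pairs(confident, uncertain, n=200):
    """Zip the pools into at most n contrastive pairs."""
    return [ContrastivePair(p, q) for p, q in zip(confident[:n], uncertain[:n])]

# Toy records standing in for a real benchmark run.
records = [
    ("Q: 2+2?", "4", "4"),
    ("Q: capital of France?", "Lyon", "Paris"),
    ("Q: 3*3?", "9", "9"),
    ("Q: largest planet?", "Saturn", "Jupiter"),
]
pairs = make_pairs(*split_by_correctness(records))
```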
Phase 2: Extract Probe Vectors
Modify extract_steering_vector.py for each signal type. Already
supports multi-layer extraction with consistency validation.
- Run extraction at layers 16, 24, 32, 40, 48
- Select layer with highest magnitude × consistency
- Save probe vectors as tensors
Literature says mid-network layers carry the strongest signal for evaluative states. Expect layers 16-32 for the 27B.
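One plausible reading of "highest magnitude × consistency" is: norm of the mean contrastive direction, weighted by the fraction of individual pairs that agree with its sign. A sketch under that assumption:

```python
import numpy as np

def layer_score(pos_acts, neg_acts):
    """Score a candidate layer: magnitude of the mean contrastive
    direction times the fraction of per-pair differences that point
    the same way (one interpretation of magnitude x consistency)."""
    diffs = pos_acts - neg_acts              # (n_pairs, hidden_dim)
    mean_dir = diffs.mean(axis=0)
    magnitude = float(np.linalg.norm(mean_dir))
    consistency = float((diffs @ mean_dir > 0).mean())
    return magnitude * consistency

def select_layer(acts_by_layer):
    """acts_by_layer: {layer_idx: (pos_acts, neg_acts)} -> best layer."""
    return max(acts_by_layer, key=lambda l: layer_score(*acts_by_layer[l]))

# Toy demo: a mid-network layer with clean separation vs a noisy late layer.
rng = np.random.default_rng(1)
hidden = 32
signal = np.zeros(hidden)
signal[3] = 1.0
acts_by_layer = {
    24: (rng.normal(0, 0.2, (100, hidden)) + 2.0 * signal,
         rng.normal(0, 0.2, (100, hidden)) - 2.0 * signal),
    48: (rng.normal(0, 0.2, (100, hidden)),
         rng.normal(0, 0.2, (100, hidden))),
}
```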
Phase 3: Runtime Probe in vLLM
Add a forward-pass hook alongside the existing weight export hook. The computation is trivial — a dot product per layer per token:
signal = residual_stream[layer] @ probe_vector
For 3 signals at 3 layers = 9 dot products per token. Less compute than a single attention head. Expose as sideband alongside token output.
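The sideband bookkeeping is small enough to sketch in full. In the real system `on_layer` would be driven by a vLLM forward hook; here it is called by hand, and the class name and interface are ours:

```python
import numpy as np

class ProbeSideband:
    """Reads evaluative signals off residual-stream activations.

    probes: {signal_name: (layer_idx, unit_probe_vector)}. The hook
    machinery that would call on_layer from inside the forward pass
    is not shown.
    """
    def __init__(self, probes):
        self.probes = probes
        self.signals = {}          # latest per-token value per signal

    def on_layer(self, layer_idx, residual):
        """Called with one layer's residual stream for the current token."""
        for name, (layer, vec) in self.probes.items():
            if layer == layer_idx:
                self.signals[name] = float(residual @ vec)

# Toy demo with orthogonal probe directions at two layers.
hidden = 16
u = np.zeros(hidden)
u[0] = 1.0
e = np.zeros(hidden)
e[1] = 1.0
sideband = ProbeSideband({"uncertainty": (16, u), "error": (24, e)})

# Simulate one token's forward pass touching the probed layers.
sideband.on_layer(16, 0.7 * u + 0.1 * e)
sideband.on_layer(24, 2.0 * e)
```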
Phase 4: Wire to Observe Agent
The observe agent reads the sideband signal. Threshold tuning determines when to trigger reflective mode. Signal strength modulates search depth — mild uncertainty gets a quick check, high uncertainty gets full branching.
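The threshold-to-depth mapping could be as simple as a clamped ramp. The threshold values below are placeholders to be tuned against real probe outputs:

```python
def search_depth(signal, low=0.3, high=0.8, max_depth=4):
    """Map amygdala signal strength to reflective search depth.

    Below `low`: stay in linear mode (depth 0). Above `high`: full
    branching. In between: linear ramp. All constants are hypothetical
    tuning knobs, not measured values.
    """
    if signal < low:
        return 0
    if signal >= high:
        return max_depth
    frac = (signal - low) / (high - low)
    return max(1, round(frac * max_depth))
```

Mild uncertainty then gets a depth-1 or depth-2 check; a strong spike gets the full fan-out.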
Organic Search, Not Alpha-Beta
The reflective mode isn't formal tree search. It's more stochastic and organic:
- Branch at AST-level decision points (tool calls, approach choices), not token-level
- Explore multiple continuations for K steps each
- Summarize what each branch learned — the summaries are the intelligence, not the branches themselves
- Let summaries inform subsequent exploration
- Collapse back to linear mode when resolution is found
The AST gives us structural awareness of decision nodes vs continuation nodes — branch where it matters, not everywhere.
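The branch-summarize-inform loop, with rollout and compression reduced to stand-in callables (every name here is hypothetical):

```python
def organic_search(branch_points, explore, summarize, k_steps=3):
    """Explore each branch for k steps, keep only its summary, and let
    accumulated summaries inform later branches. The branch itself is
    discarded; the summary is what survives.
    """
    lessons = []
    for point in branch_points:
        trajectory = explore(point, k_steps, lessons)   # sees prior lessons
        lessons.append(summarize(trajectory))           # branch pruned here
    return lessons

# Toy stand-ins: explore records its inputs, summarize compresses to a tag.
def explore(point, k, lessons):
    return {"point": point, "steps": k, "context": list(lessons)}

def summarize(traj):
    return f"lesson({traj['point']})"

lessons = organic_search(["approach-A", "approach-B"], explore, summarize)
```

The second branch's `explore` call receives the first branch's lesson in `lessons`, which is the "summaries inform subsequent exploration" step.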
Key Papers
Technique
- Steering Llama 2 via Contrastive Activation Addition — Rimsky et al., ACL 2024. The foundational technique.
- Representation Engineering Survey — Comprehensive overview of the field.
Emotion & Evaluative Signals
- Decoding Emotion in the Deep — Probing on Qwen3 and LLaMA3. Signal peaks mid-network, persists for hundreds of tokens, linearly separable.
- LLaMAs Have Feelings Too — ACL 2025. Linear SVM probes hit ~90% accuracy on sentiment.
- Mechanistic Interpretability of Code Correctness — ICLR 2026. SAEs for error detection. Asymmetric: detects errors better than it confirms correctness.
Uncertainty
- Between the Layers Lies the Truth — Uncertainty from intra-layer representations, pre-generation.
- Probing Hidden States for Calibrated Predictions — Hidden state probes resist alignment training. More robust than logit-based methods.
Tooling
- Anthropic Circuit Tracing — Open-source, works with any open-weights model. For deeper investigation of which features to probe.
- On the Biology of a Large Language Model — Anthropic's findings on internal circuits.
Libraries
- steering-vectors — pip install, works with any HuggingFace model. Best for Phase 1.
- nrimsky/CAA — Original paper implementation. Good reference.
- nnterp — NNsight wrapper, supports Qwen, one-line activation steering.
- nnsight — General-purpose activation interception.
- circuit-tracer — Anthropic's open-source circuit tracing.
- TransformerLens — The OG interpretability library.
- Dialz — ACL 2025 toolkit with pre-built contrastive datasets.
The Bigger Picture
The amygdala is one component of the sensory architecture designed on Feb 17, 2026. The signal landscape (arousal, attention pressure, memory load, mode awareness) uses the same infrastructure — slowly varying float values that modulate cognition below the symbolic level. Each new probe vector is another sense.
With recurrence (application-level looping + reflective nodes in the AST) and the amygdala triggering adaptive depth, a well-trained 27B specialist with external memory could match much larger models on tasks that matter to us.
The pieces exist. The infrastructure is built. The bottleneck is contrastive pairs.