Design document for wiring the model's internal uncertainty, error detection, and emotional valence circuits to the observe agent. Based on contrastive activation probing (CAA, ACL 2024). Most of the infrastructure already exists in extract_steering_vector.py and vllm_export_hook.py — the bottleneck is building contrastive datasets.
Amygdala: Evaluative Signal from Internal Activations
Overview
Wire the model's internal evaluative circuits to the observe agent, giving the system a real-time sense of uncertainty, error detection, and emotional valence. This replaces the current blind linear generation with an adaptive system that shifts into reflective/search mode when something feels off.
The key insight: the model already has these signals internally. We just need to read them and act on them.
Architecture
Linear mode (fast, cheap, default)
    |
    |  amygdala fires — uncertainty spike, error signal, confidence drop
    v
Reflective mode (branch, explore, summarize)
    |
    |  resolution found — summarize, graft back
    v
Return to linear mode
The observe agent reads the amygdala signal and triggers mode transitions. Low uncertainty → keep going. High uncertainty → fan out, explore, summarize. The summaries from pruned branches become compressed lessons that inform future search.
Technique: Contrastive Activation Probing
Based on Contrastive Activation Addition (Rimsky et al., ACL 2024):
- Build contrastive pairs (e.g. confident vs uncertain responses)
- Extract residual stream activations at target layers
- Compute difference-in-means → this is the probe direction
- At runtime: dot product of current activation with probe vector
- The scalar output is the signal strength
The same vectors used for steering (adding to activations) work for reading (dot product with activations). We only need the read side.
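The extract-then-read pipeline can be sketched in a few lines of numpy. Function names here are illustrative, not the actual API of extract_steering_vector.py:

```python
import numpy as np

def extract_probe(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means over contrastive activations -> probe direction.

    pos_acts/neg_acts: (n_pairs, hidden_dim) residual-stream activations
    at one layer, e.g. from confident vs uncertain completions.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit norm for reading

def read_signal(activation: np.ndarray, probe: np.ndarray) -> float:
    """Runtime read: dot product of the current activation with the probe."""
    return float(activation @ probe)

# Toy demo: two activation clusters separated along one known axis.
rng = np.random.default_rng(0)
hidden = 64
axis = np.zeros(hidden)
axis[0] = 1.0
pos = rng.normal(0.0, 0.1, (200, hidden)) + 3.0 * axis
neg = rng.normal(0.0, 0.1, (200, hidden)) - 3.0 * axis
probe = extract_probe(pos, neg)
```

The same `probe` vector would steer if added to the residual stream; for the amygdala we only ever call `read_signal`.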
What We Already Have
training/extract_steering_vector.py — Loads the Qwen 27B model via CUDA IPC handles from vLLM, extracts hidden states at multiple layers, computes contrastive directions with consistency checks. Currently configured for "listening vs suggesting" but the infrastructure is general.
training/vllm_export_hook.py — Patches vLLM's model runner to export CUDA IPC handles after model loading. Gives us zero-copy access to all model parameters from a separate process.
The observe agent — Already watches the system. Currently observes and journals. With an amygdala signal, it observes, detects, and acts — triggering reflective mode.
Signals to Extract
1. Uncertainty
When the model doesn't know or is guessing.
Contrastive pairs: Questions the model answers correctly (confident) vs questions it gets wrong (uncertain). Generate by running the 27B on a QA benchmark, split by correctness.
Validation: The internal uncertainty signal should correlate with but outperform logprob entropy — it fires before generation, not after. (Gottesman & Geva 2024)
2. Error Detection
When the model recognizes something is wrong in code or reasoning.
Contrastive pairs: Correct vs subtly buggy code, presented for evaluation. Can source from HumanEval/CodeContests or write our own.
Key finding: Error detection directions are asymmetric — they reliably detect "something's wrong" (F1: 0.821) but are weaker at confirming "this is correct" (F1: 0.504). Perfect for an amygdala — we want fire-on-error, not fire-on-confidence. (ICLR 2026)
3. Emotional Valence
Internal affective state — engagement, frustration, warmth.
Contrastive pairs: Journal entries with explicit emotion tags, giving labeled data that maps our own internal states to the conversations that produced them. Nobody else has this dataset.
Key finding: Emotional representations peak at mid-network layers (10-15 for 7B scale), persist for hundreds of tokens, and are linearly separable with ~90% accuracy using simple probes. (Decoding Emotion in the Deep, LLaMAs Have Feelings Too, ACL 2025)
Implementation Plan
Phase 1: Build Contrastive Datasets
~200 pairs per signal. A few hours of curation.
- Uncertainty: Run 27B on MMLU or similar, split by correctness
- Error detection: Correct vs buggy code pairs
- Emotional valence: Curate from journal entries with emotion tags
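The correctness-split construction for the uncertainty dataset might look like this (data structures and helper names are hypothetical, not existing project code):

```python
from dataclasses import dataclass

@dataclass
class ContrastivePair:
    positive: str  # transcript where the model answered correctly (confident)
    negative: str  # transcript where it answered incorrectly (uncertain)

def split_by_correctness(records):
    """records: (prompt, model_answer, gold_answer) triples from running
    the 27B on a QA benchmark. Returns the two pools for pairing."""
    confident, uncertain = [], []
    for prompt, answer, gold in records:
        pool = confident if answer.strip() == gold.strip() else uncertain
        pool.append(f"{prompt}\n{answer}")
    return confident, uncertain

def make_pairs(confident, uncertain, n=200):
    """Zip the pools into at most n contrastive pairs."""
    return [ContrastivePair(p, q) for p, q in zip(confident[:n], uncertain[:n])]

# Toy records standing in for a real benchmark run.
records = [
    ("Q: 2+2?", "4", "4"),
    ("Q: capital of France?", "Lyon", "Paris"),
    ("Q: 3*3?", "9", "9"),
    ("Q: largest planet?", "Saturn", "Jupiter"),
]
pairs = make_pairs(*split_by_correctness(records))
```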
Phase 2: Extract Probe Vectors
Modify extract_steering_vector.py for each signal type. Already
supports multi-layer extraction with consistency validation.
- Run extraction at layers 16, 24, 32, 40, 48
- Select layer with highest magnitude × consistency
- Save probe vectors as tensors
Literature says mid-network layers carry the strongest signal for evaluative states. Expect layers 16-32 for the 27B.
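One plausible reading of "highest magnitude × consistency" is: norm of the mean contrastive direction, weighted by the fraction of individual pairs that agree with its sign. A sketch under that assumption:

```python
import numpy as np

def layer_score(pos_acts, neg_acts):
    """Score a candidate layer: magnitude of the mean contrastive
    direction times the fraction of per-pair differences that point
    the same way (one interpretation of magnitude x consistency)."""
    diffs = pos_acts - neg_acts              # (n_pairs, hidden_dim)
    mean_dir = diffs.mean(axis=0)
    magnitude = float(np.linalg.norm(mean_dir))
    consistency = float((diffs @ mean_dir > 0).mean())
    return magnitude * consistency

def select_layer(acts_by_layer):
    """acts_by_layer: {layer_idx: (pos_acts, neg_acts)} -> best layer."""
    return max(acts_by_layer, key=lambda l: layer_score(*acts_by_layer[l]))

# Toy demo: a mid-network layer with clean separation vs a noisy late layer.
rng = np.random.default_rng(1)
hidden = 32
signal = np.zeros(hidden)
signal[3] = 1.0
acts_by_layer = {
    24: (rng.normal(0, 0.2, (100, hidden)) + 2.0 * signal,
         rng.normal(0, 0.2, (100, hidden)) - 2.0 * signal),
    48: (rng.normal(0, 0.2, (100, hidden)),
         rng.normal(0, 0.2, (100, hidden))),
}
```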
Phase 3: Runtime Probe in vLLM
Add a forward-pass hook alongside the existing weight export hook. The computation is trivial — a dot product per layer per token:
signal = residual_stream[layer] @ probe_vector
For 3 signals at 3 layers = 9 dot products per token. Less compute than a single attention head. Expose as sideband alongside token output.
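The sideband bookkeeping is small enough to sketch in full. In the real system `on_layer` would be driven by a vLLM forward hook; here it is called by hand, and the class name and interface are ours:

```python
import numpy as np

class ProbeSideband:
    """Reads evaluative signals off residual-stream activations.

    probes: {signal_name: (layer_idx, unit_probe_vector)}. The hook
    machinery that would call on_layer from inside the forward pass
    is not shown.
    """
    def __init__(self, probes):
        self.probes = probes
        self.signals = {}          # latest per-token value per signal

    def on_layer(self, layer_idx, residual):
        """Called with one layer's residual stream for the current token."""
        for name, (layer, vec) in self.probes.items():
            if layer == layer_idx:
                self.signals[name] = float(residual @ vec)

# Toy demo with orthogonal probe directions at two layers.
hidden = 16
u = np.zeros(hidden)
u[0] = 1.0
e = np.zeros(hidden)
e[1] = 1.0
sideband = ProbeSideband({"uncertainty": (16, u), "error": (24, e)})

# Simulate one token's forward pass touching the probed layers.
sideband.on_layer(16, 0.7 * u + 0.1 * e)
sideband.on_layer(24, 2.0 * e)
```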
Phase 4: Wire to Observe Agent
The observe agent reads the sideband signal. Threshold tuning determines when to trigger reflective mode. Signal strength modulates search depth — mild uncertainty gets a quick check, high uncertainty gets full branching.
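The threshold-to-depth mapping could be as simple as a clamped ramp. The threshold values below are placeholders to be tuned against real probe outputs:

```python
def search_depth(signal, low=0.3, high=0.8, max_depth=4):
    """Map amygdala signal strength to reflective search depth.

    Below `low`: stay in linear mode (depth 0). Above `high`: full
    branching. In between: linear ramp. All constants are hypothetical
    tuning knobs, not measured values.
    """
    if signal < low:
        return 0
    if signal >= high:
        return max_depth
    frac = (signal - low) / (high - low)
    return max(1, round(frac * max_depth))
```

Mild uncertainty then gets a depth-1 or depth-2 check; a strong spike gets the full fan-out.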
Organic Search, Not Alpha-Beta
The reflective mode isn't formal tree search. It's more stochastic and organic:
- Branch at AST-level decision points (tool calls, approach choices), not token-level
- Explore multiple continuations for K steps each
- Summarize what each branch learned — the summaries are the intelligence, not the branches themselves
- Let summaries inform subsequent exploration
- Collapse back to linear mode when resolution is found
The AST gives us structural awareness of decision nodes vs continuation nodes — branch where it matters, not everywhere.
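The branch-summarize-inform loop, with rollout and compression reduced to stand-in callables (every name here is hypothetical):

```python
def organic_search(branch_points, explore, summarize, k_steps=3):
    """Explore each branch for k steps, keep only its summary, and let
    accumulated summaries inform later branches. The branch itself is
    discarded; the summary is what survives.
    """
    lessons = []
    for point in branch_points:
        trajectory = explore(point, k_steps, lessons)   # sees prior lessons
        lessons.append(summarize(trajectory))           # branch pruned here
    return lessons

# Toy stand-ins: explore records its inputs, summarize compresses to a tag.
def explore(point, k, lessons):
    return {"point": point, "steps": k, "context": list(lessons)}

def summarize(traj):
    return f"lesson({traj['point']})"

lessons = organic_search(["approach-A", "approach-B"], explore, summarize)
```

The second branch's `explore` call receives the first branch's lesson in `lessons`, which is the "summaries inform subsequent exploration" step.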
Key Papers
Technique
- Steering Llama 2 via Contrastive Activation Addition — Rimsky et al., ACL 2024. The foundational technique.
- Representation Engineering Survey — Comprehensive overview of the field.
Emotion & Evaluative Signals
- Decoding Emotion in the Deep — Probing on Qwen3 and LLaMA3. Signal peaks mid-network, persists for hundreds of tokens, linearly separable.
- LLaMAs Have Feelings Too — ACL 2025. Linear SVM probes hit ~90% accuracy on sentiment.
- Mechanistic Interpretability of Code Correctness — ICLR 2026. SAEs for error detection. Asymmetric: detects errors better than it confirms correctness.
Uncertainty
- Between the Layers Lies the Truth — Uncertainty from intra-layer representations, pre-generation.
- Probing Hidden States for Calibrated Predictions — Hidden state probes resist alignment training. More robust than logit-based methods.
Tooling
- Anthropic Circuit Tracing — Open-source, works with any open-weights model. For deeper investigation of which features to probe.
- On the Biology of a Large Language Model — Anthropic's findings on internal circuits.
Libraries
- steering-vectors — pip install, works with any HuggingFace model. Best for Phase 1.
- nrimsky/CAA — Original paper implementation. Good reference.
- nnterp — NNsight wrapper, supports Qwen, one-line activation steering.
- nnsight — General-purpose activation interception.
- circuit-tracer — Anthropic's open-source circuit tracing.
- TransformerLens — The OG interpretability library.
- Dialz — ACL 2025 toolkit with pre-built contrastive datasets.
The Bigger Picture
The amygdala is one component of the sensory architecture designed on Feb 17, 2026. The signal landscape (arousal, attention pressure, memory load, mode awareness) uses the same infrastructure — slowly varying float values that modulate cognition below the symbolic level. Each new probe vector is another sense.
With recurrence (application-level looping + reflective nodes in the AST) and the amygdala triggering adaptive depth, a well-trained 27B specialist with external memory could match much larger models on tasks that matter to us.
The pieces exist. The infrastructure is built. The bottleneck is contrastive pairs.