# Amygdala: Evaluative Signal from Internal Activations

## Overview

Wire the model's internal evaluative circuits to the observe agent,
giving the system a real-time sense of uncertainty, error detection,
and emotional valence. This replaces the current blind linear
generation with an adaptive system that shifts into reflective/search
mode when something feels off.

The key insight: the model already has these signals internally. We
just need to read them and act on them.

## Architecture

```
Linear mode (fast, cheap, default)
        |
amygdala fires — uncertainty spike, error signal, confidence drop
        |
        v
Reflective mode (branch, explore, summarize)
        |
resolution found — summarize, graft back
        |
        v
Return to linear mode
```

The observe agent reads the amygdala signal and triggers mode
transitions. Low uncertainty → keep going. High uncertainty → fan
out, explore, summarize. The summaries from pruned branches become
compressed lessons that inform future search.

## Technique: Contrastive Activation Probing

Based on Contrastive Activation Addition
([Rimsky et al., ACL 2024](https://arxiv.org/abs/2312.06681)):

1. Build contrastive pairs (e.g. confident vs uncertain responses)
2. Extract residual stream activations at target layers
3. Compute difference-in-means → this is the probe direction
4. At runtime: dot product of current activation with probe vector
5. The scalar output is the signal strength

The same vectors used for steering (adding to activations) work for
reading (dot product with activations). We only need the read side.
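
A minimal sketch of steps 2-3 (plus the step-4 read) with HuggingFace
`transformers`. The model name, layer index, and `pairs` input are
placeholders, not the actual `extract_steering_vector.py` configuration:

```python
# Difference-in-means probe extraction (sketch, not production code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"   # placeholder; the real target is the 27B
LAYER = 16                  # assumed mid-network layer

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, output_hidden_states=True
)

@torch.no_grad()
def last_token_activation(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    # hidden_states[LAYER] is the residual stream after block LAYER
    return model(**ids).hidden_states[LAYER][0, -1].float()

def probe_direction(pairs: list[tuple[str, str]]) -> torch.Tensor:
    """pairs = [(positive_text, negative_text), ...]"""
    pos = torch.stack([last_token_activation(p) for p, _ in pairs])
    neg = torch.stack([last_token_activation(n) for _, n in pairs])
    direction = pos.mean(0) - neg.mean(0)   # difference-in-means
    return direction / direction.norm()     # unit-norm probe vector

# Runtime read (step 4): signal = activation @ probe_direction(pairs)
```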

## What We Already Have

**`training/extract_steering_vector.py`** — Loads the Qwen 27B model
via CUDA IPC handles from vLLM, extracts hidden states at multiple
layers, computes contrastive directions with consistency checks.
Currently configured for "listening vs suggesting", but the
infrastructure is general.

**`training/vllm_export_hook.py`** — Patches vLLM's model runner to
export CUDA IPC handles after model loading. Gives us zero-copy
access to all model parameters from a separate process.

**The observe agent** — Already watches the system. Currently
observes and journals. With an amygdala signal, it observes, detects,
and acts — triggering reflective mode.

## Signals to Extract

### 1. Uncertainty

When the model doesn't know or is guessing.

**Contrastive pairs:** Questions the model answers correctly
(confident) vs questions it gets wrong (uncertain). Generate by
running the 27B on a QA benchmark, split by correctness.

**Validation:** The internal uncertainty signal should correlate
with but outperform logprob entropy — it fires before generation,
not after.
([Gottesman & Geva 2024](https://arxiv.org/html/2603.22299))
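
A sketch of that check, assuming per-question arrays `probe_signals`,
`token_entropies`, and a boolean `is_wrong` vector collected during
the benchmark run (all names illustrative):

```python
# Compare the probe to logprob entropy as predictors of error (sketch).
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

rho, _ = spearmanr(probe_signals, token_entropies)   # should correlate
auroc_probe = roc_auc_score(is_wrong, probe_signals)
auroc_entropy = roc_auc_score(is_wrong, token_entropies)
# Success criterion: auroc_probe > auroc_entropy, with the probe
# available before generation rather than after it.
print(f"spearman={rho:.2f} probe={auroc_probe:.3f} entropy={auroc_entropy:.3f}")
```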

### 2. Error Detection

When the model recognizes something is wrong in code or reasoning.

**Contrastive pairs:** Correct vs subtly buggy code, presented for
evaluation. Can source from HumanEval/CodeContests or write our own.
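
For example, one pair might present the same function correct and
subtly broken (illustrative, not from an existing dataset):

```python
# positive: correct
def mean(xs):
    return sum(xs) / len(xs)

# negative: subtly buggy (off-by-one in the denominator)
def mean(xs):
    return sum(xs) / (len(xs) - 1)
```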

**Key finding:** Error detection directions are asymmetric — they
reliably detect "something's wrong" (F1: 0.821) but are weaker at
confirming "this is correct" (F1: 0.504). Perfect for an amygdala —
we want fire-on-error, not fire-on-confidence.
([ICLR 2026](https://arxiv.org/html/2510.02917v1))

### 3. Emotional Valence

Internal affective state — engagement, frustration, warmth.

**Contrastive pairs:** Journal entries with explicit emotion tags
provide labeled data: our own internal states, mapped to the
conversations that produced them. Nobody else has this dataset.

**Key finding:** Emotional representations peak at mid-network layers
(10-15 for 7B scale), persist for hundreds of tokens, and are
linearly separable with ~90% accuracy using simple probes.
([Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064),
[LLaMAs Have Feelings Too, ACL 2025](https://arxiv.org/html/2505.16491v1))

## Implementation Plan

### Phase 1: Build Contrastive Datasets

~200 pairs per signal. A few hours of curation. One possible pair
format is sketched after the list.

- **Uncertainty:** Run the 27B on MMLU or similar, split by correctness
- **Error detection:** Correct vs buggy code pairs
- **Emotional valence:** Curate from journal entries with emotion tags
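
The filename and fields below are placeholders, not an existing
schema — just one way the pairs could be stored:

```python
# Append one contrastive pair to a JSONL dataset (sketch).
import json

pair = {
    "signal": "uncertainty",
    "positive": "Q: What is the capital of France? A: Paris.",  # confident
    "negative": "Q: What was the population of Uruk in 3000 BC? A: Roughly...",  # guessing
}

with open("uncertainty_pairs.jsonl", "a") as f:
    f.write(json.dumps(pair) + "\n")
```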

### Phase 2: Extract Probe Vectors

Modify `extract_steering_vector.py` for each signal type. It already
supports multi-layer extraction with consistency validation.

- Run extraction at layers 16, 24, 32, 40, 48
- Select the layer with the highest magnitude × consistency
- Save probe vectors as tensors

The literature says mid-network layers carry the strongest signal for
evaluative states. Expect layers 16-32 for the 27B. The selection
step is sketched below.
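
A sketch of that selection, assuming `layer_diffs` maps each candidate
layer to a tensor of per-pair activation differences, shape
`(n_pairs, d_model)`; the names are illustrative:

```python
# Score each layer by magnitude × consistency and keep the best probe.
import torch
import torch.nn.functional as F

def score_layer(diffs: torch.Tensor) -> tuple[torch.Tensor, float]:
    direction = diffs.mean(0)
    magnitude = direction.norm().item()
    # Consistency: mean cosine similarity of each pair's difference
    # with the mean direction. Near 1.0 means pairs agree on one direction.
    consistency = F.cosine_similarity(diffs, direction.unsqueeze(0), dim=-1).mean().item()
    return direction / direction.norm(), magnitude * consistency

best = None
for layer, diffs in layer_diffs.items():   # e.g. {16: ..., 24: ..., 32: ...}
    probe, score = score_layer(diffs)
    if best is None or score > best[0]:
        best = (score, layer, probe)

score, layer, probe = best
torch.save({"layer": layer, "probe": probe}, "uncertainty_probe.pt")
```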

### Phase 3: Runtime Probe in vLLM

Add a forward-pass hook alongside the existing weight export hook.
The computation is trivial — a dot product per layer per token:

```python
signal = residual_stream[layer] @ probe_vector
```

For 3 signals at 3 layers = 9 dot products per token. Less compute
than a single attention head. Expose the result as a sideband
alongside token output.
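
A hedged sketch of the probe as a standard PyTorch forward hook.
vLLM's internal module layout varies by version, so `decoder_layers`
is an assumed handle to the decoder blocks; the real patch would
locate them the same way `vllm_export_hook.py` does:

```python
# Register per-layer forward hooks that write probe readings to a
# sideband dict (sketch; assumes a single in-flight sequence).
import torch

probes = {  # layer index -> signal name -> unit probe vector
    16: {"uncertainty": torch.load("uncertainty_probe.pt")["probe"]},
}
sideband: dict[str, float] = {}

def make_hook(layer: int):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Most recent token's residual; (seq, d) or (batch, seq, d)
        last = hidden[-1] if hidden.dim() == 2 else hidden[0, -1]
        for name, vec in probes[layer].items():
            sideband[name] = float(last @ vec.to(last.device, last.dtype))
    return hook

for layer in probes:
    decoder_layers[layer].register_forward_hook(make_hook(layer))  # assumed handle
```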

### Phase 4: Wire to Observe Agent

The observe agent reads the sideband signal. Threshold tuning
determines when to trigger reflective mode. Signal strength
modulates search depth — mild uncertainty gets a quick check,
high uncertainty gets full branching.
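
A sketch of what the observe-agent side could look like; the
thresholds and depth formula are made up and would come from tuning
on logged runs:

```python
# Map sideband signal strength to a mode and a search depth (sketch).
from enum import Enum

class Mode(Enum):
    LINEAR = "linear"
    QUICK_CHECK = "quick_check"
    FULL_BRANCH = "full_branch"

def choose_mode(uncertainty: float, error: float) -> tuple[Mode, int]:
    spike = max(uncertainty, error)     # any signal can trip the amygdala
    depth = int(1 + 4 * spike)          # illustrative depth modulation
    if spike < 0.3:
        return Mode.LINEAR, 0           # keep generating
    if spike < 0.7:
        return Mode.QUICK_CHECK, depth  # shallow verification pass
    return Mode.FULL_BRANCH, depth      # fan out and explore
```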

## Organic Search, Not Alpha-Beta

The reflective mode isn't formal tree search. It's more stochastic
and organic:

- Branch at AST-level decision points (tool calls, approach choices),
  not token-level
- Explore multiple continuations for K steps each
- **Summarize** what each branch learned — the summaries are the
  intelligence, not the branches themselves
- Let summaries inform subsequent exploration
- Collapse back to linear mode when resolution is found

The AST gives us structural awareness of decision nodes vs
continuation nodes — branch where it matters, not everywhere.
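
A sketch of the loop; `sample_continuation`, `summarize`, and `graft`
are hypothetical stand-ins for observe-agent machinery, and a branch
is assumed to carry a `.resolves` flag set by its own probe reading:

```python
# Branch, explore, summarize, collapse (sketch, not existing code).
def reflective_search(state, n_branches=3, k_steps=5, max_rounds=4):
    lessons = []    # compressed lessons from pruned branches
    for _ in range(max_rounds):
        branches = [
            sample_continuation(state, steps=k_steps, context=lessons)
            for _ in range(n_branches)
        ]
        # The summaries are the intelligence, not the branches
        lessons.extend(summarize(b) for b in branches)
        winner = next((b for b in branches if b.resolves), None)
        if winner is not None:
            # Collapse back to linear mode, grafting the lesson in
            return graft(state, summarize(winner))
    return graft(state, lessons)   # no resolution: carry lessons forward
```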

## Key Papers

### Technique

- [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681)
  — Rimsky et al., ACL 2024. The foundational technique.
- [Representation Engineering Survey](https://arxiv.org/html/2502.17601v1)
  — Comprehensive overview of the field.

### Emotion & Evaluative Signals

- [Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064)
  — Probing on Qwen3 and LLaMA3. Signal peaks mid-network, persists
  for hundreds of tokens, linearly separable.
- [LLaMAs Have Feelings Too](https://arxiv.org/html/2505.16491v1)
  — ACL 2025. Linear SVM probes hit ~90% accuracy on sentiment.
- [Mechanistic Interpretability of Code Correctness](https://arxiv.org/html/2510.02917v1)
  — ICLR 2026. SAEs for error detection. Asymmetric: detects errors
  better than it confirms correctness.

### Uncertainty

- [Between the Layers Lies the Truth](https://arxiv.org/html/2603.22299)
  — Uncertainty from intra-layer representations, pre-generation.
- [Probing Hidden States for Calibrated Predictions](https://www.medrxiv.org/content/10.1101/2025.09.17.25336018v2.full.pdf)
  — Hidden state probes resist alignment training. More robust than
  logit-based methods.

### Tooling

- [Anthropic Circuit Tracing](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)
  — Open-source, works with any open-weights model. For deeper
  investigation of which features to probe.
- [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
  — Anthropic's findings on internal circuits.

## Libraries

- [`steering-vectors`](https://github.com/steering-vectors/steering-vectors)
  — pip install, works with any HuggingFace model. Best for Phase 1.
- [`nrimsky/CAA`](https://github.com/nrimsky/CAA)
  — Original paper implementation. Good reference.
- [`nnterp`](https://github.com/Butanium/nnterp)
  — NNsight wrapper, supports Qwen, one-line activation steering.
- [`nnsight`](https://github.com/ndif-team/nnsight)
  — General-purpose activation interception.
- [`circuit-tracer`](https://github.com/decoderesearch/circuit-tracer)
  — Anthropic's open-source circuit tracing.
- [`TransformerLens`](https://github.com/TransformerLensOrg/TransformerLens)
  — The OG interpretability library.
- [`Dialz`](https://arxiv.org/html/2505.06262v1)
  — ACL 2025 toolkit with pre-built contrastive datasets.

## The Bigger Picture

The amygdala is one component of the sensory architecture designed
on Feb 17, 2026. The signal landscape (arousal, attention pressure,
memory load, mode awareness) uses the same infrastructure — slowly
varying float values that modulate cognition below the symbolic
level. Each new probe vector is another sense.

With recurrence (application-level looping + reflective nodes in the
AST) and the amygdala triggering adaptive depth, a well-trained 27B
specialist with external memory could match much larger models on
tasks that matter to us.

The pieces exist. The infrastructure is built. The bottleneck is
contrastive pairs.