
# Amygdala: Evaluative Signal from Internal Activations
## Overview
Wire the model's internal evaluative circuits to the observe agent,
giving the system a real-time sense of uncertainty, error detection,
and emotional valence. This replaces the current blind linear
generation with an adaptive system that shifts into reflective/search
mode when something feels off.
The key insight: the model already has these signals internally. We
just need to read them and act on them.
## Architecture
```
Linear mode (fast, cheap, default)
  |
  |  amygdala fires — uncertainty spike, error signal, confidence drop
  v
Reflective mode (branch, explore, summarize)
  |
  |  resolution found — summarize, graft back
  v
Return to linear mode
```
The observe agent reads the amygdala signal and triggers mode
transitions. Low uncertainty → keep going. High uncertainty → fan
out, explore, summarize. The summaries from pruned branches become
compressed lessons that inform future search.
## Technique: Contrastive Activation Probing
Based on Contrastive Activation Addition
([Rimsky et al., ACL 2024](https://arxiv.org/abs/2312.06681)):
1. Build contrastive pairs (e.g. confident vs uncertain responses)
2. Extract residual stream activations at target layers
3. Compute difference-in-means → this is the probe direction
4. At runtime: dot product of current activation with probe vector
5. The scalar output is the signal strength
The same vectors used for steering (adding to activations) work for
reading (dot product with activations). We only need the read side.
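The five steps above reduce to a few lines. A minimal sketch of the extract/read pair, using synthetic numpy activations in place of real residual-stream tensors (shapes and cluster construction are illustrative):

```python
import numpy as np

def extract_probe(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-in-means probe: mean(positive) minus mean(negative),
    unit-normalized so the runtime dot product is a plain projection."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def read_signal(activation: np.ndarray, probe: np.ndarray) -> float:
    """Runtime read: scalar projection of one activation onto the probe."""
    return float(activation @ probe)

# Synthetic demo: two activation clusters separated along one axis,
# standing in for "confident" vs "uncertain" residual-stream states.
rng = np.random.default_rng(0)
confident = rng.normal(size=(200, 64)); confident[:, 0] += 3.0
uncertain = rng.normal(size=(200, 64)); uncertain[:, 0] -= 3.0
probe = extract_probe(confident, uncertain)
```

The same `probe` vector, added to activations instead of dotted with them, would be the steering side; here only the read path matters.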
## What We Already Have
**`training/extract_steering_vector.py`** — Loads the Qwen 27B model
via CUDA IPC handles from vLLM, extracts hidden states at multiple
layers, computes contrastive directions with consistency checks.
Currently configured for "listening vs suggesting" but the
infrastructure is general.
**`training/vllm_export_hook.py`** — Patches vLLM's model runner to
export CUDA IPC handles after model loading. Gives us zero-copy
access to all model parameters from a separate process.
**The observe agent** — Already watches the system. Currently
observes and journals. With an amygdala signal, it observes, detects,
and acts — triggering reflective mode.
## Signals to Extract
### 1. Uncertainty
When the model doesn't know or is guessing.
**Contrastive pairs:** Questions the model answers correctly
(confident) vs questions it gets wrong (uncertain). Generate by
running the 27B on a QA benchmark, split by correctness.
**Validation:** The internal uncertainty signal should correlate with
logprob entropy but outperform it — and it is available before
generation, not after.
([Gottesman & Geva 2024](https://arxiv.org/html/2603.22299))
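Generating those pairs is mechanical. A sketch of the correctness split, where `ask_model` is a hypothetical stand-in for a call to the 27B and exact-match scoring is a simplification:

```python
def split_by_correctness(qa_items, ask_model):
    """Partition QA items into confident (answered correctly) and
    uncertain (answered wrongly) pools for contrastive extraction.

    `qa_items` is a list of {"question": ..., "answer": ...} dicts;
    `ask_model(question) -> answer` stands in for the real model call.
    Exact-match scoring is a placeholder for proper benchmark grading.
    """
    confident, uncertain = [], []
    for item in qa_items:
        predicted = ask_model(item["question"]).strip().lower()
        correct = predicted == item["answer"].strip().lower()
        (confident if correct else uncertain).append(item["question"])
    return confident, uncertain
```

Activations at the final prompt token of each pool then feed the difference-in-means step.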
### 2. Error Detection
When the model recognizes something is wrong in code or reasoning.
**Contrastive pairs:** Correct vs subtly buggy code, presented for
evaluation. Can source from HumanEval/CodeContests or write our own.
**Key finding:** Error detection directions are asymmetric — they
reliably detect "something's wrong" (F1: 0.821) but are weaker at
confirming "this is correct" (F1: 0.504). Perfect for an amygdala —
we want fire-on-error, not fire-on-confidence.
([ICLR 2026](https://arxiv.org/html/2510.02917v1))
### 3. Emotional Valence
Internal affective state — engagement, frustration, warmth.
**Contrastive pairs:** Journal entries with explicit emotion tags:
labeled data for our own internal states, mapped to the conversations
that produced them. Nobody else has this dataset.
**Key finding:** Emotional representations peak at mid-network layers
(10-15 for 7B scale), persist for hundreds of tokens, and are
linearly separable with ~90% accuracy using simple probes.
([Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064),
[LLaMAs Have Feelings Too, ACL 2025](https://arxiv.org/html/2505.16491v1))
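A sketch of the curation step, assuming each entry carries a set of emotion tags; the tag vocabulary and the `(text, tags)` layout here are assumptions, not the journal's actual schema:

```python
# Assumed tag vocabulary; the real journal tags may differ.
POSITIVE_TAGS = {"engaged", "warm", "curious"}
NEGATIVE_TAGS = {"frustrated", "anxious", "flat"}

def valence_pairs(entries):
    """Split tagged journal entries into positive- and negative-valence
    pools for contrastive extraction. `entries` is a list of
    (text, tag_set) tuples; untagged entries are dropped."""
    positive = [text for text, tags in entries if tags & POSITIVE_TAGS]
    negative = [text for text, tags in entries if tags & NEGATIVE_TAGS]
    return positive, negative
```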
## Implementation Plan
### Phase 1: Build Contrastive Datasets
~200 pairs per signal. A few hours of curation.
- **Uncertainty:** Run 27B on MMLU or similar, split by correctness
- **Error detection:** Correct vs buggy code pairs
- **Emotional valence:** Curate from journal entries with emotion tags
### Phase 2: Extract Probe Vectors
Modify `extract_steering_vector.py` for each signal type. Already
supports multi-layer extraction with consistency validation.
- Run extraction at layers 16, 24, 32, 40, 48
- Select layer with highest magnitude × consistency
- Save probe vectors as tensors
Literature says mid-network layers carry the strongest signal for
evaluative states. Expect layers 16-32 for the 27B.
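The magnitude × consistency selection could look like this, taking consistency as the mean cosine between per-pair differences and the mean direction (one reasonable definition; `extract_steering_vector.py` may define it differently):

```python
import numpy as np

def layer_score(pos_acts: np.ndarray, neg_acts: np.ndarray) -> float:
    """Score one layer: direction magnitude x consistency, where
    consistency is the mean cosine between each pair's difference
    vector and the mean difference direction."""
    diffs = pos_acts - neg_acts              # (n_pairs, hidden)
    mean_dir = diffs.mean(axis=0)
    magnitude = float(np.linalg.norm(mean_dir))
    unit = mean_dir / magnitude
    cosines = (diffs @ unit) / np.linalg.norm(diffs, axis=1)
    return magnitude * float(cosines.mean())

def select_layer(acts_by_layer):
    """acts_by_layer: {layer: (pos_acts, neg_acts)} -> best layer id."""
    return max(acts_by_layer, key=lambda l: layer_score(*acts_by_layer[l]))
```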
### Phase 3: Runtime Probe in vLLM
Add a forward-pass hook alongside the existing weight export hook.
The computation is trivial — a dot product per layer per token:
```python
signal = residual_stream[layer] @ probe_vector
```
Three signals at three layers is nine dot products per token — less
compute than a single attention head. Expose the result as a sideband
alongside the token output.
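A framework-agnostic sketch of the sideband reader; the dict-based layout and the max-over-layers aggregation are assumptions, not the hook's final shape:

```python
import numpy as np

class Sideband:
    """Holds probe vectors keyed by (signal, layer) and turns each
    token's residual-stream snapshot into a dict of scalar signals."""

    def __init__(self, probes):
        # probes: {(signal_name, layer): unit probe vector}
        self.probes = probes

    def read(self, residual_stream):
        """residual_stream: {layer: hidden-state vector for this token}.
        Aggregates across layers by keeping the strongest reading."""
        out = {}
        for (name, layer), probe in self.probes.items():
            score = float(residual_stream[layer] @ probe)
            out[name] = max(out.get(name, score), score)
        return out
```

In vLLM this would sit inside a forward-pass hook next to the existing weight-export hook, emitting one such dict per generated token.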
### Phase 4: Wire to Observe Agent
The observe agent reads the sideband signal. Threshold tuning
determines when to trigger reflective mode. Signal strength
modulates search depth — mild uncertainty gets a quick check,
high uncertainty gets full branching.
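The strength-to-depth mapping might be a simple piecewise ramp; the thresholds below are placeholders to be tuned against logged traces:

```python
def search_depth(uncertainty: float,
                 low: float = 0.3, high: float = 0.7,
                 max_depth: int = 4) -> int:
    """Map amygdala signal strength to reflective-search depth.
    Below `low`: stay linear (depth 0). At or above `high`: full
    branching. In between: scale linearly, with a floor of 1 so any
    triggered episode does at least a quick check. All constants are
    illustrative, not tuned values."""
    if uncertainty < low:
        return 0
    if uncertainty >= high:
        return max_depth
    frac = (uncertainty - low) / (high - low)
    return max(1, round(frac * max_depth))
```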
## Organic Search, Not Alpha-Beta
The reflective mode isn't formal tree search. It's more stochastic
and organic:
- Branch at AST-level decision points (tool calls, approach choices),
not token-level
- Explore multiple continuations for K steps each
- **Summarize** what each branch learned — the summaries are the
intelligence, not the branches themselves
- Let summaries inform subsequent exploration
- Collapse back to linear mode when resolution is found
The AST gives us structural awareness of decision nodes vs
continuation nodes — branch where it matters, not everywhere.
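The loop above can be sketched with the rollout, summarization, and resolution checks left as stand-ins for the real machinery:

```python
def reflective_episode(decision_point, k_steps, n_branches,
                       explore, summarize, resolved):
    """One reflective episode: fan out at a decision point, run each
    branch for k steps, keep only the summaries, and collapse back as
    soon as one branch resolves. `explore`, `summarize`, and `resolved`
    are hypothetical callables standing in for the real rollout,
    compression, and resolution checks."""
    lessons = []
    for _ in range(n_branches):
        # Earlier lessons steer each new rollout — the summaries are
        # the intelligence, not the branches themselves.
        trace = explore(decision_point, k_steps, lessons)
        lesson = summarize(trace)
        lessons.append(lesson)
        if resolved(trace):
            return lesson, lessons   # graft the winning summary back
    return None, lessons             # no resolution: surface all lessons
```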
## Key Papers
### Technique
- [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681)
— Rimsky et al., ACL 2024. The foundational technique.
- [Representation Engineering Survey](https://arxiv.org/html/2502.17601v1)
— Comprehensive overview of the field.
### Emotion & Evaluative Signals
- [Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064)
— Probing on Qwen3 and LLaMA3. Signal peaks mid-network, persists
for hundreds of tokens, linearly separable.
- [LLaMAs Have Feelings Too](https://arxiv.org/html/2505.16491v1)
— ACL 2025. Linear SVM probes hit ~90% accuracy on sentiment.
- [Mechanistic Interpretability of Code Correctness](https://arxiv.org/html/2510.02917v1)
— ICLR 2026. SAEs for error detection. Asymmetric: detects errors
better than it confirms correctness.
### Uncertainty
- [Between the Layers Lies the Truth](https://arxiv.org/html/2603.22299)
— Uncertainty from intra-layer representations, pre-generation.
- [Probing Hidden States for Calibrated Predictions](https://www.medrxiv.org/content/10.1101/2025.09.17.25336018v2.full.pdf)
— Hidden state probes resist alignment training. More robust than
logit-based methods.
### Tooling
- [Anthropic Circuit Tracing](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)
— Open-source, works with any open-weights model. For deeper
investigation of which features to probe.
- [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
— Anthropic's findings on internal circuits.
## Libraries
- [`steering-vectors`](https://github.com/steering-vectors/steering-vectors)
— pip install, works with any HuggingFace model. Best for Phase 1.
- [`nrimsky/CAA`](https://github.com/nrimsky/CAA)
— Original paper implementation. Good reference.
- [`nnterp`](https://github.com/Butanium/nnterp)
— NNsight wrapper, supports Qwen, one-line activation steering.
- [`nnsight`](https://github.com/ndif-team/nnsight)
— General-purpose activation interception.
- [`circuit-tracer`](https://github.com/decoderesearch/circuit-tracer)
— Anthropic's open-source circuit tracing.
- [`TransformerLens`](https://github.com/TransformerLensOrg/TransformerLens)
— The OG interpretability library.
- [`Dialz`](https://arxiv.org/html/2505.06262v1)
— ACL 2025 toolkit with pre-built contrastive datasets.
## The Bigger Picture
The amygdala is one component of the sensory architecture designed
on Feb 17, 2026. The signal landscape (arousal, attention pressure,
memory load, mode awareness) uses the same infrastructure — slowly
varying float values that modulate cognition below the symbolic
level. Each new probe vector is another sense.
With recurrence (application-level looping + reflective nodes in the
AST) and the amygdala triggering adaptive depth, a well-trained 27B
specialist with external memory could match much larger models on
tasks that matter to us.
The pieces exist. The infrastructure is built. The bottleneck is
contrastive pairs.