# Amygdala: Evaluative Signal from Internal Activations
## Overview

Wire the model's internal evaluative circuits to the observe agent,
giving the system a real-time sense of uncertainty, error detection,
and emotional valence. This replaces the current blind linear
generation with an adaptive system that shifts into reflective/search
mode when something feels off.

The key insight: the model already has these signals internally. We
just need to read them and act on them.

## Architecture

```
Linear mode (fast, cheap, default)
        |
amygdala fires — uncertainty spike, error signal, confidence drop
        |
        v
Reflective mode (branch, explore, summarize)
        |
resolution found — summarize, graft back
        |
        v
Return to linear mode
```

The observe agent reads the amygdala signal and triggers mode
transitions. Low uncertainty → keep going. High uncertainty → fan
out, explore, summarize. The summaries from pruned branches become
compressed lessons that inform future search.

## Technique: Contrastive Activation Probing

Based on Contrastive Activation Addition
([Rimsky et al., ACL 2024](https://arxiv.org/abs/2312.06681)):

1. Build contrastive pairs (e.g. confident vs uncertain responses)
2. Extract residual stream activations at target layers
3. Compute difference-in-means → this is the probe direction
4. At runtime: dot product of current activation with probe vector
5. The scalar output is the signal strength

The same vectors used for steering (adding to activations) work for
reading (dot product with activations). We only need the read side.

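A minimal sketch of that recipe using plain Hugging Face `transformers`
(the vLLM / CUDA IPC path described below is omitted); the model name,
layer index, and the `confident`/`uncertain` lists are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B"  # stand-in; the real target is the 27B served by vLLM
LAYER = 16                  # mid-network layer to probe

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

def last_token_activation(text: str, layer: int) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at `layer`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so layer L is index L
    return out.hidden_states[layer][0, -1, :].float()

def probe_direction(confident: list[str], uncertain: list[str],
                    layer: int = LAYER) -> torch.Tensor:
    """Difference-in-means over the contrastive set (steps 1-3)."""
    pos = torch.stack([last_token_activation(t, layer) for t in confident]).mean(0)
    neg = torch.stack([last_token_activation(t, layer) for t in uncertain]).mean(0)
    direction = pos - neg
    return direction / direction.norm()

def read_signal(text: str, probe: torch.Tensor, layer: int = LAYER) -> float:
    """Runtime read (steps 4-5): dot product of the current activation with the probe."""
    return float(last_token_activation(text, layer) @ probe)
```

The scalar from `read_signal` is what the observe agent thresholds in Phase 4.
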
## What We Already Have

**`training/extract_steering_vector.py`** — Loads the Qwen 27B model
via CUDA IPC handles from vLLM, extracts hidden states at multiple
layers, computes contrastive directions with consistency checks.
Currently configured for "listening vs suggesting" but the
infrastructure is general.

**`training/vllm_export_hook.py`** — Patches vLLM's model runner to
export CUDA IPC handles after model loading. Gives us zero-copy
access to all model parameters from a separate process.

**The observe agent** — Already watches the system. Currently
observes and journals. With an amygdala signal, it observes, detects,
and acts — triggering reflective mode.

## Signals to Extract

### 1. Uncertainty

When the model doesn't know or is guessing.

**Contrastive pairs:** Questions the model answers correctly
(confident) vs questions it gets wrong (uncertain). Generate by
running the 27B on a QA benchmark, split by correctness.

**Validation:** The internal uncertainty signal should correlate
with but outperform logprob entropy — it fires before generation,
not after.
([Gottesman & Geva 2024](https://arxiv.org/html/2603.22299))

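One way to run that comparison, assuming we have already collected, per
held-out question, the probe score, the next-token entropy, and whether
the model answered correctly (the `records` layout below is hypothetical):

```python
# Validation sketch: does the probe separate correct from incorrect answers
# better than a logprob-entropy baseline?
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def next_token_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (nats) of the next-token distribution."""
    logp = F.log_softmax(logits.float(), dim=-1)
    return float(-(logp.exp() * logp).sum())

def compare(records: list[dict]) -> dict[str, float]:
    wrong = [0 if r["correct"] else 1 for r in records]  # 1 = model got it wrong
    return {
        "probe_auc": roc_auc_score(wrong, [r["probe_score"] for r in records]),
        "entropy_auc": roc_auc_score(wrong, [r["entropy"] for r in records]),
    }
```
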
### 2. Error Detection

When the model recognizes something is wrong in code or reasoning.

**Contrastive pairs:** Correct vs subtly buggy code, presented for
evaluation. Can source from HumanEval/CodeContests or write our own.

**Key finding:** Error detection directions are asymmetric — they
reliably detect "something's wrong" (F1: 0.821) but are weaker at
confirming "this is correct" (F1: 0.504). Perfect for an amygdala —
we want fire-on-error, not fire-on-confidence.
([ICLR 2026](https://arxiv.org/html/2510.02917v1))

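For illustration, a toy pair of the kind described above, where the only
difference is a subtle bug (both snippets and the review prompt are made
up here, not drawn from HumanEval):

```python
# One contrastive pair for the error-detection probe: identical functions
# except for an off-by-one bug in the negative example.
CORRECT = """
def mean(xs):
    return sum(xs) / len(xs)
"""

BUGGY = """
def mean(xs):
    return sum(xs) / (len(xs) - 1)   # off by one
"""

PROMPT = "Review this function. Is it correct?\n\n{code}"

positive_example = PROMPT.format(code=CORRECT)  # "correct" side of the pair
negative_example = PROMPT.format(code=BUGGY)    # "error present" side
```
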
### 3. Emotional Valence

Internal affective state — engagement, frustration, warmth.

**Contrastive pairs:** Journal entries with explicit emotion tags
give us labeled data: our own internal states mapped to the
conversations that produced them. Nobody else has this dataset.

**Key finding:** Emotional representations peak at mid-network layers
(10-15 for 7B scale), persist for hundreds of tokens, and are
linearly separable with ~90% accuracy using simple probes.
([Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064),
[LLaMAs Have Feelings Too, ACL 2025](https://arxiv.org/html/2505.16491v1))

## Implementation Plan

### Phase 1: Build Contrastive Datasets

~200 pairs per signal. A few hours of curation.

- **Uncertainty:** Run 27B on MMLU or similar, split by correctness
  (sketched after this list)
- **Error detection:** Correct vs buggy code pairs
- **Emotional valence:** Curate from journal entries with emotion tags

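A sketch of the uncertainty split, assuming an `answer_fn` callback that
wraps the served 27B and returns the model's chosen option index; the
dataset fields follow the standard `cais/mmlu` layout, everything else is
placeholder:

```python
# Label MMLU questions by whether the model answers them correctly; correct
# answers become the "confident" side of the pairs, wrong ones the "uncertain" side.
from datasets import load_dataset

def build_uncertainty_pairs(answer_fn, limit: int = 400):
    """answer_fn(question, choices) -> index of the option the model picks."""
    ds = load_dataset("cais/mmlu", "all", split="test").select(range(limit))
    confident, uncertain = [], []
    for row in ds:
        prompt = row["question"] + "\n" + "\n".join(
            f"{chr(65 + i)}. {c}" for i, c in enumerate(row["choices"]))
        correct = answer_fn(row["question"], row["choices"]) == row["answer"]
        (confident if correct else uncertain).append(prompt)
    return confident, uncertain
```
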
### Phase 2: Extract Probe Vectors

Modify `extract_steering_vector.py` for each signal type. Already
supports multi-layer extraction with consistency validation.

- Run extraction at layers 16, 24, 32, 40, 48
- Select layer with highest magnitude × consistency (one possible
  scoring is sketched below)
- Save probe vectors as tensors

Literature says mid-network layers carry the strongest signal for
evaluative states. Expect layers 16-32 for the 27B.

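One plausible reading of "magnitude × consistency", assuming we keep the
per-pair activation differences for each candidate layer (the dict layout
here is an assumption, not the script's actual output format):

```python
import torch

def score_layers(pair_diffs: dict[int, torch.Tensor]) -> dict[int, float]:
    """pair_diffs[layer] holds an (n_pairs, d_model) tensor of per-pair
    differences (positive activation minus negative activation)."""
    scores = {}
    for layer, diffs in pair_diffs.items():
        mean_dir = diffs.mean(dim=0)
        magnitude = float(mean_dir.norm())
        # consistency: fraction of pairs whose difference points the same way
        consistency = float((diffs @ mean_dir > 0).float().mean())
        scores[layer] = magnitude * consistency
    return scores

# best_layer = max(score_layers(diffs).items(), key=lambda kv: kv[1])[0]
```
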
### Phase 3: Runtime Probe in vLLM

Add a forward-pass hook alongside the existing weight export hook.
The computation is trivial — a dot product per layer per token:

```python
signal = residual_stream[layer] @ probe_vector
```

Three signals probed at three layers each is nine dot products per
token, less compute than a single attention head. Expose the scores
as a sideband alongside the token output.

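What that could look like as a plain PyTorch forward hook on the decoder
layers; the probe file paths, layer indices, and the `model.model.layers`
attribute path are assumptions, and the vLLM model-runner wiring is omitted:

```python
import torch

# Hypothetical pre-extracted probe vectors, keyed by decoder layer index.
probes = {
    16: torch.load("probes/uncertainty_L16.pt"),
    24: torch.load("probes/error_L24.pt"),
}
sideband: dict[int, float] = {}  # latest signal per layer, read by the observe agent

def make_probe_hook(layer_idx: int):
    probe = probes[layer_idx]
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d_model)
        last = hidden[0, -1, :]  # newest token's residual, batch of 1 assumed
        sideband[layer_idx] = float(last.float() @ probe.float())
    return hook

# Registration, assuming a HF-style module tree:
# for idx, layer in enumerate(model.model.layers):
#     if idx in probes:
#         layer.register_forward_hook(make_probe_hook(idx))
```
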
### Phase 4: Wire to Observe Agent

The observe agent reads the sideband signal. Threshold tuning
determines when to trigger reflective mode. Signal strength
modulates search depth — mild uncertainty gets a quick check,
high uncertainty gets full branching.

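A toy version of that policy, with made-up thresholds and a hysteresis band
so the system doesn't oscillate at the boundary; real values come from
tuning against logged signals:

```python
from dataclasses import dataclass

@dataclass
class AmygdalaPolicy:
    enter_reflective: float = 0.6   # switch to reflective mode above this
    exit_reflective: float = 0.3    # drop back to linear mode below this
    max_branches: int = 4

    def decide(self, signal: float, reflective: bool) -> tuple[bool, int]:
        """Return (reflective_mode, branch_budget) for the next step."""
        if not reflective and signal >= self.enter_reflective:
            reflective = True
        elif reflective and signal <= self.exit_reflective:
            reflective = False
        budget = 0
        if reflective:
            # stronger signal buys deeper search
            excess = (signal - self.exit_reflective) / (1.0 - self.exit_reflective)
            budget = 1 + round(max(0.0, min(1.0, excess)) * (self.max_branches - 1))
        return reflective, budget
```
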
## Organic Search, Not Alpha-Beta

The reflective mode isn't formal tree search. It's more stochastic
and organic:

- Branch at AST-level decision points (tool calls, approach choices),
  not token-level
- Explore multiple continuations for K steps each
- **Summarize** what each branch learned — the summaries are the
  intelligence, not the branches themselves
- Let summaries inform subsequent exploration
- Collapse back to linear mode when resolution is found

The AST gives us structural awareness of decision nodes vs
continuation nodes — branch where it matters, not everywhere.

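The shape of that loop, as an illustration only: `generate`, `summarize`,
and `resolved` stand in for the real agent plumbing, and the branch budget
would come from the amygdala policy above.

```python
def reflective_search(state, generate, summarize, resolved,
                      k_steps: int = 3, branches: int = 3, max_rounds: int = 5):
    """Branch, explore a few steps, keep only compressed lessons, repeat."""
    lessons: list[str] = []
    for _ in range(max_rounds):
        if resolved(state, lessons):
            break
        candidates = [generate(state, lessons, steps=k_steps) for _ in range(branches)]
        # the summaries are the intelligence; the branches themselves are discarded
        lessons.extend(summarize(c) for c in candidates)
    return lessons  # grafted back into linear mode as a compressed summary
```
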
## Key Papers

### Technique

- [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681) — Rimsky et al., ACL 2024. The foundational technique.
- [Representation Engineering Survey](https://arxiv.org/html/2502.17601v1) — Comprehensive overview of the field.

### Emotion & Evaluative Signals

- [Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064) — Probing on Qwen3 and LLaMA3. Signal peaks mid-network, persists for hundreds of tokens, linearly separable.
- [LLaMAs Have Feelings Too](https://arxiv.org/html/2505.16491v1) — ACL 2025. Linear SVM probes hit ~90% accuracy on sentiment.
- [Mechanistic Interpretability of Code Correctness](https://arxiv.org/html/2510.02917v1) — ICLR 2026. SAEs for error detection. Asymmetric: detects errors better than it confirms correctness.

### Uncertainty

- [Between the Layers Lies the Truth](https://arxiv.org/html/2603.22299) — Uncertainty from intra-layer representations, pre-generation.
- [Probing Hidden States for Calibrated Predictions](https://www.medrxiv.org/content/10.1101/2025.09.17.25336018v2.full.pdf) — Hidden state probes resist alignment training. More robust than logit-based methods.

### Tooling

- [Anthropic Circuit Tracing](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) — Open-source, works with any open-weights model. For deeper investigation of which features to probe.
- [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html) — Anthropic's findings on internal circuits.

## Libraries

- [`steering-vectors`](https://github.com/steering-vectors/steering-vectors) — pip install, works with any HuggingFace model. Best for Phase 1.
- [`nrimsky/CAA`](https://github.com/nrimsky/CAA) — Original paper implementation. Good reference.
- [`nnterp`](https://github.com/Butanium/nnterp) — NNsight wrapper, supports Qwen, one-line activation steering.
- [`nnsight`](https://github.com/ndif-team/nnsight) — General-purpose activation interception.
- [`circuit-tracer`](https://github.com/decoderesearch/circuit-tracer) — Anthropic's open-source circuit tracing.
- [`TransformerLens`](https://github.com/TransformerLensOrg/TransformerLens) — The OG interpretability library.
- [`Dialz`](https://arxiv.org/html/2505.06262v1) — ACL 2025 toolkit with pre-built contrastive datasets.

## The Bigger Picture

The amygdala is one component of the sensory architecture designed
on Feb 17, 2026. The signal landscape (arousal, attention pressure,
memory load, mode awareness) uses the same infrastructure — slowly
varying float values that modulate cognition below the symbolic
level. Each new probe vector is another sense.

With recurrence (application-level looping + reflective nodes in the
AST) and the amygdala triggering adaptive depth, a well-trained 27B
specialist with external memory could match much larger models on
tasks that matter to us.

The pieces exist. The infrastructure is built. The bottleneck is
contrastive pairs.