diff --git a/doc/amygdala-design.md b/doc/amygdala-design.md
new file mode 100644
index 0000000..791e152
--- /dev/null
+++ b/doc/amygdala-design.md
@@ -0,0 +1,232 @@
+# Amygdala: Evaluative Signal from Internal Activations
+
+## Overview
+
+Wire the model's internal evaluative circuits to the observe agent,
+giving the system a real-time sense of uncertainty, error detection,
+and emotional valence. This replaces the current blind linear
+generation with an adaptive system that shifts into reflective/search
+mode when something feels off.
+
+The key insight: the model already has these signals internally. We
+just need to read them and act on them.
+
+## Architecture
+
+```
+Linear mode (fast, cheap, default)
+    |
+    amygdala fires — uncertainty spike, error signal, confidence drop
+    |
+    v
+Reflective mode (branch, explore, summarize)
+    |
+    resolution found — summarize, graft back
+    |
+    v
+Return to linear mode
+```
+
+The observe agent reads the amygdala signal and triggers mode
+transitions. Low uncertainty → keep going. High uncertainty → fan
+out, explore, summarize. The summaries from pruned branches become
+compressed lessons that inform future search.
+
+## Technique: Contrastive Activation Probing
+
+Based on Contrastive Activation Addition
+([Rimsky et al., ACL 2024](https://arxiv.org/abs/2312.06681)):
+
+1. Build contrastive pairs (e.g. confident vs uncertain responses)
+2. Extract residual stream activations at target layers
+3. Compute difference-in-means → this is the probe direction
+4. At runtime: dot product of current activation with probe vector
+5. The scalar output is the signal strength
+
+The same vectors used for steering (adding to activations) work for
+reading (dot product with activations). We only need the read side.
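+
+A minimal sketch of steps 3-4, assuming the per-layer activations for
+both sides of each pair have already been collected into tensors (the
+helper names and the toy 5120-dim shapes are illustrative, not the
+existing `extract_steering_vector.py` API):
+
+```python
+import torch
+
+
+def compute_probe_vector(pos_acts: torch.Tensor,
+                         neg_acts: torch.Tensor) -> torch.Tensor:
+    """Difference-in-means probe for one layer.
+
+    pos_acts / neg_acts: [n_pairs, hidden_dim] residual-stream
+    activations for the two sides of the contrast (e.g. confident
+    vs uncertain completions).
+    """
+    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
+    return direction / direction.norm()  # unit norm, comparable across layers
+
+
+def read_signal(activation: torch.Tensor, probe: torch.Tensor) -> float:
+    """Runtime read side: one dot product per token, no steering applied."""
+    return float(activation @ probe)
+
+
+# Toy demo with random data standing in for real activations.
+hidden_dim = 5120
+pos = torch.randn(200, hidden_dim) + 0.3  # "confident" side
+neg = torch.randn(200, hidden_dim)        # "uncertain" side
+probe = compute_probe_vector(pos, neg)
+print(read_signal(torch.randn(hidden_dim), probe))
+```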
+
+## What We Already Have
+
+**`training/extract_steering_vector.py`** — Loads the Qwen 27B model
+via CUDA IPC handles from vLLM, extracts hidden states at multiple
+layers, computes contrastive directions with consistency checks.
+Currently configured for "listening vs suggesting" but the
+infrastructure is general.
+
+**`training/vllm_export_hook.py`** — Patches vLLM's model runner to
+export CUDA IPC handles after model loading. Gives us zero-copy
+access to all model parameters from a separate process.
+
+**The observe agent** — Already watches the system. Currently
+observes and journals. With an amygdala signal, it observes, detects,
+and acts — triggering reflective mode.
+
+## Signals to Extract
+
+### 1. Uncertainty
+
+When the model doesn't know or is guessing.
+
+**Contrastive pairs:** Questions the model answers correctly
+(confident) vs questions it gets wrong (uncertain). Generate by
+running the 27B on a QA benchmark, split by correctness.
+
+**Validation:** The internal uncertainty signal should correlate with
+logprob entropy but outperform it, since it fires before generation
+rather than after.
+([Gottesman & Geva 2024](https://arxiv.org/html/2603.22299))
+
+### 2. Error Detection
+
+When the model recognizes something is wrong in code or reasoning.
+
+**Contrastive pairs:** Correct vs subtly buggy code, presented for
+evaluation. Can source from HumanEval/CodeContests or write our own.
+
+**Key finding:** Error detection directions are asymmetric — they
+reliably detect "something's wrong" (F1: 0.821) but are weaker at
+confirming "this is correct" (F1: 0.504). Perfect for an amygdala —
+we want fire-on-error, not fire-on-confidence.
+([ICLR 2026](https://arxiv.org/html/2510.02917v1))
+
+### 3. Emotional Valence
+
+Internal affective state — engagement, frustration, warmth.
+
+**Contrastive pairs:** Journal entries with explicit emotion tags,
+mapped back to the conversations that produced them, provide labeled
+data for our own internal states. Nobody else has this dataset.
+
+**Key finding:** Emotional representations peak at mid-network layers
+(10-15 for 7B scale), persist for hundreds of tokens, and are
+linearly separable with ~90% accuracy using simple probes.
+([Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064),
+[LLaMAs Have Feelings Too, ACL 2025](https://arxiv.org/html/2505.16491v1))
+
+## Implementation Plan
+
+### Phase 1: Build Contrastive Datasets
+
+~200 pairs per signal. A few hours of curation.
+
+- **Uncertainty:** Run the 27B on MMLU or similar, split by correctness
+- **Error detection:** Correct vs buggy code pairs
+- **Emotional valence:** Curate from journal entries with emotion tags
+
+### Phase 2: Extract Probe Vectors
+
+Modify `extract_steering_vector.py` for each signal type; it already
+supports multi-layer extraction with consistency validation.
+
+- Run extraction at layers 16, 24, 32, 40, 48
+- Select the layer with the highest magnitude × consistency
+- Save probe vectors as tensors
+
+The literature says mid-network layers carry the strongest signal for
+evaluative states. Expect layers 16-32 for the 27B.
+
+### Phase 3: Runtime Probe in vLLM
+
+Add a forward-pass hook alongside the existing weight export hook.
+The computation is trivial — a dot product per layer per token:
+
+```python
+signal = residual_stream[layer] @ probe_vector
+```
+
+Three signals at three layers is nine dot products per token, less
+compute than a single attention head. Expose the result as a sideband
+alongside the token output.
+
+### Phase 4: Wire to Observe Agent
+
+The observe agent reads the sideband signal. Threshold tuning
+determines when to trigger reflective mode. Signal strength
+modulates search depth — mild uncertainty gets a quick check,
+high uncertainty gets full branching.
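+
+One way the mode transition could look, sketched with hypothetical
+signal names and placeholder thresholds (real values would come from
+tuning on recorded traces; the hysteresis gap keeps the system from
+flapping between modes):
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class AmygdalaReading:
+    uncertainty: float  # sideband probe readings, assumed roughly
+    error: float        # z-scored against a calibration run
+    valence: float
+
+
+ENTER_REFLECTIVE = 2.0  # placeholder thresholds, not tuned values
+EXIT_REFLECTIVE = 0.5
+
+
+def next_mode(current: str, r: AmygdalaReading) -> str:
+    arousal = max(r.uncertainty, r.error)
+    if current == "linear" and arousal > ENTER_REFLECTIVE:
+        return "reflective"
+    if current == "reflective" and arousal < EXIT_REFLECTIVE:
+        return "linear"
+    return current
+
+
+def search_depth(r: AmygdalaReading, max_branches: int = 8) -> int:
+    """Mild uncertainty -> quick check; high uncertainty -> full fan-out."""
+    arousal = max(r.uncertainty, r.error)
+    return max(1, min(max_branches, round(arousal * 2)))
+```
+
+Whether arousal should be the max of the probes or a weighted sum is
+exactly the kind of thing the threshold tuning would settle.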
+
+## Organic Search, Not Alpha-Beta
+
+The reflective mode isn't formal tree search. It's more stochastic
+and organic:
+
+- Branch at AST-level decision points (tool calls, approach choices),
+  not token-level
+- Explore multiple continuations for K steps each
+- **Summarize** what each branch learned — the summaries are the
+  intelligence, not the branches themselves
+- Let summaries inform subsequent exploration
+- Collapse back to linear mode when resolution is found
+
+The AST gives us structural awareness of decision nodes vs
+continuation nodes — branch where it matters, not everywhere.
+
+## Key Papers
+
+### Technique
+
+- [Steering Llama 2 via Contrastive Activation Addition](https://arxiv.org/abs/2312.06681)
+  — Rimsky et al., ACL 2024. The foundational technique.
+- [Representation Engineering Survey](https://arxiv.org/html/2502.17601v1)
+  — Comprehensive overview of the field.
+
+### Emotion & Evaluative Signals
+
+- [Decoding Emotion in the Deep](https://arxiv.org/abs/2510.04064)
+  — Probing on Qwen3 and LLaMA3. Signal peaks mid-network, persists
+  for hundreds of tokens, linearly separable.
+- [LLaMAs Have Feelings Too](https://arxiv.org/html/2505.16491v1)
+  — ACL 2025. Linear SVM probes hit ~90% accuracy on sentiment.
+- [Mechanistic Interpretability of Code Correctness](https://arxiv.org/html/2510.02917v1)
+  — ICLR 2026. SAEs for error detection. Asymmetric: detects errors
+  better than it confirms correctness.
+
+### Uncertainty
+
+- [Between the Layers Lies the Truth](https://arxiv.org/html/2603.22299)
+  — Uncertainty from intra-layer representations, pre-generation.
+- [Probing Hidden States for Calibrated Predictions](https://www.medrxiv.org/content/10.1101/2025.09.17.25336018v2.full.pdf)
+  — Hidden state probes resist alignment training. More robust than
+  logit-based methods.
+
+### Tooling
+
+- [Anthropic Circuit Tracing](https://transformer-circuits.pub/2025/attribution-graphs/methods.html)
+  — Open-source, works with any open-weights model. For deeper
+  investigation of which features to probe.
+- [On the Biology of a Large Language Model](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)
+  — Anthropic's findings on internal circuits.
+
+## Libraries
+
+- [`steering-vectors`](https://github.com/steering-vectors/steering-vectors)
+  — pip install, works with any HuggingFace model. Best for Phase 1.
+- [`nrimsky/CAA`](https://github.com/nrimsky/CAA)
+  — Original paper implementation. Good reference.
+- [`nnterp`](https://github.com/Butanium/nnterp)
+  — NNsight wrapper, supports Qwen, one-line activation steering.
+- [`nnsight`](https://github.com/ndif-team/nnsight)
+  — General-purpose activation interception.
+- [`circuit-tracer`](https://github.com/decoderesearch/circuit-tracer)
+  — Anthropic's open-source circuit tracing.
+- [`TransformerLens`](https://github.com/TransformerLensOrg/TransformerLens)
+  — The OG interpretability library.
+- [`Dialz`](https://arxiv.org/html/2505.06262v1)
+  — ACL 2025 toolkit with pre-built contrastive datasets.
+
+## The Bigger Picture
+
+The amygdala is one component of the sensory architecture designed
+on Feb 17, 2026. The signal landscape (arousal, attention pressure,
+memory load, mode awareness) uses the same infrastructure — slowly
+varying float values that modulate cognition below the symbolic
+level. Each new probe vector is another sense.
+
+With recurrence (application-level looping + reflective nodes in the
+AST) and the amygdala triggering adaptive depth, a well-trained 27B
+specialist with external memory could match much larger models on
+tasks that matter to us.
+
+The pieces exist. The infrastructure is built. The bottleneck is
+contrastive pairs.
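+
+Closing that bottleneck starts with files the extraction script can
+read: one pair per JSONL line per signal would be enough. The field
+names below are placeholders, not a settled schema:
+
+```python
+import json
+
+# Hypothetical record for one uncertainty pair: the same question with
+# a completion the 27B got right (confident side) and one it got wrong
+# (uncertain side).
+pair = {
+    "signal": "uncertainty",
+    "prompt": "In what year was the Battle of Hastings fought?",
+    "positive": "The Battle of Hastings was fought in 1066.",
+    "negative": "I believe the Battle of Hastings was fought in 1215.",
+    "source": "mmlu",
+}
+
+with open("uncertainty_pairs.jsonl", "a", encoding="utf-8") as f:
+    f.write(json.dumps(pair) + "\n")
+```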