# Amygdala Readout Vector Training

Training pipeline that produces the safetensors file the vLLM ReadoutManager loads at runtime (see `vllm/vllm/v1/worker/readout_manager.py`). The output is one `[n_concepts, hidden_size]` projection matrix per hooked layer, keyed as `layer_.vectors` — the directions the runner projects residual activations onto during each forward pass.

## Overview

Two scripts, run in sequence (illustrative sketches of each stage are collected at the end of this README):

1. **`extract_training_pairs.py`** — turns the memory graph into a directory of (emotion, polarity, text) training examples. Positive examples are memory nodes where the emotion scored ≥ a threshold; negative examples are nodes where it's absent or low. Emotion tags come from the trailing `warmth:9 clarity:10 …` lines the subconscious agents emit.
2. **`train_steering_vectors.py`** — for each emotion, runs the target model over the positive and negative examples, captures residual-stream activations at the configured target layers, and computes `mean(positive) - mean(negative)` as the steering direction. Each direction is normalized to unit length per layer, and the whole `[E, L, H]` tensor (emotions × layers × hidden size) is saved.

The output file is passed to vLLM via `VLLM_READOUT_VECTORS`, together with a `VLLM_READOUT_MANIFEST` JSON listing concepts and hooked layer indices.

## Method

This is Contrastive Activation Addition (CAA, Rimsky et al.) applied to naturally occurring emotion labels rather than hand-crafted contrast pairs. The signal we're recovering is "what direction in the residual stream corresponds to the model processing text-with-emotion-E vs. text-without". Because our training data was generated by the very model we're instrumenting (past-self's journal entries, digest nodes, pattern nodes), the signal should be unusually clean — the emotion labels and the text are already causally linked through a single model's forward pass.

## Usage (design — not yet runnable)

```
# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --memory-mcp-url http://localhost:7777 \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --max-negative-mentions 0 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4
```

## Open questions

- **Emotion selection**: enumerating which ~200 emotions to cover. Could be "most-common tags in the graph" (data-driven) or "from core-personality / pattern nodes" (human-curated). Probably both.
- **Layer selection**: middle-to-late layers (~60–80% of depth) usually hold abstract semantic representations best; experiment with which layers give the cleanest linear separation per emotion.
- **Cross-talk**: if two emotions are highly co-occurring (warmth + love, frustration + tiredness), their vectors will be close; that's fine as long as we don't pretend they're independent axes.
- **Generalization**: vectors trained on our memory graph may not generalize to out-of-distribution text. Check by applying them to held-out conversation data and eyeballing the projections.
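## Illustrative sketches

The snippets below are sketches of each stage, not the real scripts: function names, schemas, and defaults are assumptions unless the prose above states them.

First, the thresholding in step 1. This sketch classifies a single memory node into a positive or negative example for one emotion, assuming the node ends with a tag line like `warmth:9 clarity:10 …` (format inferred from the Overview; the real extractor's node shape and API may differ):

```python
import re

# Matches `name:score` pairs on the trailing tag line, e.g. `warmth:9`.
TAG_RE = re.compile(r"\b([a-z_]+):(\d+)\b")

def classify_node(content: str, emotion: str,
                  min_positive_score: int = 8,
                  max_negative_mentions: int = 0):
    """Classify one memory node as a training example for `emotion`.

    Returns ("positive", text), ("negative", text), or None if ambiguous.
    The tag-line format and the mention-counting interpretation of
    --max-negative-mentions are assumptions, not the real script's logic.
    """
    lines = content.rstrip().splitlines()
    if len(lines) < 2:
        return None
    *body, tag_line = lines
    scores = {name: int(val) for name, val in TAG_RE.findall(tag_line)}
    text = "\n".join(body).strip()

    if scores.get(emotion, 0) >= min_positive_score:
        return ("positive", text)
    # Negative only if the emotion is (almost) never mentioned anywhere in the node.
    if content.lower().count(emotion) <= max_negative_mentions:
        return ("negative", text)
    return None  # low-but-nonzero signal: skip rather than mislabel
```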
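Next, activation capture for step 2. A sketch of grabbing residual-stream activations at the target layers with forward hooks, assuming a HuggingFace-style decoder stack (`model.model.layers[i]`) and mean pooling over tokens; the real pipeline may hook a different point or pool differently:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def capture_layer_means(model, tokenizer, texts, layer_idxs, device="cuda"):
    """Return {layer_idx: [len(texts), hidden_size]} of mean-pooled residuals."""
    captured = {l: [] for l in layer_idxs}
    hooks = []

    def make_hook(l):
        def hook(_module, _inputs, output):
            # Decoder layers typically return a tuple with hidden states first.
            hidden = output[0] if isinstance(output, tuple) else output
            captured[l].append(hidden.mean(dim=1).float().cpu())  # pool over tokens
        return hook

    for l in layer_idxs:
        hooks.append(model.model.layers[l].register_forward_hook(make_hook(l)))
    try:
        for text in texts:  # batch-size 1 keeps pooling free of padding artifacts
            batch = tokenizer(text, return_tensors="pt", truncation=True).to(device)
            model(**batch)
    finally:
        for h in hooks:
            h.remove()
    return {l: torch.cat(acts) for l, acts in captured.items()}

# Usage sketch:
#   model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16).to("cuda")
#   tok = AutoTokenizer.from_pretrained(name)
#   pos_acts = capture_layer_means(model, tok, positive_texts, [3, 18, 33, 36])
```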
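The core computation from the Method section, plus the output layout from the top of this README, as a sketch. Note the key format is truncated in the prose (`layer_.vectors`); the `layer_{l}.vectors` f-string below is a guess, as is the nested-dict shape of the inputs:

```python
import torch
from safetensors.torch import save_file

def steering_direction(pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """mean(positive) - mean(negative), normalized to unit length.

    pos/neg: [n_examples, hidden_size] activations at one layer for one emotion.
    """
    d = pos.mean(dim=0) - neg.mean(dim=0)
    return d / d.norm()

def save_vectors(pos_acts, neg_acts, emotions, layers, path, dtype=torch.bfloat16):
    """Stack one [n_concepts, hidden_size] matrix per hooked layer and save.

    pos_acts/neg_acts are assumed to be {layer: {emotion: tensor}} dicts.
    """
    tensors = {}
    for l in layers:
        mat = torch.stack([steering_direction(pos_acts[l][e], neg_acts[l][e])
                           for e in emotions])
        tensors[f"layer_{l}.vectors"] = mat.to(dtype)  # key format is a guess
    save_file(tensors, path)
```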
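Finally, wiring the output into vLLM. The env var names come from the Overview, but the manifest schema is a guess from its one-line description ("concepts and hooked layer indices"); check `readout_manager.py` for the real key names:

```python
import json
import os

# Hypothetical manifest schema: key names are assumptions, not the real format.
manifest = {
    "concepts": ["warmth", "clarity", "frustration"],  # order matches dim 0 of each matrix
    "layers": [3, 18, 33, 36],                         # order matches the hooked layers
}
with open("/tmp/amygdala_manifest.json", "w") as f:
    json.dump(manifest, f)

os.environ["VLLM_READOUT_VECTORS"] = "/path/to/amygdala_vectors.safetensors"
os.environ["VLLM_READOUT_MANIFEST"] = "/tmp/amygdala_manifest.json"
```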