# Amygdala Readout Vector Training
Training pipeline that produces the safetensors file the vLLM
ReadoutManager loads at runtime (see
`vllm/vllm/v1/worker/readout_manager.py`). The output holds one
`[n_concepts, hidden_size]` projection matrix per hooked layer, keyed
as `layer_<idx>.vectors`: the directions the runner projects residual
activations onto during each forward pass.
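
For reference, a minimal sketch of reading that file with the
`safetensors` library, assuming only the key layout described above
(the path is a placeholder):
```
from safetensors import safe_open

# Iterate the per-layer projection matrices; keys follow the
# "layer_<idx>.vectors" convention described above.
with safe_open("amygdala_vectors.safetensors", framework="pt") as f:
    for key in f.keys():                 # e.g. "layer_18.vectors"
        vectors = f.get_tensor(key)      # shape [n_concepts, hidden_size]
        print(key, tuple(vectors.shape))
```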
## Overview
Two scripts, run in sequence:
1. **`extract_training_pairs.py`** — turns the memory graph into a
   directory of (emotion, polarity, text) training examples.
   Positive examples are memory nodes where the emotion scored
   ≥ a threshold; negative examples are nodes where it is absent or
   scored low. Emotion tags come from the trailing
   `warmth:9 clarity:10 …` lines the subconscious agents emit
   (a parsing sketch follows this list).
2. **`train_steering_vectors.py`** — for each emotion, runs the
   target model over the positive and negative examples, captures
   residual-stream activations at the configured target layers, and
   computes `mean(positive) - mean(negative)` as the steering
   direction. Normalizes each per-layer vector to unit length and
   saves the whole `[E, L, H]` (emotions × layers × hidden size)
   matrix.
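
For step 1, the tag parsing could look like the sketch below. The
regex, function names, and threshold handling are illustrative rather
than the script's actual code, but they mirror the
`--min-positive-score` and `--max-negative-mentions` flags from the
usage section.
```
import re

TAG_RE = re.compile(r"\b([a-z_]+):(\d+)\b")

def emotion_scores(node_text: str) -> dict[str, int]:
    """Parse a trailing 'warmth:9 clarity:10 ...' line into {emotion: score}."""
    last_line = node_text.rstrip().rsplit("\n", 1)[-1]
    return {name: int(score) for name, score in TAG_RE.findall(last_line)}

def classify(node_text: str, emotion: str,
             min_positive_score: int = 8) -> str | None:
    """Positive above the score threshold; negative only if never mentioned."""
    scores = emotion_scores(node_text)
    if scores.get(emotion, 0) >= min_positive_score:
        return "positive"
    if emotion not in scores:        # --max-negative-mentions 0
        return "negative"
    return None                      # mentioned but low-scoring: skip the node
```
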
The output file is passed to vLLM via `VLLM_READOUT_VECTORS` together
with a `VLLM_READOUT_MANIFEST` JSON listing concepts and hooked layer
indices.
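
The manifest schema below is an assumption inferred from that
description (the authoritative format is whatever
`readout_manager.py` parses); a sketch of emitting it next to the
vectors file:
```
import json

# Hypothetical manifest: concept names in the row order of each
# layer_<idx>.vectors matrix, plus the hooked layer indices.
manifest = {
    "concepts": ["warmth", "clarity", "frustration"],
    "layers": [3, 18, 33, 36],       # matches --target-layers below
}
with open("readout_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```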
## Method
This is Contrastive Activation Addition (CAA, Rimsky et al.) applied
to naturally occurring emotion labels rather than hand-crafted
contrast pairs. The shape of the signal we're recovering is "what
direction in the residual stream corresponds to the model processing
text-with-emotion-E vs. text-without". Because our training data was
generated by the very model we're instrumenting (past-self's journal
entries, digest nodes, pattern nodes), the signal should be unusually
clean — the emotion labels and the text are already causally linked
through a single model's forward pass.
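
Condensed, the per-layer computation reduces to a difference of class
means. The sketch below assumes an HF-style decoder that exposes
`model.model.layers` and mean-pools activations over token positions;
both the hook placement and the pooling are assumptions, not the
script's committed behavior:
```
import torch

@torch.no_grad()
def capture_layer_mean(model, input_ids, layer_idx):
    """Mean-pooled residual-stream activations at one decoder layer."""
    captured = {}

    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden.mean(dim=1)   # pool over sequence positions

    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        model(input_ids)
    finally:
        handle.remove()
    return captured["h"]                     # [batch, hidden_size]

def caa_direction(pos_acts, neg_acts):
    """CAA direction: difference of class means, normalized to unit length."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()
```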
## Usage (design — not yet runnable)
```
# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --memory-mcp-url http://localhost:7777 \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --max-negative-mentions 0 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4
```
## Open questions
- **Emotion selection**: enumerating which ~200 emotions to cover.
Could be "most-common tags in the graph" (data-driven) or "from
core-personality / pattern nodes" (human-curated). Probably both.
- **Layer selection**: middle-to-late layers (~60-80% of depth)
  usually hold abstract semantic representations best; experiment
  with which layers give the cleanest linear separation per emotion.
- **Cross-talk**: if two emotions are highly co-occurring (warmth +
  love, frustration + tiredness), their vectors will be close; that's
  fine as long as we don't pretend they're independent axes (see the
  similarity check after this list).
- **Generalization**: vectors trained on our memory graph may not
generalize to out-of-distribution text. Check by applying them to
held-out conversation data and eyeballing the projections.
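
One cheap check for the cross-talk question: because training
normalizes each vector to unit length, pairwise cosine similarities
at a layer are just a matrix product of the rows (file and key names
as above):
```
from safetensors.torch import load_file

# Off-diagonal entries near 1 flag near-collinear emotions (e.g. warmth/love).
vectors = load_file("amygdala_vectors.safetensors")["layer_18.vectors"]
cosine = vectors @ vectors.T     # [n_concepts, n_concepts]; rows are unit-norm
print(cosine)
```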