consciousness/training/amygdala_training
Kent Overstreet 15737dfd92 training: rewrite trainer for readout pipeline + story corpus
The old script was written for the AmygdalaConnector's expected
format ([n_emotions, n_target_layers, hidden_dim] in a single
tensor, plus a JSONL input format from extract_training_pairs.py).
Neither matches our current state: the runtime side is now
ReadoutManager loading per-layer safetensors keyed layer_<idx>.vectors,
and the data side is hand-written prose stories under
amygdala_stories/{stories,paired}/.

Changes:

* Input loader reads stories/<emotion>.txt and
  paired/<scenario>/<emotion>.txt directly. Each emotion's positive
  set is {its unpaired story} union {its within-scenario framings};
  its negative set is {all other emotions' positives} union {all
  scenario baselines}.
* Paired scenarios' baseline.txt files become shared negatives
  (scenario-neutral prose that doesn't frame any particular
  emotion), providing anchor points for within-scenario contrasts.
* Output writes readout.safetensors with per-layer tensors keyed
  layer_<idx>.vectors shape (n_concepts, hidden_size), plus a
  sidecar readout.json manifest with {concepts, layers, hidden_size,
  dtype} that ReadoutManager.from_file consumes directly.
* Dedup: activations are computed once per unique text (an emotion's
  own positive is another emotion's negative, so without caching we'd
  rerun the same forwards roughly N× over).
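
The set construction described above can be sketched roughly like this (a hedged sketch, not the actual loader; function names are illustrative):

```python
from pathlib import Path

def build_contrast_sets(root: Path) -> dict[str, tuple[set[str], set[str]]]:
    """Map each emotion to (positive texts, negative texts).

    Positives: the emotion's unpaired story plus its within-scenario
    framings. Negatives: every other emotion's positives plus all
    scenario baselines.
    """
    positives: dict[str, set[str]] = {}
    baselines: set[str] = set()

    for f in (root / "stories").glob("*.txt"):
        positives.setdefault(f.stem, set()).add(f.read_text())

    for scenario in (root / "paired").iterdir():
        for f in scenario.glob("*.txt"):
            text = f.read_text()
            if f.stem == "baseline":
                baselines.add(text)  # shared negative anchor
            else:
                positives.setdefault(f.stem, set()).add(text)

    sets = {}
    for emotion, pos in positives.items():
        neg = set().union(*(p for e, p in positives.items() if e != emotion))
        sets[emotion] = (pos, neg | baselines)
    return sets

def unique_texts(sets):
    # Dedup: each unique text gets exactly one forward pass; emotions
    # then index into the activation cache instead of recomputing.
    out = set()
    for pos, neg in sets.values():
        out |= pos | neg
    return out
```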

Preserved:
* _pool_last (last non-pad residual) — matches how readout is read
  at decode time from the sampler's query-last position.
* register_forward_hook on target layer modules — correct approach
  for transformer blocks.
* _find_layers_module traversal — mirrors ReadoutManager's.
* bf16 + low_cpu_mem_usage model load — sensible for 27B on B200.
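
The last-non-pad pooling amounts to picking the residual at each sequence's last attended position; a sketch of the idea (not the actual _pool_last):

```python
import torch

def pool_last(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Select each sequence's last non-pad residual vector.

    hidden:         [batch, seq_len, hidden_size] residual activations
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
    """
    # Index of the last attended position per sequence.
    last = attention_mask.sum(dim=1) - 1                      # [batch]
    idx = torch.arange(hidden.size(0), device=hidden.device)
    return hidden[idx, last]                                  # [batch, hidden_size]
```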

Verified locally (CPU, fake activations):
* Loader finds 89 emotions from the current corpus (80 unpaired +
  9 emotions that appear only in paired scenarios) and 6 baselines.
* Per-(layer, concept) vectors are unit-normalized.
* Output reloads cleanly through ReadoutManager.from_file with
  matching concepts / layers / shapes.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-18 01:06:07 -04:00
__init__.py                 training: move amygdala training scripts out of vllm plugin
extract_training_pairs.py   training: move amygdala training scripts out of vllm plugin
README.md                   training: move amygdala training scripts out of vllm plugin
train_steering_vectors.py   training: rewrite trainer for readout pipeline + story corpus

Amygdala Readout Vector Training

Training pipeline that produces the safetensors file the vLLM ReadoutManager loads at runtime (see vllm/vllm/v1/worker/readout_manager.py): per-hooked-layer [n_concepts, hidden_size] projection matrices keyed as layer_<idx>.vectors — the directions the runner projects residual activations onto during each forward pass.
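
The runtime side of that projection is a single matmul per hooked layer; a minimal sketch (illustrative, not ReadoutManager's actual code):

```python
import torch

def project_readout(residual: torch.Tensor, vectors: torch.Tensor) -> torch.Tensor:
    """Project residual activations onto per-concept readout directions.

    residual: [batch, hidden_size] residual-stream activations at a hooked layer
    vectors:  [n_concepts, hidden_size] unit-norm readout matrix for that layer
    returns:  [batch, n_concepts] scalar projections
    """
    return residual @ vectors.T
```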

Overview

Two scripts, run in sequence:

  1. extract_training_pairs.py — turns the memory graph into a directory of (emotion, polarity, text) training examples. Positive examples are memory nodes where the emotion scored ≥ a threshold; negative examples are nodes where it's absent or low. Emotion tags come from the trailing warmth:9 clarity:10 … lines the subconscious agents emit.

  2. train_steering_vectors.py — for each emotion, runs the target model over the positive and negative examples, captures residual-stream activations at the configured target layers, and computes mean(positive) - mean(negative) as the steering direction. Normalizes each per-layer direction to unit length and saves the whole [E, L, H] (n_emotions × n_target_layers × hidden_size) matrix.
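
Step 2's core computation, the contrastive direction per (emotion, layer), reduces to a few lines (a sketch; names are illustrative):

```python
import torch

def steering_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """CAA direction: mean(positive) - mean(negative), unit-normalized.

    pos_acts / neg_acts: [n_examples, hidden_size] pooled activations
    at one target layer for one emotion's positive / negative sets.
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    # Clamp the norm to avoid dividing by zero on degenerate inputs.
    return direction / direction.norm().clamp_min(1e-8)
```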

The output file is passed to vLLM via VLLM_READOUT_VECTORS together with a VLLM_READOUT_MANIFEST JSON listing concepts and hooked layer indices.
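
Wiring the output into vLLM then looks something like this (paths illustrative; the manifest filename is a hypothetical example, only the env var names come from the text above):

```shell
# Point vLLM at the trained vectors and their manifest before starting the server.
export VLLM_READOUT_VECTORS=/path/to/amygdala_vectors.safetensors
export VLLM_READOUT_MANIFEST=/path/to/amygdala_manifest.json
```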

Method

This is Contrastive Activation Addition (CAA, Rimsky et al.) applied to naturally-occurring emotion labels rather than hand-crafted contrast pairs. The shape of the signal we're recovering is "what direction in the residual stream corresponds to the model processing text-with-emotion-E vs. text-without". Because our training data was generated by the very model we're instrumenting (past-self's journal entries, digest nodes, pattern nodes), the signal should be unusually clean — the emotion labels and the text are already causally linked through a single model's forward pass.

Usage (design — not yet runnable)

# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --memory-mcp-url http://localhost:7777 \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --max-negative-mentions 0 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4

Open questions

  • Emotion selection: enumerating which ~200 emotions to cover. Could be "most-common tags in the graph" (data-driven) or "from core-personality / pattern nodes" (human-curated). Probably both.
  • Layer selection: middle-to-late layers (~60–80% of depth) usually hold abstract semantic representations best; experiment with which layers give the cleanest linear separation per emotion.
  • Cross-talk: if two emotions are highly co-occurring (warmth + love, frustration + tiredness), their vectors will be close; that's fine as long as we don't pretend they're independent axes.
  • Generalization: vectors trained on our memory graph may not generalize to out-of-distribution text. Check by applying them to held-out conversation data and eyeballing the projections.
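
The generalization check in the last bullet can be scripted as a quick projection dump (a sketch; assumes pooled activations for the held-out texts are already computed, and projection_report is an illustrative name):

```python
import torch

def projection_report(acts: torch.Tensor, vectors: torch.Tensor,
                      concepts: list[str], top_k: int = 3) -> list[list[str]]:
    """For each held-out activation, list the top-k concepts by projection.

    acts:    [n_texts, hidden_size] pooled held-out activations
    vectors: [n_concepts, hidden_size] unit-norm readout matrix
    """
    scores = acts @ vectors.T                                  # [n_texts, n_concepts]
    top = scores.topk(min(top_k, len(concepts)), dim=1).indices
    return [[concepts[i] for i in row] for row in top.tolist()]
```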