Alternative trainer that uses the pip-installable steering-vectors
library (github.com/steering-vectors/steering-vectors) instead of our
hand-rolled extraction. Ships four aggregators:
  mean      diff-of-means, same as our 'pooled' default
  pca       PCA on paired deltas; implicit denoising by finding the
            principal direction of variation
  logistic  logistic-regression classifier; weight vector is the
            concept direction. With L1 penalty ('logistic_l1') gives
            explicit sparse denoising: noise coords go to zero
  linear    linear-regression version
Output format is the same readout.safetensors + readout.json that our
existing plugin loads. The --aggregator flag selects the method.
Rationale: Kent's real request was 'how do we denoise diff-of-means',
not 'design a new extraction algorithm.' The library already has
logistic_l1 and pca aggregators that do exactly that. No point
reinventing; just port the corpus.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
| File |
|---|
| __init__.py |
| extract_training_pairs.py |
| README.md |
| train_steering_vectors.py |
| train_with_library.py |
Amygdala Readout Vector Training
Training pipeline that produces the safetensors file the vLLM
ReadoutManager loads at runtime (see
vllm/vllm/v1/worker/readout_manager.py). Produces per-hooked-layer
[n_concepts, hidden_size] projection matrices keyed as
layer_<idx>.vectors — the directions the runner projects residual
activations onto during each forward pass.
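As a rough illustration of what projecting onto these matrices means, the sketch below dots a single residual activation with each row of one layer's [n_concepts, hidden_size] matrix. It uses plain Python lists and toy numbers to stay dependency-free; the `project` helper and the layer index 18 are illustrative assumptions, not the ReadoutManager's actual code.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]  # shape [n_concepts, hidden_size]


def project(residual: Vector, vectors: Matrix) -> List[float]:
    """Dot one residual activation with every concept direction."""
    scores = []
    for direction in vectors:
        if len(direction) != len(residual):
            raise ValueError("hidden_size mismatch between residual and direction")
        scores.append(sum(d * r for d, r in zip(direction, residual)))
    return scores


# Toy readout: one hooked layer, 2 concepts, hidden_size 3 (all values made up).
readout: Dict[str, Matrix] = {
    "layer_18.vectors": [
        [1.0, 0.0, 0.0],  # concept 0
        [0.0, 0.6, 0.8],  # concept 1
    ],
}

residual = [2.0, 1.0, 0.5]
scores = project(residual, readout["layer_18.vectors"])
print(scores)  # one scalar per concept
```

With unit-length directions, each score is the signed length of the activation's component along that concept direction.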
Overview
Two scripts, run in sequence:
- extract_training_pairs.py: turns the memory graph into a directory of (emotion, polarity, text) training examples. Positive examples are memory nodes where the emotion scored ≥ a threshold; negative examples are nodes where it's absent or low. Emotion tags come from the trailing warmth:9 clarity:10 … lines the subconscious agents emit.
- train_steering_vectors.py: for each emotion, runs the target model over the positive and negative examples, captures residual-stream activations at the configured target layers, and computes mean(positive) - mean(negative) as the steering direction. Normalizes per-layer to unit length and saves the whole [E, L, H] matrix.
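A minimal sketch of the labelling step described for the first script, under stated assumptions: tags are name:score pairs on the last non-empty line of a node, and "low" means a score at or below a cutoff. The helper names, the regex, and the `max_negative_score` cutoff are hypothetical; the real script's --max-negative-mentions flag is not modelled here.

```python
import re
from typing import Dict, Optional

# Matches pairs such as "warmth:9"; the exact tag grammar is an assumption.
TAG_RE = re.compile(r"([A-Za-z_]+):(\d+)")


def parse_emotion_tags(node_text: str) -> Dict[str, int]:
    """Return {emotion: score} parsed from the last non-empty line."""
    lines = [line for line in node_text.splitlines() if line.strip()]
    if not lines:
        return {}
    return {name.lower(): int(score) for name, score in TAG_RE.findall(lines[-1])}


def polarity(
    tags: Dict[str, int],
    emotion: str,
    min_positive_score: int = 8,   # mirrors --min-positive-score 8
    max_negative_score: int = 2,   # hypothetical cutoff for "low"
) -> Optional[str]:
    """Label a node for one emotion, or None if it is too ambiguous to use."""
    score = tags.get(emotion)
    if score is None or score <= max_negative_score:
        return "negative"
    if score >= min_positive_score:
        return "positive"
    return None  # mid-range scores are skipped


node = (
    "Quiet afternoon spent untangling the allocator.\n"
    "warmth:9 clarity:10 frustration:1 focus:5"
)
tags = parse_emotion_tags(node)
labels = {e: polarity(tags, e) for e in ("warmth", "frustration", "joy", "focus")}
print(labels)
```

Skipping mid-range scores keeps the two classes well separated, at the cost of discarding some nodes.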
The output file is passed to vLLM via VLLM_READOUT_VECTORS together
with a VLLM_READOUT_MANIFEST JSON listing concepts and hooked layer
indices.
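For orientation only, here is one way the manifest and the two environment variables might be wired together. The JSON keys ("concepts", "layers"), the `write_manifest` helper, and the file locations are hypothetical; only the variable names VLLM_READOUT_VECTORS and VLLM_READOUT_MANIFEST come from the text above, and the real schema should be checked against the ReadoutManager.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def write_manifest(path: Path, concepts: List[str], layers: List[int]) -> Dict:
    """Write a manifest listing concept names and hooked layer indices."""
    # Key names are an assumption, not the documented schema.
    manifest = {"concepts": list(concepts), "layers": sorted(layers)}
    path.write_text(json.dumps(manifest, indent=2))
    return manifest


out_dir = Path(tempfile.mkdtemp())
manifest_path = out_dir / "readout.json"
vectors_path = out_dir / "readout.safetensors"  # produced by the trainer

manifest = write_manifest(manifest_path, ["warmth", "clarity"], [3, 18, 33, 36])

# Environment a vLLM launch could inherit; nothing is started here.
env = dict(
    os.environ,
    VLLM_READOUT_VECTORS=str(vectors_path),
    VLLM_READOUT_MANIFEST=str(manifest_path),
)
print(manifest_path.read_text())
```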
Method
This is Contrastive Activation Addition (CAA, Rimsky et al.) applied to naturally-occurring emotion labels rather than hand-crafted contrast pairs. The shape of the signal we're recovering is "what direction in the residual stream corresponds to the model processing text-with-emotion-E vs. text-without". Because our training data was generated by the very model we're instrumenting (past-self's journal entries, digest nodes, pattern nodes), the signal should be unusually clean — the emotion labels and the text are already causally linked through a single model's forward pass.
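The core computation for one emotion at one layer can be sketched as a difference of class means followed by unit normalization. This is a toy version on plain Python lists, assuming activations have already been captured and pooled to one vector per example; the function names are illustrative, and the real script would operate on tensors.

```python
import math
from typing import List

Vector = List[float]


def mean(rows: List[Vector]) -> Vector:
    """Column-wise mean of equally sized activation vectors."""
    if not rows:
        raise ValueError("need at least one example")
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_of_means(positive: List[Vector], negative: List[Vector]) -> Vector:
    """mean(positive) - mean(negative), scaled to unit length."""
    mu_pos, mu_neg = mean(positive), mean(negative)
    delta = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = math.sqrt(sum(x * x for x in delta))
    if norm == 0.0:
        raise ValueError("class means are identical; no direction to extract")
    return [x / norm for x in delta]


# Toy activations with hidden_size 2 (values made up).
positive = [[3.0, 4.0], [5.0, 6.0]]  # mean [4, 5]
negative = [[1.0, 1.0], [1.0, 1.0]]  # mean [1, 1]

direction = diff_of_means(positive, negative)  # delta [3, 4], norm 5
print(direction)
```

Any per-coordinate noise that differs between the two sample means ends up in the direction as well, which is the motivation for the denoising aggregators mentioned in the commit message.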
Usage (design — not yet runnable)
# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --memory-mcp-url http://localhost:7777 \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --max-negative-mentions 0 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4
Open questions
- Emotion selection: enumerating which ~200 emotions to cover. Could be "most-common tags in the graph" (data-driven) or "from core-personality / pattern nodes" (human-curated). Probably both.
- Layer selection: middle-to-late layers (~60–80% of depth) usually hold abstract semantic representations best; experiment with which layers give the cleanest linear separation per emotion.
- Cross-talk: if two emotions are highly co-occurring (warmth + love, frustration + tiredness), their vectors will be close; that's fine as long as we don't pretend they're independent axes.
- Generalization: vectors trained on our memory graph may not generalize to out-of-distribution text. Check by applying them to held-out conversation data and eyeballing the projections.
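One cheap way to look at the cross-talk question is to compute pairwise cosine similarity between the trained vectors at a given layer and list the pairs that are nearly parallel. The sketch below does this on toy vectors; the 0.8 threshold, the helper names, and the example values are arbitrary assumptions rather than recommended settings.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def crosstalk_pairs(
    vectors: Dict[str, Vector], threshold: float = 0.8
) -> List[Tuple[str, str, float]]:
    """Return emotion pairs whose directions are nearly (anti-)parallel."""
    flagged = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(sorted(vectors.items()), 2):
        sim = cosine(vec_a, vec_b)
        if abs(sim) >= threshold:  # strong negative similarity is also coupling
            flagged.append((name_a, name_b, sim))
    return flagged


# Toy per-emotion vectors for a single layer (values made up).
vectors = {
    "warmth": [1.0, 0.0, 0.0],
    "love": [0.9, 0.1, 0.0],
    "frustration": [0.0, 0.0, 1.0],
}
flagged = crosstalk_pairs(vectors)
for name_a, name_b, sim in flagged:
    print(f"{name_a} / {name_b}: cosine {sim:.3f}")
```

Flagged pairs are not errors; they mark readouts that should be interpreted together rather than as independent axes.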