The fynnsu-based vllm/plugins/amygdala/ scaffold was superseded by the readout infrastructure landed as vllm commit d3e74edf8500 (vllm/model_executor/layers/readout.py + vllm/v1/worker/readout_manager.py). Training code remained useful so it moved here rather than being deleted.

train_steering_vectors.py: CAA diff-of-means trainer that produces the [n_concepts, hidden_size] per-layer projection matrices the runner loads via VLLM_READOUT_VECTORS.

extract_training_pairs.py: memory graph -> JSONL converter using per-emotion score thresholds from the subconscious agents' tag lines.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>

# Amygdala Readout Vector Training

Training pipeline that produces the safetensors file the vLLM
ReadoutManager loads at runtime (see
`vllm/vllm/v1/worker/readout_manager.py`). Produces per-hooked-layer
`[n_concepts, hidden_size]` projection matrices keyed as
`layer_<idx>.vectors`; these are the directions the runner projects
residual activations onto during each forward pass.

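As a rough illustration of what "projecting onto" one layer's matrix
means, here is a minimal sketch using plain Python lists in place of
tensors. The `project` helper and the toy values are assumptions for
illustration only; this is not the ReadoutManager API.

```python
def project(vectors, activation):
    """Dot each concept direction with one residual activation.

    vectors:    [n_concepts][hidden_size] nested list (one layer's matrix)
    activation: [hidden_size] list (one token's residual at that layer)
    returns:    [n_concepts] list of scalar readouts
    """
    return [sum(v * a for v, a in zip(row, activation)) for row in vectors]


# Toy layer with n_concepts=2 and hidden_size=3, keyed like the file layout.
state = {"layer_18.vectors": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]}
readout = project(state["layer_18.vectors"], [2.0, -0.5, 3.0])
```

Each entry of `readout` is one concept's scalar score for that token at
that layer.
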
## Overview

Two scripts, run in sequence:

1. **`extract_training_pairs.py`**: turns the memory graph into a
   directory of (emotion, polarity, text) training examples.
   Positive examples are memory nodes where the emotion scored
   ≥ a threshold; negative examples are nodes where it's absent or
   low. Emotion tags come from the trailing `warmth:9 clarity:10 …`
   lines the subconscious agents emit.

2. **`train_steering_vectors.py`**: for each emotion, runs the
   target model over the positive and negative examples, captures
   residual-stream activations at the configured target layers, and
   computes `mean(positive) - mean(negative)` as the steering
   direction. Normalizes per-layer to unit length and saves the
   whole `[E, L, H]` matrix (emotions × target layers × hidden size).

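The positive/negative split in step 1 can be sketched roughly as
follows. This assumes the tag line is the last non-empty line of a node
and consists of whitespace-separated `name:score` pairs; the helper
names and the `max_negative_score` cutoff are hypothetical and not
taken from `extract_training_pairs.py`.

```python
import re

# Assumed tag format: lowercase name, colon, integer score.
TAG_RE = re.compile(r"([a-z_]+):(\d+)")


def parse_tag_line(text):
    """Return {emotion: score} parsed from the last non-empty line."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        return {}
    return {name: int(score) for name, score in TAG_RE.findall(lines[-1])}


def polarity(text, emotion, min_positive_score=8, max_negative_score=2):
    """Classify a node for one emotion.

    Returns "positive" if scored at or above min_positive_score,
    "negative" if the emotion is absent or scored at or below
    max_negative_score (hypothetical cutoff), and None for ambiguous
    mid-range scores, which this sketch skips.
    """
    score = parse_tag_line(text).get(emotion)
    if score is None or score <= max_negative_score:
        return "negative"
    if score >= min_positive_score:
        return "positive"
    return None


node = "Long talk today. Felt settled afterwards.\nwarmth:9 clarity:10 frustration:1"
labels = {e: polarity(node, e) for e in ("warmth", "frustration", "joy")}
```

Dropping mid-range scores keeps the two groups well separated, at the
cost of fewer examples per emotion.
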
The output file is passed to vLLM via `VLLM_READOUT_VECTORS` together
with a `VLLM_READOUT_MANIFEST` JSON listing concepts and hooked layer
indices.

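Since the vectors file and the manifest have to agree, a small
consistency check before launching vLLM may be useful. The manifest
schema is not specified in this README: the `concepts` and `layers`
keys below are guesses for illustration, and `check_manifest` is a
hypothetical helper.

```python
import json


def check_manifest(manifest, shapes):
    """Report layers whose vectors are missing or have the wrong row count.

    manifest: dict with hypothetical keys "concepts" and "layers"
    shapes:   {tensor_key: (rows, cols)} as read from the vectors file
    """
    n_concepts = len(manifest["concepts"])
    problems = []
    for idx in manifest["layers"]:
        key = f"layer_{idx}.vectors"
        if key not in shapes:
            problems.append(f"missing {key}")
        elif shapes[key][0] != n_concepts:
            problems.append(f"{key}: {shapes[key][0]} rows, expected {n_concepts}")
    return problems


# Toy manifest and shapes; layer 33 deliberately has one row too many.
manifest = json.loads('{"concepts": ["warmth", "clarity"], "layers": [18, 33]}')
shapes = {"layer_18.vectors": (2, 8), "layer_33.vectors": (3, 8)}
problems = check_manifest(manifest, shapes)
```
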
## Method

This is Contrastive Activation Addition (CAA, Rimsky et al.) applied
to naturally-occurring emotion labels rather than hand-crafted
contrast pairs. The shape of the signal we're recovering is "what
direction in the residual stream corresponds to the model processing
text-with-emotion-E vs. text-without". Because our training data was
generated by the very model we're instrumenting (past-self's journal
entries, digest nodes, pattern nodes), the signal should be unusually
clean: the emotion labels and the text are already causally linked
through a single model's forward pass.

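The core computation is small enough to sketch in plain Python for a
single emotion at a single layer. This uses lists rather than tensors
and assumes each example has already been reduced to one
`hidden_size` vector; the function names are illustrative, not those in
`train_steering_vectors.py`.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def caa_direction(positive, negative):
    """mean(positive) - mean(negative), scaled to unit L2 norm."""
    diff = [p - q for p, q in zip(mean_vector(positive), mean_vector(negative))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("positive and negative means coincide")
    return [x / norm for x in diff]


# Toy activations with hidden_size=2.
pos = [[4.0, 1.0], [2.0, 1.0]]  # mean is (3, 1)
neg = [[0.0, 5.0], [0.0, 5.0]]  # mean is (0, 5)
direction = caa_direction(pos, neg)  # difference (3, -4) has norm 5
```

Normalizing to unit length means readouts are comparable in scale
across emotions within a layer, though the raw difference norm (a rough
measure of how separable the two groups are) is discarded.
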
## Usage (design — not yet runnable)

```
# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --memory-mcp-url http://localhost:7777 \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --max-negative-mentions 0 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4
```

## Open questions

- **Emotion selection**: enumerating which ~200 emotions to cover.
  Could be "most-common tags in the graph" (data-driven) or "from
  core-personality / pattern nodes" (human-curated). Probably both.
- **Layer selection**: middle-to-late layers (~60–80% of depth)
  usually hold abstract semantic representations best; experiment
  with which layers give the cleanest linear separation per emotion.
- **Cross-talk**: if two emotions are highly co-occurring (warmth +
  love, frustration + tiredness), their vectors will be close; that's
  fine as long as we don't pretend they're independent axes.
- **Generalization**: vectors trained on our memory graph may not
  generalize to out-of-distribution text. Check by applying them to
  held-out conversation data and eyeballing the projections.
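
One way to make the cross-talk question concrete is to list the pairs
of emotion vectors whose cosine similarity at a given layer is high. A
minimal sketch, assuming the vectors for one layer are available as a
name-to-list mapping; the `0.8` threshold is arbitrary and
`crosstalk_pairs` is a hypothetical helper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def crosstalk_pairs(vectors, threshold=0.8):
    """Return (name_a, name_b, cosine) for every pair at or above threshold."""
    names = sorted(vectors)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            c = cosine(vectors[a], vectors[b])
            if c >= threshold:
                pairs.append((a, b, c))
    return pairs


# Toy vectors: warmth and love nearly aligned, frustration orthogonal.
vectors = {
    "warmth": [3.0, 4.0, 0.0],
    "love": [4.0, 3.0, 0.0],
    "frustration": [0.0, 0.0, 1.0],
}
pairs = crosstalk_pairs(vectors)
```

Pairs that show up here are candidates for being reported together
rather than read as independent axes.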