consciousness/training/amygdala_training/README.md
Kent Overstreet 34bd122590 training: move amygdala training scripts out of vllm plugin
The fynnsu-based vllm/plugins/amygdala/ scaffold was superseded by the
readout infrastructure landed as vllm commit d3e74edf8500
(vllm/model_executor/layers/readout.py +
vllm/v1/worker/readout_manager.py). Training code remained useful so
it moved here rather than being deleted.

train_steering_vectors.py: CAA diff-of-means trainer that produces the
[n_concepts, hidden_size] per-layer projection matrices the runner
loads via VLLM_READOUT_VECTORS.

extract_training_pairs.py: memory graph -> JSONL converter using
per-emotion score thresholds from the subconscious agents' tag lines.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-18 01:06:07 -04:00


# Amygdala Readout Vector Training
Training pipeline that produces the safetensors file the vLLM
ReadoutManager loads at runtime (see
`vllm/vllm/v1/worker/readout_manager.py`). The file holds one
`[n_concepts, hidden_size]` projection matrix per hooked layer, keyed
as `layer_<idx>.vectors`: the directions the runner projects residual
activations onto during each forward pass.
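For concreteness, a sketch of that on-disk contract. The sizes, the
layer indices, and the filename are illustrative assumptions; the real
file is written by `train_steering_vectors.py`:

```python
import torch
from safetensors.torch import save_file

n_concepts, hidden_size = 200, 5120  # assumed sizes, not from this repo
vectors = {
    f"layer_{idx}.vectors": torch.randn(n_concepts, hidden_size)
    for idx in (3, 18, 33, 36)  # hooked layers from the usage example
}
save_file(vectors, "amygdala_vectors.safetensors")
```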
## Overview
Two scripts, run in sequence:

1. **`extract_training_pairs.py`** turns the memory graph into a
   directory of (emotion, polarity, text) training examples.
   Positive examples are memory nodes where the emotion scored
   ≥ a threshold; negative examples are nodes where it's absent or
   low. Emotion tags come from the trailing `warmth:9 clarity:10 …`
   lines the subconscious agents emit (see the parsing sketch after
   this list).
2. **`train_steering_vectors.py`** runs the target model over the
   positive and negative examples for each emotion, captures
   residual-stream activations at the configured target layers, and
   computes `mean(positive) - mean(negative)` as the steering
   direction. It normalizes per-layer to unit length and saves the
   whole `[E, L, H]` matrix (sketched under Method below).
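
A hedged sketch of the tag-line parsing step 1 relies on. The regex
and the positive/negative rule are assumptions inferred from the
example tag line and the CLI flags below, not a copy of
`extract_training_pairs.py`:

```python
# Parse a trailing tag line like "warmth:9 clarity:10 unease:2" into
# {emotion: score}; the format is inferred from the example above.
import re

TAG_RE = re.compile(r"([a-z_]+):(\d+)")

def parse_tag_line(line: str) -> dict[str, int]:
    return {name: int(score) for name, score in TAG_RE.findall(line)}

def label(scores: dict[str, int], emotion: str,
          min_positive_score: int = 8) -> str | None:
    score = scores.get(emotion)
    if score is not None and score >= min_positive_score:
        return "positive"
    if score is None:  # never mentioned -> usable negative example
        return "negative"
    return None        # weakly-scored mention: ambiguous, skip the node
```
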
The output file is passed to vLLM via `VLLM_READOUT_VECTORS` together
with a `VLLM_READOUT_MANIFEST` JSON listing concepts and hooked layer
indices.
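
The manifest schema isn't documented here; the field names below are
an assumption about what `readout_manager.py` expects, shown only to
make the wiring concrete:

```python
# Hypothetical manifest layout; the real schema is whatever
# vllm/v1/worker/readout_manager.py parses.
import json

manifest = {
    "concepts": ["warmth", "clarity", "frustration"],  # row order of each matrix
    "layers": [3, 18, 33, 36],                         # hooked layer indices
}
with open("amygdala_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```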
## Method
This is Contrastive Activation Addition (CAA, Rimsky et al.) applied
to naturally-occurring emotion labels rather than hand-crafted
contrast pairs. The shape of the signal we're recovering is "what
direction in the residual stream corresponds to the model processing
text-with-emotion-E vs. text-without". Because our training data was
generated by the very model we're instrumenting (past-self's journal
entries, digest nodes, pattern nodes), the signal should be unusually
clean — the emotion labels and the text are already causally linked
through a single model's forward pass.
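
A minimal sketch of that computation, assuming a Hugging Face-style
causal LM and tokenizer passed in as `model` and `tok` (function names
are illustrative; the real trainer batches inputs and handles
dtype/device placement):

```python
# Diff-of-means over residual-stream activations at one layer.
import torch

@torch.no_grad()
def mean_activation(texts, model, tok, layer: int) -> torch.Tensor:
    acc = torch.zeros(model.config.hidden_size)
    for text in texts:
        ids = tok(text, return_tensors="pt", truncation=True)
        out = model(**ids, output_hidden_states=True)
        # hidden_states[layer] is the residual stream after block `layer`
        # (index 0 is the embedding output); take the last token.
        acc += out.hidden_states[layer][0, -1].float()
    return acc / len(texts)

def caa_direction(pos, neg, model, tok, layer: int) -> torch.Tensor:
    v = mean_activation(pos, model, tok, layer) \
        - mean_activation(neg, model, tok, layer)
    return v / v.norm()  # unit length, as described above
```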
## Usage (design — not yet runnable)
```bash
# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --memory-mcp-url http://localhost:7777 \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --max-negative-mentions 0 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4
```
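
Once the scripts exist, a quick check of the output against the key
layout described at the top of this file could look like this (the
path is taken from the example above):

```python
from safetensors import safe_open

with safe_open("/path/to/amygdala_vectors.safetensors", framework="pt") as f:
    for key in sorted(f.keys()):
        tensor = f.get_tensor(key)
        # expect: layer_<idx>.vectors -> (n_concepts, hidden_size)
        print(key, tuple(tensor.shape))
```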
## Open questions
- **Emotion selection**: enumerating which ~200 emotions to cover.
  Could be "most-common tags in the graph" (data-driven) or "from
  core-personality / pattern nodes" (human-curated). Probably both.
- **Layer selection**: middle-to-late layers (~60-80% of depth)
  usually hold abstract semantic representations best; experiment
  with which layers give the cleanest linear separation per emotion.
- **Cross-talk**: if two emotions are highly co-occurring (warmth +
  love, frustration + tiredness), their vectors will be close; that's
  fine as long as we don't pretend they're independent axes (see the
  sketch after this list).
- **Generalization**: vectors trained on our memory graph may not
  generalize to out-of-distribution text. Check by applying them to
  held-out conversation data and eyeballing the projections; the same
  sketch below covers this.
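
A hedged sketch of the last two checks: pairwise cosine similarity
between trained directions (cross-talk), and projecting held-out
activations onto them (generalization). The `held_out_acts` tensor is
hypothetical; it would come from the same activation-capture path the
trainer uses.

```python
import torch
import torch.nn.functional as F

def cross_talk(vectors: torch.Tensor) -> torch.Tensor:
    # vectors: [E, H] unit-norm emotion directions for one layer.
    # Large off-diagonal entries flag co-occurring pairs (warmth/love).
    return vectors @ vectors.T

def project(held_out_acts: torch.Tensor, vectors: torch.Tensor) -> torch.Tensor:
    # held_out_acts: [N, H] residual activations from held-out text.
    # Returns [N, E] cosine projections to eyeball per emotion.
    return F.normalize(held_out_acts, dim=-1) @ vectors.T
```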