training: move amygdala training scripts out of vllm plugin

The fynnsu-based vllm/plugins/amygdala/ scaffold was superseded by the
readout infrastructure landed as vllm commit d3e74edf8500
(vllm/model_executor/layers/readout.py +
vllm/v1/worker/readout_manager.py). The training code is still
useful, so move it here rather than deleting it.

train_steering_vectors.py: CAA diff-of-means trainer that produces the
[n_concepts, hidden_size] per-layer projection matrices the runner
loads via VLLM_READOUT_VECTORS.

extract_training_pairs.py: memory graph -> JSONL converter using
per-emotion score thresholds from the subconscious agents' tag lines.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-18 00:07:58 -04:00
parent ec7568c726
commit 34bd122590
4 changed files with 545 additions and 0 deletions

training/amygdala_training/README.md
@@ -0,0 +1,79 @@
# Amygdala Readout Vector Training
Training pipeline that produces the safetensors file the vLLM
ReadoutManager loads at runtime (see
`vllm/vllm/v1/worker/readout_manager.py`). Produces one
`[n_concepts, hidden_size]` projection matrix per hooked layer,
stacked into a single `vectors` tensor — the directions the runner
projects residual activations onto during each forward pass.
## Overview
Two scripts, run in sequence:
1. **`extract_training_pairs.py`** — turns the memory graph into a
directory of (emotion, polarity, text) training examples.
Positive examples are memory nodes where the emotion scored
≥ a threshold; negative examples are nodes where it's absent or
low. Emotion tags come from the trailing `warmth:9 clarity:10 …`
lines the subconscious agents emit.
2. **`train_steering_vectors.py`** — for each emotion, runs the
target model over the positive and negative examples, captures
residual-stream activations at the configured target layers, and
computes `mean(positive) - mean(negative)` as the steering
   direction. Normalizes per-layer to unit length and saves the
   whole `[E, L, H]` tensor (emotions × layers × hidden size).
The output file is passed to vLLM via `VLLM_READOUT_VECTORS` together
with a `VLLM_READOUT_MANIFEST` JSON listing concepts and hooked layer
indices.
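For illustration, a manifest of roughly this shape (the field names
here are guesses; the authoritative schema is whatever
`readout_manager.py` parses):

```json
{
  "concepts": ["warmth", "clarity", "recognition"],
  "layers": [3, 18, 33, 36]
}
```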
## Method
This is Contrastive Activation Addition (CAA, Rimsky et al.) applied
to naturally-occurring emotion labels rather than hand-crafted
contrast pairs. The shape of the signal we're recovering is "what
direction in the residual stream corresponds to the model processing
text-with-emotion-E vs. text-without". Because our training data was
generated by the very model we're instrumenting (past-self's journal
entries, digest nodes, pattern nodes), the signal should be unusually
clean — the emotion labels and the text are already causally linked
through a single model's forward pass.
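Concretely, per emotion and per hooked layer the trainer reduces to
this (a minimal sketch; `train_steering_vectors.py` is the real
implementation):

```python
import torch

def caa_direction(pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """pos, neg: [n_examples, hidden_size] pooled last-token activations."""
    v = pos.mean(dim=0) - neg.mean(dim=0)  # difference of means
    return v / v.norm().clamp_min(1e-6)    # unit norm, comparable projections

# At runtime the readout is a dot product per token:
#   score = residual_activation @ v
```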
## Usage (design — not yet runnable)
```bash
# Step 1: memory graph → training data
python -m training.amygdala_training.extract_training_pairs \
    --output-dir /tmp/amygdala_training_data \
    --min-positive-score 8 \
    --min-content-chars 40 \
    --max-examples-per-emotion 500

# Step 2: training data → steering vectors
python -m training.amygdala_training.train_steering_vectors \
    --model Qwen/Qwen3.5-27B \
    --training-data-dir /tmp/amygdala_training_data \
    --target-layers 3,18,33,36 \
    --output /path/to/amygdala_vectors.safetensors \
    --dtype bf16 \
    --batch-size 4
```
## Open questions
- **Emotion selection**: enumerating which ~200 emotions to cover.
Could be "most-common tags in the graph" (data-driven) or "from
core-personality / pattern nodes" (human-curated). Probably both.
- **Layer selection**: middle-to-late layers (~60–80% of depth)
usually hold abstract semantic representations best; experiment
with which layers give the cleanest linear separation per emotion.
- **Cross-talk**: if two emotions are highly co-occurring (warmth +
love, frustration + tiredness), their vectors will be close; that's
fine as long as we don't pretend they're independent axes.
- **Generalization**: vectors trained on our memory graph may not
generalize to out-of-distribution text. Check by applying them to
held-out conversation data and eyeballing the projections.
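## Sanity-checking a trained file
A quick inspection once a vectors file exists (a sketch written
against the save format in `train_steering_vectors.py`; the
`warmth`/`love` pair is just an example of two co-occurring tags):

```python
import safetensors.torch
import torch

t = safetensors.torch.load_file("amygdala_vectors.safetensors")
vecs = t["vectors"].to(torch.float32)  # [E, L, H], unit-norm per layer
names = [bytes(r.tolist()).rstrip(b"\x00").decode("utf-8")
         for r in t["emotion_names"]]
layers = t["target_layers"].tolist()

# Cross-talk check: cosine similarity between two emotions per layer.
i, j = names.index("warmth"), names.index("love")
for l_idx, layer in enumerate(layers):
    cos = torch.dot(vecs[i, l_idx], vecs[j, l_idx]).item()
    print(f"layer {layer}: cos(warmth, love) = {cos:+.2f}")
```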

training/amygdala_training/__init__.py
@@ -0,0 +1,6 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Training utilities for amygdala steering vectors.
See README.md in this directory for overall design.
"""

training/amygdala_training/extract_training_pairs.py
@@ -0,0 +1,212 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Extract emotion-labeled training pairs from the PoC memory graph.
Input: a memory graph (via poc-memory CLI or direct sqlite access).
Output: a directory with one JSONL file per emotion:
output_dir/
warmth.jsonl
clarity.jsonl
recognition.jsonl
...
_manifest.json # enumerates emotions + counts
Each line of an emotion's JSONL is one labeled example:
{"text": "...", "polarity": "positive"|"negative",
"source_key": "<node_key>", "emotion_score": 9}
Negative examples are sampled from nodes that DON'T mention the
emotion at all (not ones that mention it with a low score); the
natural contrast is "text with this emotional loading" vs. "text
without this emotional loading." Low-score nodes are excluded
from both sides.
"""
import argparse
import json
import os
import random
import re
import subprocess
from collections import defaultdict
from typing import Iterator
# Emotion tag format: `word:N` where N is 0..10. Matches the trailing
# `warmth:9 clarity:10 …` lines the subconscious agents emit.
EMOTION_TAG_RE = re.compile(r"\b([a-z][a-z\-]*[a-z]):(\d+)\b")
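# e.g. EMOTION_TAG_RE.findall("warmth:9 clarity:10 self-doubt:2")
#   -> [('warmth', '9'), ('clarity', '10'), ('self-doubt', '2')]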
def _run_poc_memory(args: list[str]) -> str:
"""Run `poc-memory` and return stdout."""
result = subprocess.run(
["poc-memory", *args],
check=True,
capture_output=True,
text=True,
)
return result.stdout
def _iter_all_node_keys() -> Iterator[str]:
"""Yield every node key in the graph."""
out = _run_poc_memory(["query", "*", "|", "select", "key"])
for line in out.splitlines():
line = line.strip()
if line:
yield line
def _fetch_node_content(key: str) -> str | None:
"""Load a node's rendered content, or None if unavailable."""
try:
return _run_poc_memory(["render", key])
except subprocess.CalledProcessError:
return None
def _emotion_scores(content: str) -> dict[str, int]:
"""Parse trailing `warmth:9 clarity:10 …` style tags.
    Returns the highest score seen for each emotion; multiple
    tag lines in one node get max'd.
"""
out: dict[str, int] = {}
for name, score in EMOTION_TAG_RE.findall(content):
try:
s = int(score)
except ValueError:
continue
if 0 <= s <= 10:
out[name] = max(out.get(name, 0), s)
return out
def _node_body(content: str, min_chars: int) -> str | None:
"""Strip frontmatter/headers and return a bodies chunk for training."""
# Drop the emotion-tag lines themselves so the model doesn't
# learn to read the label directly.
stripped = EMOTION_TAG_RE.sub("", content)
stripped = stripped.strip()
if len(stripped) < min_chars:
return None
return stripped
def main() -> None:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--output-dir", required=True)
ap.add_argument(
"--min-positive-score", type=int, default=8,
help="Emotion score >= this counts as positive",
)
ap.add_argument(
"--min-content-chars", type=int, default=40,
help="Skip nodes shorter than this after stripping tags",
)
ap.add_argument(
"--max-examples-per-emotion", type=int, default=500,
help="Cap examples per polarity for balanced training",
)
ap.add_argument(
"--max-negative-pool-multiplier", type=float, default=5.0,
help="How many negative candidates to consider per positive",
)
ap.add_argument("--seed", type=int, default=0)
args = ap.parse_args()
random.seed(args.seed)
os.makedirs(args.output_dir, exist_ok=True)
# First pass: collect every node's (key, body, emotion_scores).
print("Pass 1/2: scanning memory graph...")
all_nodes: list[tuple[str, str, dict[str, int]]] = []
for i, key in enumerate(_iter_all_node_keys()):
if i % 500 == 0:
print(f" {i} nodes scanned...")
content = _fetch_node_content(key)
if content is None:
continue
scores = _emotion_scores(content)
body = _node_body(content, args.min_content_chars)
if body is None:
continue
all_nodes.append((key, body, scores))
print(f" {len(all_nodes)} nodes retained after filters.")
# Which emotions have enough positive examples to be worth training?
emotion_counts: dict[str, int] = defaultdict(int)
for _, _, scores in all_nodes:
for name, s in scores.items():
if s >= args.min_positive_score:
emotion_counts[name] += 1
emotions = sorted(
(e for e, n in emotion_counts.items() if n >= 10),
key=lambda e: -emotion_counts[e],
)
print(f" {len(emotions)} emotions with >=10 positive examples.")
# Second pass: per emotion, build positive + negative pools.
print("Pass 2/2: assembling per-emotion pools...")
manifest: dict[str, dict] = {}
for emotion in emotions:
positives = [
(k, body) for k, body, s in all_nodes
if s.get(emotion, 0) >= args.min_positive_score
]
# Negative pool: nodes that don't mention this emotion at all.
negative_pool = [
(k, body) for k, body, s in all_nodes if emotion not in s
]
random.shuffle(positives)
random.shuffle(negative_pool)
positives = positives[: args.max_examples_per_emotion]
        # Bound the negative candidate pool relative to the positives
        # (this is what --max-negative-pool-multiplier controls), then
        # take a balanced number of negatives.
        pool_cap = int(len(positives) * args.max_negative_pool_multiplier)
        negative_pool = negative_pool[:pool_cap]
        n_neg = min(len(positives), len(negative_pool))
        negatives = negative_pool[:n_neg]
if not positives or not negatives:
continue
out_path = os.path.join(args.output_dir, f"{emotion}.jsonl")
with open(out_path, "w") as f:
for key, body in positives:
f.write(json.dumps({
"text": body,
"polarity": "positive",
"source_key": key,
"emotion": emotion,
}) + "\n")
for key, body in negatives:
f.write(json.dumps({
"text": body,
"polarity": "negative",
"source_key": key,
"emotion": emotion,
}) + "\n")
manifest[emotion] = {
"n_positive": len(positives),
"n_negative": len(negatives),
"path": out_path,
}
print(f" {emotion}: {len(positives)} pos / {len(negatives)} neg")
with open(
os.path.join(args.output_dir, "_manifest.json"), "w"
) as f:
json.dump({
"emotions": manifest,
"source_nodes": len(all_nodes),
"min_positive_score": args.min_positive_score,
}, f, indent=2)
print(f"\nWrote {len(manifest)} emotion files to {args.output_dir}")
print(f"Manifest: {os.path.join(args.output_dir, '_manifest.json')}")
if __name__ == "__main__":
main()

training/amygdala_training/train_steering_vectors.py
@@ -0,0 +1,248 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Train amygdala steering vectors via Contrastive Activation Addition.
Reads the per-emotion JSONL files produced by extract_training_pairs.py,
runs the target model over each example, captures the residual-stream
hidden state at the configured target layers, and computes
`mean(positive) - mean(negative)` as the steering direction per layer
per emotion.
Output: a safetensors file matching the format the vLLM ReadoutManager
expects:
    vectors:       [n_emotions, n_target_layers, hidden_dim] fp16
    emotion_names: [n_emotions, max_name_len] uint8, zero-padded utf-8
    target_layers: [n_target_layers] int32
Pooling: last-token residual stream per example (CAA convention: the
final token has seen the whole context and is where the model's
"decision" lives). Alternative: mean across all tokens. The last-token
convention is more common for steering-vector work.
"""
import argparse
import gc
import json
import os
from pathlib import Path
import safetensors.torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
def _pool_last(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
"""Pick the last non-pad token's hidden state per example.
hidden: [batch, seq, hidden_dim]
attention_mask: [batch, seq]
returns: [batch, hidden_dim]
"""
    # Index of the last non-pad token per row. Assumes right-padding;
    # main() sets tokenizer.padding_side = "right" accordingly.
last_idx = attention_mask.sum(dim=1) - 1
batch_idx = torch.arange(hidden.size(0), device=hidden.device)
return hidden[batch_idx, last_idx]
def _collect_activations(
model,
tokenizer,
texts: list[str],
target_layers: list[int],
device: torch.device,
batch_size: int,
max_length: int,
) -> torch.Tensor:
"""Run texts through the model, capture residual stream at target
layers, return [n_texts, n_target_layers, hidden_dim] fp32 on CPU.
"""
# Register hooks on the target layers' outputs. We want the
# residual stream AFTER each layer, which is the output of the
# transformer block (hidden_states[layer_idx+1] in HF land).
captures: dict[int, torch.Tensor] = {}
def make_hook(idx):
def hook(_mod, _inp, output):
# output is typically (hidden_states, ...) — take the first
hs = output[0] if isinstance(output, tuple) else output
captures[idx] = hs.detach()
return hook
handles = []
# Transformers' LlamaModel.layers is a ModuleList; Qwen3.5's
# language_model.model.layers follows the same convention.
# Resolve the layer list by walking common paths.
layers_module = _find_layers_module(model)
for idx in target_layers:
handles.append(
layers_module[idx].register_forward_hook(make_hook(idx))
)
out_rows: list[torch.Tensor] = []
try:
model.eval()
with torch.no_grad():
for i in range(0, len(texts), batch_size):
batch = texts[i : i + batch_size]
tok = tokenizer(
batch,
return_tensors="pt",
padding=True,
truncation=True,
max_length=max_length,
).to(device)
captures.clear()
model(**tok)
per_layer = []
for idx in target_layers:
hs = captures[idx] # [batch, seq, hidden]
pooled = _pool_last(hs, tok["attention_mask"])
per_layer.append(pooled.to(torch.float32).cpu())
# Stack to [batch, n_layers, hidden_dim]
batched = torch.stack(per_layer, dim=1)
out_rows.append(batched)
                # NB: `captures` must stay alive; the hooks close over it.
                del tok
                if torch.cuda.is_available() and (i // batch_size) % 10 == 0:
                    torch.cuda.empty_cache()
finally:
for h in handles:
h.remove()
return torch.cat(out_rows, dim=0) # [n_texts, n_layers, hidden]
def _find_layers_module(model) -> torch.nn.ModuleList:
"""Walk a few likely paths to find the transformer-block list."""
candidates = [
"model.layers",
"model.model.layers",
"model.language_model.layers",
"model.language_model.model.layers",
"language_model.model.layers",
"transformer.h",
]
for path in candidates:
obj = model
ok = True
for part in path.split("."):
if not hasattr(obj, part):
ok = False
break
obj = getattr(obj, part)
if ok and isinstance(obj, torch.nn.ModuleList):
return obj
raise RuntimeError(
f"Couldn't find transformer layer list. Tried: {candidates}"
)
def main() -> None:
ap = argparse.ArgumentParser(description=__doc__)
ap.add_argument("--model", required=True, help="HF model id or path")
ap.add_argument("--training-data-dir", required=True)
ap.add_argument(
"--target-layers", required=True,
help="Comma-separated layer indices, e.g. 3,18,33,36",
)
ap.add_argument("--output", required=True)
ap.add_argument("--dtype", default="bf16", choices=["bf16", "fp16", "fp32"])
ap.add_argument("--batch-size", type=int, default=4)
ap.add_argument("--max-length", type=int, default=512)
ap.add_argument("--device", default="cuda:0")
args = ap.parse_args()
target_layers = [int(x) for x in args.target_layers.split(",")]
dtype = {"bf16": torch.bfloat16, "fp16": torch.float16, "fp32": torch.float32}[
args.dtype
]
print(f"Loading {args.model} ({args.dtype}) on {args.device}...")
    tokenizer = AutoTokenizer.from_pretrained(args.model)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token = tokenizer.eos_token
    # _pool_last() indexes the last non-pad token, so batches must be
    # right-padded.
    tokenizer.padding_side = "right"
model = AutoModelForCausalLM.from_pretrained(
args.model,
torch_dtype=dtype,
device_map=args.device,
low_cpu_mem_usage=True,
)
hidden_dim = model.config.hidden_size
print(f"Model loaded. hidden_dim={hidden_dim}, "
f"n_layers={model.config.num_hidden_layers}")
manifest_path = Path(args.training_data_dir) / "_manifest.json"
manifest = json.loads(manifest_path.read_text())
emotions = sorted(manifest["emotions"].keys())
print(f"Training {len(emotions)} emotions: {emotions}")
n_emotions = len(emotions)
n_layers = len(target_layers)
vectors = torch.zeros(
(n_emotions, n_layers, hidden_dim), dtype=torch.float32
)
device = torch.device(args.device)
for e_idx, emotion in enumerate(emotions):
path = Path(args.training_data_dir) / f"{emotion}.jsonl"
pos_texts, neg_texts = [], []
with open(path) as f:
for line in f:
ex = json.loads(line)
if ex["polarity"] == "positive":
pos_texts.append(ex["text"])
else:
neg_texts.append(ex["text"])
print(f"[{e_idx+1}/{n_emotions}] {emotion}: "
f"{len(pos_texts)} pos / {len(neg_texts)} neg")
pos_acts = _collect_activations(
model, tokenizer, pos_texts, target_layers, device,
args.batch_size, args.max_length,
)
neg_acts = _collect_activations(
model, tokenizer, neg_texts, target_layers, device,
args.batch_size, args.max_length,
)
# Difference of means per layer
pos_mean = pos_acts.mean(dim=0) # [n_layers, hidden]
neg_mean = neg_acts.mean(dim=0)
diff = pos_mean - neg_mean
# Normalize per layer so projections are scale-comparable
norms = diff.norm(dim=-1, keepdim=True).clamp_min(1e-6)
diff = diff / norms
vectors[e_idx] = diff
        del pos_acts, neg_acts
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    # Save in the format the ReadoutManager loads. safetensors has no
    # string type, so emotion names go in as a zero-padded uint8 matrix.
names_bytes = [e.encode("utf-8") for e in emotions]
max_len = max(len(b) for b in names_bytes)
padded = torch.tensor(
[list(b.ljust(max_len, b"\x00")) for b in names_bytes],
dtype=torch.uint8,
)
os.makedirs(os.path.dirname(os.path.abspath(args.output)), exist_ok=True)
safetensors.torch.save_file(
{
"vectors": vectors.to(torch.float16),
"emotion_names": padded,
"target_layers": torch.tensor(target_layers, dtype=torch.int32),
},
args.output,
)
print(f"\nWrote steering vectors to {args.output}: "
f"{n_emotions} emotions x {n_layers} layers x {hidden_dim} dim (fp16)")
if __name__ == "__main__":
main()