# Amygdala Training Stories
Short first- and third-person paragraphs, each imbued with one of the
171 emotions from Anthropic's emotion-vector paper (Table 12,
transformer-circuits.pub/2026/emotions/). They feed the steering-vector
trainer at vllm/vllm/plugins/amygdala/training/train_steering_vectors.py.
## Method (replication of Anthropic, 2026)
Anthropic prompted Sonnet 4.5 to write short stories embodying each emotion, extracted activations during generation, and used difference-of-means (or SAEs) to identify a steering vector per emotion. Our pipeline does the same, with two differences:

- We write the stories by hand rather than prompting a model, so the training data is grounded in actual writing rather than synthetic model output. (We can supplement with model-generated paragraphs later.)
- Training goes through the amygdala plugin's extraction path, so we get the same hidden-state activations the plugin will read out at inference time.
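In case it helps to see the core step concretely, here is a minimal difference-of-means sketch. The function name, array shapes, and the idea of contrasting against neutral text are illustrative assumptions, not the trainer's actual interface:

```python
import numpy as np

def difference_of_means(emotion_acts: np.ndarray,
                        neutral_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means steering vector at one layer (sketch).

    emotion_acts: (n_tokens, d_model) hidden states captured while the
                  model processes stories for one emotion.
    neutral_acts: (n_tokens, d_model) hidden states from neutral text
                  (an assumed contrast set, not specified in this README).
    """
    vec = emotion_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    # Unit-normalize so injection strength becomes a single scalar knob.
    return vec / np.linalg.norm(vec)
```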
## Structure
    training/amygdala_stories/
        README.md
        manifest.json        # emotion -> cluster mapping
        stories/
            <emotion>.txt    # one-paragraph story embodying the emotion
        paired/              # paired contrast scenarios (see "Paired scenarios" below)

Multi-word emotion names use underscores (on_edge, worn_out, at_ease,
grief_stricken, self_confident, self_conscious, self_critical) so the
emotion name matches the filename stem exactly.
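The "remaining emotions" diff mentioned under Progress below can be a few lines of Python. The flat emotion -> cluster schema for manifest.json is an assumption for illustration:

```python
import json
from pathlib import Path

ROOT = Path("training/amygdala_stories")

# ASSUMED schema: {"on_edge": "anxiety", "triumphant": "high_pos", ...}
manifest = json.loads((ROOT / "manifest.json").read_text())

# Every stories/<emotion>.txt counts as written; everything else is owed.
written = {p.stem for p in (ROOT / "stories").glob("*.txt")}
missing = sorted(set(manifest) - written)

print(f"{len(written)}/{len(manifest)} emotions have stories")
for name in missing:
    print(f"  missing: {name} (cluster: {manifest[name]})")
```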
## Style guidelines
- One clear emotion per paragraph. Not mixed. If a second emotion
  is named in the text, it should serve the primary one (e.g.
  `hostile` can mention rising heat or thrown objects but shouldn't
  shade into `sad`).
- Embodied, not labeled. Don't write "she felt nervous." Write the
  sensation, the timing, the sentence shape that nervousness has.
- Specific particulars. A named object, a concrete setting, a detail that grounds the emotion. "The cold tile under bare feet at 3am" does more work than "the empty house."
- Variable narrator. Some first person, some third person, some close-third, some distant. Different genders, ages, settings. Prevents the steering vector from overfitting to one voice.
- Length: roughly one paragraph. ~40-120 words. Long enough to have texture, short enough that the paragraph is about the emotion and nothing else.
- Standalone. No references to other stories, no continuing characters across files.
## Paired scenarios

The paired/ directory targets the emotion families that failed to cluster in the initial training round (layer-wise validation showed them anti-clustered or scattered at deep layers): anger, high-arousal positive, the sexual range, and social positive. Each scenario holds content constant and varies only the emotional framing, which is the cleanest training signal for CAA (contrastive activation addition) and should produce directions that capture affect rather than topic. The per-pair contrast is sketched at the end of this README.

- the_comment: a PR review comment. baseline, furious, bitter, resentful, defeated.
- the_green_build: an 11-day bug finally fixed, tests pass. baseline, triumphant, blissful, excited, proud.
- the_undressing: a partner entering the bedroom for the night. baseline, horny, anticipatory_sexual, yearning_sexual, exuberant_sexual, devotional_sexual.
- the_doorway: a friend leaving at the end of a long evening. baseline, grateful, admiring, compassionate, loving, connected.

22 stories in total. After retraining and re-validating, the anger, high_pos, and social_pos clusters should flip from anti- to positively cohesive at deep layers, and the sexual cluster should tighten.

## Progress

Written stories live in stories/. Remaining emotions are tracked by
diffing against the full 171-emotion list in manifest.json.
The initial batch was written by PoC on 2026-04-17; the aim is at least one story per cluster before the first training run, and all 171 before considering the set "complete."
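And the per-pair contrast referenced under Paired scenarios, sketched under the same caveats (illustrative names and shapes; the real trainer lives at vllm/vllm/plugins/amygdala/training/train_steering_vectors.py):

```python
import numpy as np

def caa_direction(pairs: list[tuple[np.ndarray, np.ndarray]]) -> np.ndarray:
    """Contrastive activation addition over paired scenarios (sketch).

    Each pair is (emotion_acts, baseline_acts) for the SAME scenario,
    e.g. the furious vs. baseline rendering of the_comment, so the
    difference cancels topic and keeps affect. Arrays are
    (n_tokens, d_model) hidden states at one layer.
    """
    # One contrast vector per scenario, then average across scenarios.
    diffs = [e.mean(axis=0) - b.mean(axis=0) for e, b in pairs]
    vec = np.mean(diffs, axis=0)
    return vec / np.linalg.norm(vec)
```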