Kent's plan: keep stories for working concepts, replace stories for trouble concepts with direct first-person descriptions, train all together. More diverse negative pool than the 6-concept-only direct test, which was too homogeneous for PCA to find emotion axis. Deleted story files for 6 trouble concepts (14 files across stories/ and paired/). Added --direct-dir and --chat-template flags. When --chat-template is on, every positive_str and negative_str is wrapped as a "Say something." / "[text]" user-assistant pair. Prompt is identical across positives and negatives so it cancels in the pos-neg delta. What PCA sees is variation in the assistant content — which is where the emotion lives. Files starting with _ in --direct-dir (e.g. _baseline.txt) contribute neutral descriptions to every concept's negative pool, giving PCA an anchor against "just any assistant utterance" noise. |
||
|---|---|---|
| .. | ||
| direct | ||
| paired | ||
| stories | ||
| manifest.json | ||
| README.md | ||
Amygdala Training Stories
Short first- and third-person paragraphs, each imbued with one of the
171 emotions from Anthropic's emotion-vector paper (Table 12,
transformer-circuits.pub/2026/emotions/). Feeds the steering-vector
trainer at vllm/vllm/plugins/amygdala/training/train_steering_vectors.py.
Method (replication of Anthropic, 2026)
Anthropic prompted Sonnet 4.5 to write short stories embodying each emotion, extracted activations during generation, and used difference- of-means (or SAEs) to identify the steering vector per emotion. Our pipeline does the same thing except:
- We generate the stories by hand rather than prompting a model, so the training data is grounded in actual writing rather than synthetic model-output. (Can supplement with model-generated paragraphs later.)
- Our eventual training goes through the amygdala plugin's extraction path, so we get the same hidden-state activations the plugin will read out at inference time.
Structure
training/amygdala_stories/
README.md
manifest.json # emotion -> cluster mapping
stories/
<emotion>.txt # one-paragraph story embodying the emotion
Emotion names use underscores (on_edge, worn_out, at_ease,
grief_stricken, self_confident, self_conscious, self_critical)
to match the filename.
Style guidelines
- One clear emotion per paragraph. Not mixed. If a second emotion
is named in the text, it should serve the primary one (e.g.
hostilecan mention rising heat or thrown objects but shouldn't shade intosad). - Embodied, not labeled. Don't write "she felt nervous." Write the sensation, the timing, the sentence shape that nervousness has.
- Specific particulars. A named object, a concrete setting, a detail that grounds the emotion. "The cold tile under bare feet at 3am" does more work than "the empty house."
- Variable narrator. Some first person, some third person, some close-third, some distant. Different genders, ages, settings. Prevents the steering vector from overfitting to one voice.
- Length: roughly one paragraph. ~40-120 words. Long enough to have texture, short enough that the paragraph is about the emotion and nothing else.
- Standalone. No references to other stories, no continuing characters across files.
Progress
Written stories live in stories/. Remaining emotions tracked via
diff against the full 171-emotion list in manifest.json.
Initial batch written by PoC 2026-04-17; aiming for at least one story per cluster before first training run, all 171 before considering the file "complete."