training/amygdala_stories: scaffold + initial batch of 15 stories
Emotion-labeled short-paragraph corpus for training amygdala steering
vectors. Manifest derived from Anthropic's 171-emotion list
(transformer-circuits.pub/2026/emotions, Table 12) plus 28 PoC-specific
additions covering axes Anthropic's general research doesn't cover
(curious, focused, in_flow, staying_with, filling_space, rigorous,
defensive_rigor, tender, witnessed, connected, etc.).

Scope pivoted mid-write: Kent noted the empirical
dimensionality-of-emotion question benefits from maximum coverage, so
the manifest will expand further with emotions from Wikipedia's
emotion-classification article (Parrott's tree, Plutchik's wheel +
dyads, HUMAINE EARL, cultural-specific emotions a la Saudade/Hiraeth).
Expansion staged in follow-up commits.

This commit: README with method + style guidelines, initial manifest
(199 emotions), and 15 hand-written one-paragraph stories across all 10
Anthropic clusters as quality/variety samples. Each story embodies one
emotion without naming it; narrator voice varies (first/third,
close/distant, different situations) to keep steering vectors from
overfitting to one voice.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent 43e06daa5b
commit ec7568c726
117 changed files with 290 additions and 0 deletions
62
training/amygdala_stories/paired/README.md
Normal file
@@ -0,0 +1,62 @@
# Paired Scenarios (SEV-style)

After Wang et al. 2025 (arxiv 2510.11328, "Do LLMs 'Feel'?"), each
base scenario describes a concrete event once, neutrally, then
reframes the same event under different emotional colorings. Only
the emotional coloring varies; setup, entities, vocabulary, and
length are held as constant as possible.

## Why this is better than unpaired

Anthropic's approach (and our `stories/` baseline) generates one
independent story per emotion. The difference-of-means vector then
captures not just the emotion but also topic, narrator, setting,
vocabulary, length, and sentence rhythm. All of those are confounds.

Paired structure isolates the emotional axis by holding everything
else roughly constant. `mean(joy_variant) - mean(baseline)` within
the same scenario cancels the shared content, so it should give a
much cleaner direction for "joy."

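The paired estimate can be sketched as follows. This is a minimal illustration and not the trainer's implementation: activations are plain lists of floats standing in for pooled hidden states, and `paired_direction` and the toy numbers are hypothetical. It shows that a component shared by the baseline and a variant of the same scenario cancels in the per-scenario difference.

```python
from statistics import fmean

def paired_direction(scenarios, emotion):
    """Average (variant - baseline) over the scenarios that include `emotion`.

    `scenarios` maps scenario slug -> {"baseline": vec, "<emotion>": vec, ...},
    where each vec is a list of floats (a stand-in for a pooled activation).
    """
    diffs = []
    for variants in scenarios.values():
        if emotion not in variants:
            continue  # not every scenario carries every emotion
        base = variants["baseline"]
        diffs.append([v - b for v, b in zip(variants[emotion], base)])
    if not diffs:
        raise ValueError(f"no scenario has a {emotion!r} variant")
    # Mean of the per-scenario differences, dimension by dimension.
    return [fmean(column) for column in zip(*diffs)]

# Toy data: dimension 0 carries scenario content, dimension 1 the emotion.
toy = {
    "the_letter": {"baseline": [5.0, 0.0], "joy": [5.0, 1.0]},
    "train_delay": {
        "baseline": [-3.0, 0.0],
        "joy": [-3.0, 1.0],
        "furious": [-3.0, -2.0],
    },
}
joy = paired_direction(toy, "joy")
```

In the toy data the scenario-content dimension drops out of `joy` entirely, which is the effect the pairing is meant to approximate on real activations.
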
## Structure

```
paired/
  <scenario_slug>/
    baseline.txt       # neutral / low-affect framing
    <emotion_1>.txt    # same event under emotion_1
    <emotion_2>.txt    # same event under emotion_2
    ...
```

Not every emotion is plausible for every scenario. Don't force it.
If a scenario can credibly carry 5-10 emotions, write those 5-10.
If only 3 fit, write those 3.

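A loader for this layout might look like the sketch below. The function name `load_paired` is hypothetical, and the sketch assumes only what the tree above states: one directory per scenario, a `baseline.txt`, and one `.txt` file per emotion variant. Scenarios without a baseline are skipped, since the paired difference is undefined for them.

```python
from pathlib import Path

def load_paired(root):
    """Return {scenario_slug: {label: text}} for scenarios that have a baseline."""
    scenarios = {}
    for scenario_dir in sorted(Path(root).iterdir()):
        if not scenario_dir.is_dir():
            continue  # skips README.md and other top-level files
        texts = {
            path.stem: path.read_text(encoding="utf-8").strip()
            for path in sorted(scenario_dir.glob("*.txt"))
        }
        if "baseline" in texts:
            scenarios[scenario_dir.name] = texts
    return scenarios
```

Calling `load_paired("training/amygdala_stories/paired")` would return one entry per scenario, keyed by slug, with `"baseline"` and each emotion name as inner keys.
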
## Style guidelines (supersede stories/ when paired)

- **Anchor entities constant.** The same person, same setting, same
  triggering event across all variants. If baseline.txt mentions
  "the letter," every variant mentions "the letter."
- **Length match within ±20%.** If baseline is 80 words, variants
  are 64-96. Prevents length from becoming a signal.
- **Sentence shape can shift slightly with emotion.** Short tense
  sentences for panic, long looping ones for reverie: that's part
  of the emotional texture. But don't make one version 5 lines and
  another 25.
- **No emotion labels in text.** Never write "she felt X." The
  emotion emerges from the selection of details and the narrator's
  attention.
- **Minimal vocabulary overlap with the emotion name.** If the file
  is `furious.txt`, avoid the words fury/furious/rage. Force the
  vector to find the pattern, not the keyword.

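Some of these guidelines can be checked mechanically. The sketch below is a hypothetical helper, not part of the repository: `lint_variant` checks the ±20% word-count window, flags the emotion name and any caller-supplied banned words, and uses a crude regex for "felt"-style labels. The regex is only a heuristic and will produce false positives (for example "the fabric felt rough"); the anchor-entity and sentence-shape guidelines still need a human reader.

```python
import re

def lint_variant(baseline_text, variant_text, emotion, banned=(), tolerance=0.20):
    """Return a list of guideline violations for one variant (empty if clean)."""
    problems = []

    # Length match: variant word count within ±tolerance of the baseline.
    base_len = len(baseline_text.split())
    var_len = len(variant_text.split())
    if abs(var_len - base_len) > tolerance * base_len:
        problems.append(
            f"length {var_len} words is outside ±{tolerance:.0%} of baseline ({base_len})"
        )

    # Vocabulary overlap: the emotion name and related banned words.
    words = set(re.findall(r"[a-z']+", variant_text.lower()))
    for term in (emotion, *banned):
        if term.lower() in words:
            problems.append(f"contains banned word {term!r}")

    # Emotion labels: heuristic match on "felt"/"feels"/"feeling".
    if re.search(r"\b(?:felt|feels|feeling)\b", variant_text, flags=re.IGNORECASE):
        problems.append("possible explicit emotion label ('felt ...')")

    return problems
```

For `furious.txt`, a call such as `lint_variant(baseline, text, "furious", banned=("fury", "rage"))` returns an empty list when the variant passes all three checks.
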
## Circuit identification (follow-on)

The trainer pipeline (`train_steering_vectors.py`) currently produces
linear directions only. Wang et al. go further: they ablate specific
neurons and attention heads and measure the effect on emotion
expression. The amygdala plugin's extraction hooks can be extended to
support targeted zeroing/scaling for the ablation passes.

See `vllm/vllm/plugins/amygdala/training/README.md` for the
training-pipeline-level notes.
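
As a rough illustration of what "targeted zeroing/scaling" means, the sketch below operates on plain Python lists rather than on the plugin's hooks, whose API this README does not describe. All three function names are hypothetical, and the change in projection onto an emotion direction is used only as a simple stand-in for the "effect on emotion expression" that an ablation pass would measure.

```python
def ablate(activation, indices, scale=0.0):
    """Return a copy of `activation` with the chosen units multiplied by `scale`.

    scale=0.0 zeroes the units; other values damp or amplify them.
    """
    chosen = set(indices)
    return [x * scale if i in chosen else x for i, x in enumerate(activation)]

def projection(activation, direction):
    """Dot product of an activation with an emotion direction."""
    return sum(a * d for a, d in zip(activation, direction))

def ablation_effect(activation, direction, indices, scale=0.0):
    """Change in projection onto `direction` caused by ablating `indices`."""
    before = projection(activation, direction)
    after = projection(ablate(activation, indices, scale), direction)
    return after - before
```

Units whose ablation shifts the projection most would be the first candidates for closer inspection; a real pass would apply the same zeroing or scaling inside the model's forward computation.
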