# Paired Scenarios (SEV-style)
Following Wang et al. 2025 (arXiv:2510.11328, "Do LLMs 'Feel'?"), each
base scenario describes a concrete event once, neutrally, then
reframes the same event under different emotional colorings. Only
the emotional coloring varies: setup, entities, vocabulary, and
length are held as constant as possible.
## Why this is better than unpaired
Anthropic's approach (and our `stories/` baseline) generates one
independent story per emotion. The difference-of-means vector then
captures not just emotion but also topic, narrator, setting,
vocabulary, length, and sentence rhythm. All of those are confounds.
Paired structure isolates the emotional axis by holding everything
else roughly constant. `mean(joy_variant) - mean(baseline)` within
the same scenario gives a much cleaner direction for "joy."
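
The within-scenario difference can be sketched as below. This is a minimal illustration, not the trainer's code: it assumes each text has already been reduced to a single activation vector (for example by averaging over token positions), and it assumes the per-scenario differences are then averaged across scenarios. The names `mean_vector` and `paired_direction` are hypothetical.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors (lists of floats)."""
    return [fmean(column) for column in zip(*vectors)]


def paired_direction(scenarios, emotion):
    """Average (variant - baseline) over the scenarios that carry `emotion`.

    `scenarios` maps scenario slug -> {"baseline": vec, "<emotion>": vec, ...}
    (an assumed in-memory layout, one vector per text).
    """
    diffs = []
    for variants in scenarios.values():
        if emotion not in variants:
            continue  # not every scenario carries every emotion
        baseline = variants["baseline"]
        diffs.append([v - b for v, b in zip(variants[emotion], baseline)])
    if not diffs:
        raise ValueError(f"no scenario has a {emotion!r} variant")
    return mean_vector(diffs)
```

Because the subtraction happens inside each scenario, anything shared by the baseline and the variant (topic, entities, setting) cancels before averaging.
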
## Structure
```
paired/
  <scenario_slug>/
    baseline.txt       # neutral / low-affect framing
    <emotion_1>.txt    # same event under emotion_1
    <emotion_2>.txt    # same event under emotion_2
    ...
```
Not every emotion is plausible for every scenario. Don't force.
If a scenario can credibly carry 5-10 emotions, write those 5-10.
If only 3 fit, write those 3.
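
A loader for this layout might look like the following sketch. It assumes UTF-8 text files and that a scenario without `baseline.txt` should be skipped. `load_paired` is a hypothetical helper, not part of the existing pipeline.

```python
from pathlib import Path


def load_paired(root):
    """Return {scenario_slug: {variant_name: text}} for a paired/ tree."""
    scenarios = {}
    for scenario_dir in sorted(Path(root).iterdir()):
        if not scenario_dir.is_dir():
            continue  # ignore stray files such as README.md
        variants = {
            path.stem: path.read_text(encoding="utf-8")
            for path in sorted(scenario_dir.glob("*.txt"))
        }
        if "baseline" not in variants:
            continue  # assumption: a scenario is unusable without its baseline
        scenarios[scenario_dir.name] = variants
    return scenarios
```

Each scenario ends up with only the emotions that were actually written for it, so downstream code has to tolerate missing variants rather than expect a full grid.
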
## Style guidelines (supersede stories/ when paired)
- **Anchor entities constant.** The same person, same setting, same
triggering event across all variants. If baseline.txt mentions
"the letter," every variant mentions "the letter."
- **Length match within ±20%.** If baseline is 80 words, variants
are 64-96. Prevents length from becoming a signal.
- **Sentence shape can shift slightly with emotion.** Short tense
sentences for panic, long looping ones for reverie — that's part
of the emotional texture. But don't make one version 5 lines and
another 25.
- **No emotion labels in text.** Never write "she felt X." The
emotion emerges from the selection of details and the narrator's
attention.
- **Minimal vocabulary overlap with the emotion name.** If the file
is `furious.txt`, avoid the words fury/furious/rage. Force the
vector to find the pattern, not the keyword.
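
The two mechanical guidelines (length match and keyword avoidance) could be checked automatically. The sketch below is one possible lint, assuming whitespace-separated word counts and a caller-supplied list of banned words per emotion file. `check_variant` and its parameters are hypothetical.

```python
import re


def check_variant(baseline_text, variant_text, banned_words, tolerance=0.20):
    """Return a list of guideline violations for one variant (empty if clean)."""
    problems = []
    base_len = len(baseline_text.split())
    var_len = len(variant_text.split())
    # Length guideline: variant word count within +/- tolerance of baseline.
    if abs(var_len - base_len) > tolerance * base_len:
        problems.append(
            f"length {var_len} words is more than {tolerance:.0%} "
            f"away from baseline ({base_len} words)"
        )
    # Keyword guideline: whole-word match, case-insensitive.
    tokens = set(re.findall(r"[a-z']+", variant_text.lower()))
    for word in banned_words:
        if word.lower() in tokens:
            problems.append(f"contains banned word {word!r}")
    return problems
```

For `furious.txt`, the caller would pass something like `("fury", "furious", "rage")`. The remaining guidelines (constant anchor entities, no emotion labels) need human review.
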
## Circuit identification (follow-on)
The trainer pipeline (`train_steering_vectors.py`) currently produces
linear directions only. Wang et al. go further: they ablate specific
neurons and attention heads and measure the effect on emotion
expression. The amygdala plugin's extraction hooks can be extended to
support targeted zeroing/scaling for the ablation passes.
See `vllm/vllm/plugins/amygdala/training/README.md` for the
training-pipeline-level notes.
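
As a rough illustration of what targeted zeroing/scaling means, the sketch below applies it to a single activation vector held as a plain Python list. The plugin's actual hook interface is not shown in this README, so `scale_units` is a hypothetical stand-in; a real ablation pass would work on the model's tensors inside a hook.

```python
def scale_units(activations, unit_indices, factor=0.0):
    """Return a copy of `activations` with the chosen units multiplied by `factor`.

    factor=0.0 zeroes (ablates) the units; other values dampen or amplify them.
    """
    chosen = set(unit_indices)
    return [a * factor if i in chosen else a for i, a in enumerate(activations)]
```

Comparing emotion expression with and without the modified activations is what would attribute an effect to the selected units.
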