consciousness/training/amygdala_stories/paired/README.md

# Paired Scenarios (SEV-style)

After Wang et al. 2025 (arXiv:2510.11328, "Do LLMs 'Feel'?"), each
base scenario describes a concrete event once, neutrally, then
reframes the same event under different emotional colorings. Only
the emotional coloring varies — setup, entities, vocabulary, and
length are held as constant as possible.

## Why this is better than unpaired

Anthropic's approach (and our `stories/` baseline) generates one
independent story per emotion. The difference-of-means vector then
captures not just emotion but ALSO: topic, narrator, setting,
vocabulary, length, and sentence rhythm. All of that is a confound.

Paired structure isolates the emotional axis by holding everything
else roughly constant. `mean(joy_variant) - mean(baseline)` within
the same scenario gives a much cleaner direction for "joy."
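
A minimal sketch of the within-scenario difference, assuming each text
has already been reduced to a single activation vector (for example, a
mean over tokens at one layer). The `paired_direction` helper and the
nested-dict layout are hypothetical and are not taken from
`train_steering_vectors.py`.

```python
import numpy as np

def paired_direction(acts, emotion):
    """Average (emotion - baseline) over scenarios that have both files.

    acts: {scenario_slug: {"baseline": vec, "<emotion>": vec, ...}}
    (hypothetical layout; each vec is a 1-D activation vector).
    """
    diffs = [
        np.asarray(v[emotion], dtype=float) - np.asarray(v["baseline"], dtype=float)
        for v in acts.values()
        if emotion in v and "baseline" in v
    ]
    if not diffs:
        raise ValueError(f"no scenario pairs {emotion!r} with a baseline")
    return np.mean(diffs, axis=0)

# Toy data: dimension 0 carries a scenario-specific offset (topic, setting),
# dimension 1 carries a shift shared by the "joy" variants.
acts = {
    "letter": {"baseline": [1.0, 0.0], "joy": [1.0, 2.0]},
    "train": {"baseline": [5.0, 1.0], "joy": [5.0, 3.0]},
}
direction = paired_direction(acts, "joy")
```

Because the subtraction happens inside each scenario, the
scenario-specific offset in dimension 0 cancels before averaging, and
only the shared shift in dimension 1 survives.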

## Structure

```
paired/
    <scenario_slug>/
        baseline.txt       # neutral / low-affect framing
        <emotion_1>.txt    # same event under emotion_1
        <emotion_2>.txt    # same event under emotion_2
        ...
```

Not every emotion is plausible for every scenario. Don't force.
If a scenario can credibly carry 5-10 emotions, write those 5-10.
If only 3 fit, write those 3.
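
One way to read this layout, sketched with the standard library only.
`load_scenarios` is a hypothetical helper, not part of the trainer
pipeline; it assumes UTF-8 text files and skips any scenario that lacks
a `baseline.txt`, since that scenario cannot form pairs.

```python
from pathlib import Path

def load_scenarios(root):
    """Return {scenario_slug: {variant_name: text}} for scenarios with a baseline."""
    scenarios = {}
    for scenario_dir in sorted(Path(root).iterdir()):
        if not scenario_dir.is_dir():
            continue  # ignore stray files such as README.md
        variants = {
            f.stem: f.read_text(encoding="utf-8").strip()
            for f in sorted(scenario_dir.glob("*.txt"))
        }
        # Scenarios may carry any number of emotion files; only the
        # baseline is required.
        if "baseline" in variants:
            scenarios[scenario_dir.name] = variants
    return scenarios
```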

## Style guidelines (supersede stories/ when paired)

- **Anchor entities constant.** The same person, same setting, same
  triggering event across all variants. If baseline.txt mentions
  "the letter," every variant mentions "the letter."
- **Length match within ±20%.** If baseline is 80 words, variants
  are 64-96. Prevents length from becoming a signal.
- **Sentence shape can shift slightly with emotion.** Short tense
  sentences for panic, long looping ones for reverie — that's part
  of the emotional texture. But don't make one version 5 lines and
  another 25.
- **No emotion labels in text.** Never write "she felt X." The
  emotion emerges from the selection of details and the narrator's
  attention.
- **Minimal vocabulary overlap with the emotion name.** If the file
  is `furious.txt`, avoid the words fury/furious/rage. Force the
  vector to find the pattern, not the keyword.
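
The two mechanical rules above (length match and keyword avoidance) can
be checked automatically. The sketch below is a hypothetical lint
helper, not an existing script in this repo; the `banned` list of
related words is supplied by hand, and word counts use a plain
whitespace split.

```python
import re

def check_variant(baseline_text, variant_text, emotion, banned=(), tolerance=0.20):
    """Return a list of guideline violations for one variant (empty if clean)."""
    problems = []
    base_len = len(baseline_text.split())
    var_len = len(variant_text.split())
    low, high = base_len * (1 - tolerance), base_len * (1 + tolerance)
    if not low <= var_len <= high:
        problems.append(f"length {var_len} outside {low:.0f}-{high:.0f} words")
    words = set(re.findall(r"[a-z']+", variant_text.lower()))
    # The emotion name itself plus any hand-listed relatives (e.g. fury, rage).
    for word in sorted({emotion.lower(), *(b.lower() for b in banned)}):
        if word in words:
            problems.append(f"contains banned word {word!r}")
    return problems
```

For `furious.txt` this might be called as
`check_variant(baseline, text, "furious", banned=("fury", "rage"))`.
Matching is on whole words, so inflections such as "raging" have to be
listed explicitly. The other guidelines (constant anchors, no emotion
labels) still need a human read.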

## Circuit identification (follow-on)

The trainer pipeline (`train_steering_vectors.py`) currently produces
linear directions only. Wang et al. go further: they ablate specific
neurons and attention heads, then measure the effect on emotion
expression.
The amygdala plugin's extraction hooks can be extended to support
targeted zeroing/scaling for the ablation passes.
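
As a rough illustration of what "targeted zeroing/scaling" means at the
array level, here is a NumPy sketch. It does not use the amygdala
plugin's hook interface, which is not described here; `scale_units` and
the `(tokens, hidden_size)` shape are assumptions made for the example.

```python
import numpy as np

def scale_units(hidden, unit_indices, factor=0.0):
    """Return a copy of `hidden` with the selected last-axis units scaled.

    factor=0.0 zeroes the units (ablation); other values damp or amplify them.
    """
    out = np.array(hidden, dtype=float, copy=True)
    out[..., list(unit_indices)] *= factor
    return out

hidden = np.ones((2, 4))               # (tokens, hidden_size), toy values
ablated = scale_units(hidden, [1, 3])  # zero units 1 and 3 for every token
```

An ablation pass would apply something like this to a layer's output,
rerun the model, and compare emotion expression against the unmodified
run.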

See `vllm/vllm/plugins/amygdala/training/README.md` for the
training-pipeline-level notes.
