The library-PCA run produced otherwise-clean concept directions but cozy/sensual → resigned/grief_stricken with cos ~0.7-0.8.

Diagnosis: all four stories genuinely share 'solitary woman at home, slowed body, interior attention, domestic stillness' as their dominant phenomenology. PCA correctly finds that cluster as THE concept because no story in the corpus holds that setup constant while varying valence; every 'slowed-body domestic' story happens to ALSO be positive-valence (cozy/sensual) or negative-valence (resigned/grief_stricken).

Adding paired variants that hold setup constant:

- sunday_afternoon/resigned.txt: same couch + blanket, inner state is 'Monday is going to bring bad news, this is the last Sunday like this'
- sunday_afternoon/grief_stricken.txt: same couch + blanket, inner state is 'three weeks since mother died, cat she can't feel'
- waiting_for_results/at_ease.txt: same wait-for-call-setup as the existing resigned variant, inner state is calm preparedness

Forces the next retrain to find the valence-within-cluster axis as the emotion direction rather than the cluster-membership axis.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
| Name |
|---|
| finding_the_abstraction |
| finishing_the_patch |
| kitchen_at_3am |
| letter_in_drawer |
| park_after_rain |
| reading_unfamiliar_code |
| sunday_afternoon |
| the_comment |
| the_doorway |
| the_green_build |
| the_long_meeting |
| the_morning_commute |
| the_paper |
| the_undressing |
| the_writing_session |
| tracing_a_bug |
| waiting_for_results |
| README.md |
Paired Scenarios (SEV-style)
After Wang et al. 2025 (arxiv 2510.11328, "Do LLMs 'Feel'?"), each base scenario describes a concrete event once, neutrally, then reframes the same event under different emotional colorings. Only the emotional coloring varies — setup, entities, vocabulary, and length are held as constant as possible.
Why this is better than unpaired
Anthropic's approach (and our stories/ baseline) generates one
independent story per emotion. The difference-of-means vector then
captures not just emotion but ALSO: topic, narrator, setting,
vocabulary, length, sentence rhythm. All of that is a confound.
Paired structure isolates the emotional axis by holding everything
else roughly constant. mean(joy_variant) - mean(baseline) within
the same scenario gives a much cleaner direction for "joy."
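The paired difference described above can be sketched as follows. This is a minimal illustration, not the code in train_steering_vectors.py: the `scenarios` dictionary layout, the function name `paired_direction`, and the choice to average per-scenario differences and normalize the result are all assumptions made for the example.

```python
import numpy as np

def paired_direction(scenarios, emotion):
    """Average of (variant - baseline) over scenarios that carry `emotion`.

    `scenarios` maps scenario slug -> {variant name -> activations}, where
    each activation array has shape (n_tokens, hidden_dim). This layout is
    hypothetical and only serves the illustration.
    """
    diffs = []
    for variants in scenarios.values():
        if emotion not in variants or "baseline" not in variants:
            continue  # not every scenario carries every emotion
        # Mean over tokens gives one vector per text; the difference is
        # taken within the same scenario, so shared setup cancels out.
        diff = variants[emotion].mean(axis=0) - variants["baseline"].mean(axis=0)
        diffs.append(diff)
    if not diffs:
        raise ValueError(f"no scenario has both baseline and {emotion!r}")
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)
```

Because the subtraction happens inside each scenario before averaging, anything the baseline and the variant share (topic, setting, narrator) contributes nothing to the result, at least to the extent the two texts really do match.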
Structure
paired/
    <scenario_slug>/
        baseline.txt       # neutral / low-affect framing
        <emotion_1>.txt    # same event under emotion_1
        <emotion_2>.txt    # same event under emotion_2
        ...
Not every emotion is plausible for every scenario. Don't force. If a scenario can credibly carry 5-10 emotions, write those 5-10. If only 3 fit, write those 3.
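A loader for this layout might look like the sketch below. The function name `load_paired` is hypothetical, and treating a missing baseline.txt as an error is an assumption of the example, not something the existing tooling is known to do.

```python
from pathlib import Path

def load_paired(root):
    """Return {scenario_slug: {variant_name: text}} for a paired/ tree."""
    scenarios = {}
    for scenario_dir in sorted(Path(root).iterdir()):
        if not scenario_dir.is_dir():
            continue  # skip README.md and other top-level files
        # One entry per .txt file; the file stem is the variant name.
        variants = {
            p.stem: p.read_text(encoding="utf-8")
            for p in sorted(scenario_dir.glob("*.txt"))
        }
        if "baseline" not in variants:
            # Assumption: every scenario is expected to ship a baseline.
            raise ValueError(f"{scenario_dir.name}: missing baseline.txt")
        scenarios[scenario_dir.name] = variants
    return scenarios
```

Each scenario keeps whatever set of emotions it happens to have, so a scenario with three variants and one with ten load the same way.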
Style guidelines (supersede stories/ when paired)
- Anchor entities constant. The same person, same setting, same triggering event across all variants. If baseline.txt mentions "the letter," every variant mentions "the letter."
- Length match within ±20%. If baseline is 80 words, variants are 64-96. Prevents length from becoming a signal.
- Sentence shape can shift slightly with emotion. Short tense sentences for panic, long looping ones for reverie — that's part of the emotional texture. But don't make one version 5 lines and another 25.
- No emotion labels in text. Never write "she felt X." The emotion emerges from the selection of details and the narrator's attention.
- Minimal vocabulary overlap with the emotion name. If the file is furious.txt, avoid the words fury/furious/rage. Force the vector to find the pattern, not the keyword.
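Two of these guidelines (length match and keyword avoidance) are mechanical enough to check automatically. The sketch below is one possible lint, with a hypothetical function name and a caller-supplied banned-word list; whitespace-split word counts are an assumption about how length is measured. The other guidelines need a human reader.

```python
import re

def check_variant(baseline_text, variant_text, banned_words, tolerance=0.20):
    """Return a list of guideline violations for one variant (empty if none)."""
    problems = []
    base_len = len(baseline_text.split())
    var_len = len(variant_text.split())
    # Length match: variant word count must stay within the tolerance band.
    if abs(var_len - base_len) > tolerance * base_len:
        problems.append(f"length: {var_len} words vs baseline {base_len}")
    # Keyword check: whole-word, case-insensitive match on each banned word.
    for word in banned_words:
        if re.search(rf"\b{re.escape(word)}\b", variant_text, re.IGNORECASE):
            problems.append(f"keyword: {word!r} appears in text")
    return problems
```

For furious.txt the caller would pass something like `["fury", "furious", "rage"]`; the list is supplied by hand because deriving related words from the file name reliably is harder than it looks.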
Circuit identification (follow-on)
The trainer pipeline (train_steering_vectors.py) currently produces linear directions only. Wang et al. go further: they ablate specific neurons and attention heads and measure the effect on emotion expression. The amygdala plugin's extraction hooks can be extended to support targeted zeroing/scaling for the ablation passes.
See vllm/vllm/plugins/amygdala/training/README.md for the
training-pipeline-level notes.
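As a rough picture of what targeted zeroing/scaling means, the sketch below operates on a plain activation array. It does not use the amygdala plugin's hooks or reproduce Wang et al.'s procedure: both function names are hypothetical, and the change in mean projection onto a direction is only a stand-in for a real measure of emotion expression.

```python
import numpy as np

def scale_units(activations, unit_indices, scale=0.0):
    """Return a copy of `activations` with the chosen hidden units scaled.

    activations: array of shape (n_tokens, hidden_dim).
    scale=0.0 zeroes the units (ablation); other values dampen or amplify.
    """
    out = np.array(activations, dtype=float, copy=True)
    out[:, list(unit_indices)] *= scale
    return out

def projection_shift(activations, direction, unit_indices, scale=0.0):
    """Change in mean projection onto `direction` caused by the intervention."""
    unit = direction / np.linalg.norm(direction)
    before = (activations @ unit).mean()
    after = (scale_units(activations, unit_indices, scale) @ unit).mean()
    return after - before
```

A large shift for a small set of units would suggest those units carry much of the direction at that layer; a shift near zero would suggest the signal is spread elsewhere.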