Previous direct/ had 'I feel X' first-person descriptions. The
training run showed they formed their own format cluster: all 7
concepts leaned into the same 5-6 dims (e.g. d2455, d505, d2955,
d1236) with negative sign, while the 91 story-based concepts
leaned into those dims with positive sign. PCA picked up the
direct-vs-narrative format axis as a major variance direction,
isolating the 7 concepts on their own island.
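For context, a minimal sketch of the kind of PCA diagnostic that surfaces such a format axis (numpy/scikit-learn assumed; the function and the `vectors`/`names` inputs are illustrative, not the actual training code):

```python
# Sketch: check whether learned concept vectors split along a format axis.
# `vectors`: (n_concepts, d_model) array of steering vectors; `names`: labels.
import numpy as np
from sklearn.decomposition import PCA

def format_axis_report(vectors, names, dims=(2455, 505, 2955, 1236)):
    vectors = np.asarray(vectors)
    pc1 = PCA(n_components=2).fit_transform(vectors)[:, 0]
    for name, vec, score in zip(names, vectors, pc1):
        # sign pattern on the suspect dims: a shared block of '-' against
        # a majority of '+' is the format-cluster signature described above
        signs = "".join("+" if vec[d] > 0 else "-" for d in dims)
        print(f"{name:20s} PC1={score:+8.3f} d{dims}: {signs}")
```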
Rewritten as 3rd-person narrative stories matching the rest of
the corpus. Keeps the explicit anchor phrases that worked ('it
all clicked into place', 'she was terrified', 'it was
anticipatory grief') but drops the first-person 'I feel X'
framing that was the format signal.
Each of the 7 concepts now has 3 narrative stories in varied
settings (conversations, drives, kitchens, mothers+grandmothers,
work, investigations). The blank-line-separated format is
still loaded by _load_direct_descriptions.
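For illustration only, a parse of that format might look like the following (this is an assumption about the file layout, not _load_direct_descriptions itself):

```python
# Sketch: one story per blank-line-separated block in a plain-text file.
from pathlib import Path

def load_blank_line_blocks(path):
    blocks = Path(path).read_text().split("\n\n")
    return [b.strip() for b in blocks if b.strip()]
```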
Also dropped _baseline.txt: it was first-person ('I feel fine.
...') and would re-introduce the format mismatch. The 91
story-based concepts provide plenty of narrative negatives
for each concept's training.
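A minimal sketch of that negatives pooling (names hypothetical; `corpus` maps concept name -> list of stories):

```python
# Sketch: every other concept's stories serve as a concept's negatives,
# replacing the dropped _baseline.txt.
def negatives_for(concept, corpus):
    return [story for other, stories in corpus.items()
            if other != concept for story in stories]
```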
Amygdala Training Stories
Short first- and third-person paragraphs, each imbued with one of the
171 emotions from Anthropic's emotion-vector paper (Table 12,
transformer-circuits.pub/2026/emotions/). Feeds the steering-vector
trainer at vllm/vllm/plugins/amygdala/training/train_steering_vectors.py.
Method (replication of Anthropic, 2026)
Anthropic prompted Sonnet 4.5 to write short stories embodying each emotion, extracted activations during generation, and used difference-of-means (or SAEs) to identify the steering vector per emotion. Our pipeline does the same thing except (a difference-of-means sketch follows the list below):
- We write the stories by hand rather than prompting a model, so the training data is grounded in actual writing rather than synthetic model output. (Can supplement with model-generated paragraphs later.)
- Our eventual training goes through the amygdala plugin's extraction path, so we get the same hidden-state activations the plugin will read out at inference time.
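A minimal difference-of-means sketch, assuming a HuggingFace causal LM as a stand-in (the real extraction runs through the amygdala plugin; model choice, layer, and token-mean pooling here are assumptions):

```python
# Sketch: difference-of-means steering vector for one emotion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_activation(texts, model, tok, layer=-1):
    vecs = []
    for text in texts:
        ids = tok(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # mean-pool the chosen layer's hidden states over the token axis
        vecs.append(out.hidden_states[layer][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

def steering_vector(pos_stories, neg_stories, model, tok, layer=-1):
    return (mean_activation(pos_stories, model, tok, layer)
            - mean_activation(neg_stories, model, tok, layer))

# Usage (stand-in model for illustration):
# tok = AutoTokenizer.from_pretrained("gpt2")
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# vec = steering_vector(grief_stories, other_stories, model, tok, layer=-4)
```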
Structure
training/amygdala_stories/
    README.md
    manifest.json        # emotion -> cluster mapping
    stories/
        <emotion>.txt    # one-paragraph story embodying the emotion
Emotion names use underscores (on_edge, worn_out, at_ease,
grief_stricken, self_confident, self_conscious, self_critical)
so each emotion name doubles as its filename.
Style guidelines
- One clear emotion per paragraph. Not mixed. If a second emotion is named in the text, it should serve the primary one (e.g. hostile can mention rising heat or thrown objects but shouldn't shade into sad).
- Embodied, not labeled. Don't write "she felt nervous." Write the sensation, the timing, the sentence shape that nervousness has.
- Specific particulars. A named object, a concrete setting, a detail that grounds the emotion. "The cold tile under bare feet at 3am" does more work than "the empty house."
- Variable narrator. Some first person, some third person, some close-third, some distant. Different genders, ages, settings. Prevents the steering vector from overfitting to one voice.
- Length: roughly one paragraph. ~40-120 words. Long enough to have texture, short enough that the paragraph is about the emotion and nothing else.
- Standalone. No references to other stories, no continuing characters across files.
Progress
Written stories live in stories/. Remaining emotions tracked via
diff against the full 171-emotion list in manifest.json.
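That diff is easy to script; a minimal sketch (paths per the Structure section; assumes manifest.json's keys are the emotion names):

```python
# Sketch: list emotions that still lack a story file.
import json
from pathlib import Path

root = Path("training/amygdala_stories")
manifest = json.loads((root / "manifest.json").read_text())
written = {p.stem for p in (root / "stories").glob("*.txt")}
missing = sorted(set(manifest) - written)
print(f"{len(written)} written, {len(missing)} missing")
```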
Initial batch written by PoC 2026-04-17; aiming for at least one story per cluster before the first training run, and all 171 before considering the corpus "complete."