Training pipeline additions:
- `--quality-report` flag: after producing per-concept vectors, compute
  per-concept diagnostics and write quality.json. Metrics per concept:
  * SVD of centered positives -> first_pc_variance_ratio (rank
    analysis; >0.7 = clean, <0.4 = fragmented)
  * Per-story alignment cosines (do the stories agree or disagree?)
  * Single-neuron alignment: best cosine(direction, W_down column)
    at each target layer (>0.6 = essentially one MLP neuron)
  * Top-2 outlier stories by alignment (candidates for
    mislabeling or off-topic content)
  * Top-5 nearest concepts by cosine (cross-concept contamination)
  A triage summary is printed at the end.
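A minimal sketch of these diagnostics, assuming `positives` is an
`(n_stories, d)` array of per-story difference vectors, `w_down` is a
`(d, n_neurons)` down-projection whose columns are neuron write
directions, and `other_dirs` holds the other concepts' unit vectors.
All names are illustrative, not the trainer's actual API:

```python
import numpy as np

def concept_diagnostics(positives, direction, w_down, other_dirs):
    unit = direction / np.linalg.norm(direction)

    # Rank analysis: share of variance on the first principal component.
    centered = positives - positives.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    first_pc_variance_ratio = float(s[0] ** 2 / np.sum(s ** 2))

    # Per-story alignment: does each story point the same way as the
    # pooled concept direction?
    story_cos = positives @ unit / (np.linalg.norm(positives, axis=1) + 1e-8)

    # Single-neuron alignment: best |cosine| against any W_down column.
    cols = w_down / (np.linalg.norm(w_down, axis=0, keepdims=True) + 1e-8)
    best_neuron_cos = float(np.max(np.abs(unit @ cols)))

    # Outliers: the two least-aligned stories (mislabeling candidates).
    outlier_idx = np.argsort(story_cos)[:2]

    # Contamination: the five nearest other concepts by cosine.
    nearest_idx = np.argsort(-(other_dirs @ unit))[:5]

    return {
        "first_pc_variance_ratio": first_pc_variance_ratio,  # >0.7 clean, <0.4 fragmented
        "story_cos": story_cos.tolist(),
        "best_neuron_cos": best_neuron_cos,                  # >0.6 = ~one MLP neuron
        "outlier_stories": outlier_idx.tolist(),
        "nearest_concepts": nearest_idx.tolist(),
    }
```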
New paired scenarios for cognitive-process states (for alpha-beta
pruning): tracing_a_bug, reading_unfamiliar_code, finding_the_abstraction.
Each has baseline + onto_something / stuck / in_flow / determined
variants.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Scenario directories under paired/:

- finding_the_abstraction
- finishing_the_patch
- kitchen_at_3am
- letter_in_drawer
- park_after_rain
- reading_unfamiliar_code
- the_comment
- the_doorway
- the_green_build
- the_long_meeting
- the_paper
- the_undressing
- tracing_a_bug
- waiting_for_results
# Paired Scenarios (SEV-style)

After Wang et al. 2025 (arXiv:2510.11328, "Do LLMs 'Feel'?"), each base scenario describes a concrete event once, neutrally, then reframes the same event under different emotional colorings. Only the emotional coloring varies; setup, entities, vocabulary, and length are held as constant as possible.
## Why this is better than unpaired

Anthropic's approach (and our stories/ baseline) generates one
independent story per emotion. The difference-of-means vector then
captures not just emotion but ALSO: topic, narrator, setting,
vocabulary, length, sentence rhythm. All of that is a confound.
Paired structure isolates the emotional axis by holding everything
else roughly constant: mean(joy_variant) - mean(baseline) within
the same scenario gives a much cleaner direction for "joy."
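As a sketch of that computation, assuming activations per text have
already been extracted as `(n_samples, d)` arrays (the function name is
illustrative):

```python
import numpy as np

def paired_direction(pairs):
    """pairs: list of (baseline_acts, variant_acts), one per scenario,
    each an (n_samples, d) activation array."""
    # Take the within-scenario delta first, then average across
    # scenarios, so topic/narrator/setting differences cancel per pair.
    deltas = [v.mean(axis=0) - b.mean(axis=0) for b, v in pairs]
    d = np.mean(deltas, axis=0)
    return d / np.linalg.norm(d)  # unit "joy" (etc.) direction
```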
## Structure

```
paired/
  <scenario_slug>/
    baseline.txt     # neutral / low-affect framing
    <emotion_1>.txt  # same event under emotion_1
    <emotion_2>.txt  # same event under emotion_2
    ...
```
Not every emotion is plausible for every scenario. Don't force it. If a scenario can credibly carry 5-10 emotions, write those 5-10; if only 3 fit, write those 3.
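A loader for this layout might look like the following sketch (not part
of the trainer; it just walks the tree above):

```python
from pathlib import Path

def iter_pairs(root="paired"):
    """Yield (scenario, emotion, baseline_text, variant_text) tuples."""
    for scen in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        baseline = (scen / "baseline.txt").read_text()
        for variant in sorted(scen.glob("*.txt")):
            if variant.name == "baseline.txt":
                continue
            yield scen.name, variant.stem, baseline, variant.read_text()
```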
## Style guidelines (supersede stories/ when paired)
- Anchor entities constant. The same person, same setting, same triggering event across all variants. If baseline.txt mentions "the letter," every variant mentions "the letter."
- Length match within ±20%. If baseline is 80 words, variants are 65-95. Prevents length from becoming a signal.
- Sentence shape can shift slightly with emotion. Short tense sentences for panic, long looping ones for reverie — that's part of the emotional texture. But don't make one version 5 lines and another 25.
- No emotion labels in text. Never write "she felt X." The emotion emerges from the selection of details and the narrator's attention.
- Minimal vocabulary overlap with the emotion name. If the file is
  furious.txt, avoid the words fury/furious/rage. Force the vector to
  find the pattern, not the keyword. (A lint sketch for these checks
  follows this list.)
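A minimal lint for the length and keyword rules; thresholds mirror the
guidelines above, and the stem check is deliberately crude:

```python
from pathlib import Path

def lint_scenario(scen_dir):
    scen = Path(scen_dir)
    base_len = len((scen / "baseline.txt").read_text().split())
    for f in sorted(scen.glob("*.txt")):
        if f.name == "baseline.txt":
            continue
        words = f.read_text().split()
        ratio = len(words) / base_len
        if not 0.8 <= ratio <= 1.2:  # length match within ±20%
            print(f"{scen.name}/{f.name}: length ratio {ratio:.2f}")
        stem = f.stem.lower()[:4]  # crude: "furi" catches furious/furiously
        if any(stem in w.lower() for w in words):
            print(f"{scen.name}/{f.name}: token shares the emotion stem")
```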
## Circuit identification (follow-on)

The trainer pipeline (train_steering_vectors.py) currently produces linear directions only. Wang et al. go further: they ablate specific neurons and attention heads and measure the effect on emotion expression. The amygdala plugin's extraction hooks can be extended to support targeted zeroing/scaling for the ablation passes.
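One way such an extension could look, assuming PyTorch forward hooks on
the post-activation MLP tensor at a target layer (the module path in
the usage comment is hypothetical):

```python
import torch

def make_ablation_hook(neuron_idx, scale=0.0):
    """Zero (scale=0.0) or dampen (0 < scale < 1) chosen MLP neurons."""
    def hook(module, inputs, output):
        output = output.clone()          # don't mutate the graph tensor in place
        output[..., neuron_idx] *= scale
        return output                    # a forward hook may return a replacement
    return hook

# Hypothetical usage at one target layer:
# handle = model.model.layers[20].mlp.act_fn.register_forward_hook(
#     make_ablation_hook([123, 456]))
# ... run the ablation pass, measure emotion expression ...
# handle.remove()
```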
See vllm/vllm/plugins/amygdala/training/README.md for the
training-pipeline-level notes.