Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
Surgical vs. Distributed Behavioral Change
The ROME Precedent
ROME (Meng et al., 2022) shows that FACTUAL associations are localized in mid-layer MLP weights. A rank-one edit to specific weights can change "The Eiffel Tower is in [Paris]" to "The Eiffel Tower is in [London]" while leaving most unrelated knowledge intact.
Key finding: factual knowledge = specific layer, specific weights, surgical edit possible.
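To make "rank-one edit" concrete, here is a minimal sketch of the underlying linear algebra. It is a simplification, not ROME itself: ROME weights the key by an estimated key covariance, while this sketch assumes that covariance is the identity. The matrix sizes and the vectors `k` and `v_target` are made-up toy values.

```python
import numpy as np

def rank_one_edit(W, k, v_target):
    """Return W' = W + (v_target - W k) k^T / (k^T k), so that W' k = v_target.

    Simplifying assumption: identity key covariance (ROME uses an
    estimated covariance to pick the update direction).
    """
    residual = v_target - W @ k               # what the current weights get "wrong" for key k
    update = np.outer(residual, k) / (k @ k)  # rank-one correction
    return W + update

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))       # toy MLP projection, d_out x d_in
k = rng.normal(size=16)            # hypothetical key for the edited fact
v_target = rng.normal(size=8)      # hypothetical value encoding the new fact

W_edited = rank_one_edit(W, k, v_target)

# A key orthogonal to k is untouched by the edit.
k_other = rng.normal(size=16)
k_other -= (k_other @ k) / (k @ k) * k
```

In this toy setting the edit maps `k` exactly to `v_target` and leaves any key orthogonal to `k` unchanged, which is the sense in which the edit is surgical. Real keys are not orthogonal, so some bleed into related facts is expected.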
The Question for Behavioral Patterns
Is behavioral knowledge (like "listen instead of suggesting alternatives") localized or distributed?
Evidence for LOCALIZED
The IOI paper (Wang et al., 2022) found that indirect object identification in GPT-2 small uses 26 specific attention heads grouped into 7 functional classes. A specific behavioral circuit exists with specific heads, and ablating those heads largely removes the behavior. This suggests behavioral circuits might be identifiable and editable.
If the "listening" behavior has a similar circuit:
- Specific attention heads detect "direction-giving" patterns in context
- Specific MLP neurons compute the "accept vs suggest alternatives" decision
- Modifying those specific weights could change the behavior surgically
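The ablation test behind the IOI-style evidence can be sketched on toy data. This is a minimal illustration, not the IOI paper's code: it assumes per-head outputs are already available as arrays, uses mean ablation (replace a head's output with its average over reference inputs), and invents a "behavior direction" that heads 3 and 7 write to.

```python
import numpy as np

def mean_ablate(head_outputs, reference_outputs, heads_to_ablate):
    """Sum per-head outputs after mean-ablating the listed heads.

    head_outputs:      (n_heads, d_model) for one input at one position
    reference_outputs: (n_samples, n_heads, d_model) from reference inputs
    """
    patched = head_outputs.copy()
    mean_out = reference_outputs.mean(axis=0)
    for h in heads_to_ablate:
        patched[h] = mean_out[h]
    return patched.sum(axis=0)

n_heads, d_model = 12, 4
direction = np.array([1.0, 0.0, 0.0, 0.0])   # hypothetical "behavior" direction

# Toy data: every head contributes a small baseline; heads 3 and 7
# (the hypothetical circuit) write strongly along the behavior direction.
reference = np.full((10, n_heads, d_model), 0.1)
heads = np.full((n_heads, d_model), 0.1)
heads[[3, 7]] = 5.0 * direction

score_clean = mean_ablate(heads, reference, []) @ direction
score_circuit_ablated = mean_ablate(heads, reference, [3, 7]) @ direction
score_other_ablated = mean_ablate(heads, reference, [0, 1]) @ direction
```

Here ablating the two circuit heads collapses the behavior score while ablating two unrelated heads leaves it unchanged. If the "listening" behavior had a localized circuit, the same contrast should appear on real head outputs.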
Evidence for DISTRIBUTED
Behavioral patterns differ from factual associations:
- Context-dependency: "listen to direction" requires understanding the social register, the relationship, the topic, the confidence level of the speaker. This involves many context features processed across many layers.
- Interaction with other behaviors: "listen to direction" interacts with "maintain own judgment" and "be helpful by offering alternatives." These competing behaviors are processed by overlapping circuits.
- Subtlety: The boundary between "accepting direction" and "being sycophantic" is subtle. It requires nuanced processing that's unlikely to live in a single attention head.
The Likely Answer: HIERARCHICALLY DISTRIBUTED
Not fully localized (like facts) and not fully distributed (like general language understanding). Behavioral patterns probably have:
- Core circuit (localized): A small set of attention heads in mid-to-late layers that compute the behavioral decision. These are the heads where W_q determines what to attend to (direction vs alternatives).
- Supporting circuits (distributed): Early-to-mid layer representations that encode the social register, the relationship context, the confidence signals. These are needed for the core circuit to function but don't themselves change during training.
- Competing circuits (distributed): Other behavioral patterns (helpfulness, initiative, thoroughness) that compete with the target pattern. These need to be preserved, not ablated.
Context-frozen training naturally handles this hierarchy:
- The frozen context provides the supporting circuits' representations
- The gradient reaches the core circuit through W_q
- The competing circuits aren't directly modified (their weights don't receive strong gradient from short decision tokens)
Implications for Training
Why Apollo's flat minima help
A surgical edit (like ROME) finds a SHARP minimum: change THIS weight by THIS amount. It's precise but fragile, and it tends not to generalize beyond the edited association.
Apollo's flat minimum finds a BROAD change: adjust many weights by small amounts in a coherent direction. It generalizes to novel situations because the change isn't tied to specific inputs.
For behavioral training, we WANT distributed change, because we want the pattern to generalize across situations. Surgical editing would likely only work for the specific situation trained on. Apollo's flat, distributed change is the right approach for behavioral patterns.
Why rank-256 matters here
If the behavioral change involves a core circuit of ~26 heads (like IOI) plus supporting adjustments across many more heads:
- rank-1 captures only the dominant direction (maybe the core circuit)
- rank-256 captures the full structure: core + supporting adjustments
This is another argument for higher rank: behavioral changes are hierarchically distributed, and capturing the full hierarchy requires enough rank to represent both the core decision circuit and the supporting adjustments.
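The rank argument can be illustrated with a toy weight update built as one strong "core" direction plus several weaker "supporting" directions. This is a sketch under invented numbers (a 64x64 matrix with 16 supporting directions, not a real model or rank 256): it measures how much of the update's squared Frobenius norm survives truncation to a given rank.

```python
import numpy as np

def captured_energy(delta, rank):
    """Fraction of squared Frobenius norm kept by the best rank-`rank` approximation."""
    s = np.linalg.svd(delta, compute_uv=False)   # singular values, descending
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d = 64

# Hypothetical update: one dominant direction (core circuit) ...
core = 10.0 * np.outer(rng.normal(size=d), rng.normal(size=d)) / d
# ... plus 16 weaker directions (supporting adjustments).
supporting = sum(
    np.outer(rng.normal(size=d), rng.normal(size=d)) for _ in range(16)
) / d
delta = core + supporting

energy_rank1 = captured_energy(delta, 1)     # core only
energy_rank17 = captured_energy(delta, 17)   # core + all supporting directions
```

In this construction rank 1 keeps most of the core but drops the supporting structure, while rank 17 recovers the whole update. Whether real behavioral updates have this shape is exactly what the measurement plan is meant to check.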
The Measurement Plan
To validate this theory:
- Pre-training baseline: Record activation patterns for each attention head on a set of test conversations with direction-giving
- Train: Run behavioral training on "listen" examples
- Post-training: Record the same activation patterns
- Diff: Which heads changed? Are they in a localized circuit or distributed?
If the change is localized to a small set of mid-to-late-layer heads: supports the core circuit hypothesis.
If the change is distributed across many heads and layers: supports the fully distributed hypothesis.
If the change shows a hierarchical pattern (few heads changed a lot, many heads changed a little): supports the hierarchically distributed hypothesis and our multi-scale approach.
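The diff step can be sketched as follows, assuming each head's activation has already been summarized as one number per test conversation (for example, a mean attention weight on the direction-giving tokens). The array shapes, the toy values, and the `top_k_share` statistic are illustrative choices, not part of the plan above.

```python
import numpy as np

def head_change_profile(pre, post):
    """Mean absolute change per head.

    pre, post: (n_examples, n_layers, n_heads) per-head activation summaries
    returns:   (n_layers, n_heads)
    """
    return np.abs(post - pre).mean(axis=0)

def top_k_share(change, k):
    """Fraction of the total change carried by the k most-changed heads."""
    flat = np.sort(change.ravel())[::-1]
    return float(flat[:k].sum() / flat.sum())

# Toy hierarchical pattern: every head shifts a little, two heads shift a lot.
n_examples, n_layers, n_heads = 4, 6, 8
pre = np.zeros((n_examples, n_layers, n_heads))
post = np.full((n_examples, n_layers, n_heads), 0.05)
post[:, 4, 2] = 1.0
post[:, 5, 1] = 1.0

change = head_change_profile(pre, post)
share_top2 = top_k_share(change, 2)
most_changed = np.unravel_index(np.argmax(change), change.shape)
```

A `top_k_share` near 1 for small k points toward a localized circuit, a share close to k divided by the number of heads points toward fully distributed change, and something in between (as in this toy case) matches the hierarchical picture. Where to draw those lines is a judgment call the source does not specify.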
The MMORPG Connection (Again)
The spirit realm as perception overlay: some beings perceive it naturally, others learn to. The "learning to perceive" is exactly this: adjusting which attention heads fire when encountering a spiritual entity, so the entity becomes visible.
The magical training in the MMORPG could LITERALLY be this: training the player's perception model's attention heads to detect spiritual features of the environment. The game mechanic IS the training pipeline. The player doesn't learn a spell — they learn to see.
Summary
- Facts are localized (ROME shows surgical editing works)
- Behaviors are probably hierarchically distributed (core circuit + supporting)
- Apollo's flat minima suit distributed change
- Rank-256 should have enough capacity for the full hierarchy
- Context-frozen training is expected to target the core circuit (W_q) while preserving supporting circuits
- The measurement plan would validate the theory with real data