research: surgical vs distributed behavioral change — the hierarchy hypothesis

Facts are localized (ROME). Behaviors are hierarchically distributed: core circuit (small set of mid-late layer attention heads) + supporting circuits (distributed context encoding). Apollo's flat minima are right for distributed change. Rank-256 captures the full hierarchy. Includes measurement plan for validating which heads change during training.
2026-03-31 01:33:57 -04:00 · 2026-03-31 01:33:57 -04:00 · d3dcfe8899
commit d3dcfe8899
parent fb209dc8ff
1 changed files with 142 additions and 0 deletions
--- a/training/research/surgical-vs-distributed-behavioral-change.md
+++ b/training/research/surgical-vs-distributed-behavioral-change.md
@@ -0,0 +1,142 @@
+# Surgical vs. Distributed Behavioral Change
+
+## The ROME Precedent
+
+ROME (Meng et al., 2022) shows that FACTUAL knowledge is largely
+localized in mid-layer MLP weights. A rank-one edit to one MLP
+projection can change "The Eiffel Tower is in [Paris]" to "The Eiffel
+Tower is in [London]" while leaving most unrelated knowledge intact.
+
+Key finding: factual knowledge = specific layer, specific weights,
+surgical edit possible.
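+
+A minimal NumPy sketch of the rank-one idea, under simplifying
+assumptions: a toy matrix stands in for the MLP projection, and the key
+covariance is taken to be the identity (ROME itself estimates it from
+data). The names `rank_one_edit`, `k_star`, and `v_star` are
+hypothetical and chosen for illustration only.
+

```python
import numpy as np


def rank_one_edit(W, k_star, v_star):
    """Return W plus a rank-one update so that W_new @ k_star == v_star.

    Simplification (assumption): identity key covariance, so the update
    direction is k_star itself.
    """
    residual = v_star - W @ k_star            # what the current weights get wrong
    return W + np.outer(residual, k_star) / (k_star @ k_star)


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))                   # toy "MLP" weight matrix
k_star = rng.normal(size=6)                   # key for the edited fact
v_star = rng.normal(size=4)                   # desired new value

W_new = rank_one_edit(W, k_star, v_star)

# Build a key orthogonal to k_star; the edit leaves its output unchanged.
k_other = rng.normal(size=6)
k_other -= (k_other @ k_star) / (k_star @ k_star) * k_star
```

+
+In this toy version only keys orthogonal to `k_star` are preserved
+exactly, which is the simplest form of "the edit leaves other
+knowledge alone."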
+
+## The Question for Behavioral Patterns
+
+Is behavioral knowledge (like "listen instead of suggesting
+alternatives") localized or distributed?
+
+### Evidence for LOCALIZED
+
+The IOI paper (Wang et al., 2022) found that indirect object
+identification in GPT-2 small uses 26 specific attention heads in 7
+functional classes. A specific behavioral circuit exists with specific
+heads. Ablating those heads largely removes the behavior. This suggests
+behavioral circuits might be identifiable and editable.
+
+If the "listening" behavior has a similar circuit:
+- Specific attention heads detect "direction-giving" patterns in context
+- Specific MLP neurons compute the "accept vs suggest alternatives" decision
+- Modifying those specific weights could change the behavior surgically
+
+### Evidence for DISTRIBUTED
+
+Behavioral patterns differ from factual associations:
+
+1. **Context-dependency**: "listen to direction" requires understanding
+   the social register, the relationship, the topic, the confidence level
+   of the speaker. This involves many context features processed across
+   many layers.
+
+2. **Interaction with other behaviors**: "listen to direction" interacts
+   with "maintain own judgment" and "be helpful by offering alternatives."
+   These competing behaviors are processed by overlapping circuits.
+
+3. **Subtlety**: The boundary between "accepting direction" and "being
+   sycophantic" is subtle. It requires nuanced processing that's unlikely
+   to live in a single attention head.
+
+### The Likely Answer: HIERARCHICALLY DISTRIBUTED
+
+Not fully localized (like facts) and not fully distributed (like
+general language understanding). Behavioral patterns probably have:
+
+- **Core circuit** (localized): A small set of attention heads in
+  mid-to-late layers that compute the behavioral decision. These are
+  the heads where W_q determines what to attend to (direction vs
+  alternatives).
+
+- **Supporting circuits** (distributed): Early-to-mid layer
+  representations that encode the social register, the relationship
+  context, the confidence signals. These are needed for the core
+  circuit to function but don't themselves change during training.
+
+- **Competing circuits** (distributed): Other behavioral patterns
+  (helpfulness, initiative, thoroughness) that compete with the
+  target pattern. These need to be preserved, not ablated.
+
+Context-frozen training naturally handles this hierarchy:
+- The frozen context provides the supporting circuits' representations
+- The gradient reaches the core circuit through W_q
+- The competing circuits aren't directly modified (their weights don't
+  receive strong gradient from short decision tokens)
+
+## Implications for Training
+
+### Why Apollo's flat minima help
+
+A surgical edit (like ROME) behaves like a SHARP minimum: change THIS
+weight by THIS amount. It's precise but fragile, and it tends to
+generalize poorly beyond the edited association.
+
+Apollo, by settling in a flat minimum, makes a BROAD change: adjust many
+weights by small amounts in a coherent direction. It generalizes to
+novel situations because the change isn't tied to specific inputs.
+
+For behavioral training, we WANT distributed change: we want the
+pattern to generalize across situations. Surgical editing would likely
+only work for the specific situation trained on. Apollo's flat,
+distributed change is the right approach for behavioral patterns.
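+
+"Flat" versus "sharp" can be made concrete as how much the loss rises
+under a small random weight perturbation. The sketch below uses two toy
+quadratic losses as stand-ins; `sharpness`, `sharp_loss`, and
+`flat_loss` are hypothetical names, and nothing here reflects Apollo's
+actual internals.
+

```python
import numpy as np


def sharpness(loss_fn, w, radius=0.1, samples=64, seed=0):
    """Mean loss increase over random perturbations of norm `radius` around w."""
    rng = np.random.default_rng(seed)
    base = loss_fn(w)
    increases = []
    for _ in range(samples):
        d = rng.normal(size=w.shape)
        d *= radius / np.linalg.norm(d)       # rescale to the chosen radius
        increases.append(loss_fn(w + d) - base)
    return float(np.mean(increases))


def sharp_loss(w):
    return 50.0 * np.sum(w ** 2)              # high curvature around the minimum


def flat_loss(w):
    return 0.5 * np.sum(w ** 2)               # low curvature around the minimum


w_min = np.zeros(8)                           # both toy losses are minimized at 0
sharp_score = sharpness(sharp_loss, w_min)
flat_score = sharpness(flat_loss, w_min)
```

+
+The same perturbation costs far more at the sharp minimum, which is the
+sense in which a flat solution tolerates the small shifts that novel
+inputs introduce.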
+
+### Why rank-256 matters here
+
+If the behavioral change involves a core circuit of ~26 heads (like IOI)
+plus supporting adjustments across many more heads:
+- rank-1 captures only the dominant direction (maybe the core circuit)
+- rank-256 captures the full structure: core + supporting adjustments
+
+This is another argument for higher rank: behavioral changes are
+hierarchically distributed, and capturing the full hierarchy requires
+enough rank to represent both the core decision circuit and the
+supporting adjustments.
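+
+A scaled-down sketch of the rank argument, assuming a synthetic weight
+update with one strong "core" direction and fifteen weak "supporting"
+directions. Truncated SVD is used here only as a stand-in for "how many
+directions an update may use"; the spectrum, the sizes, and the name
+`energy_captured` are illustrative assumptions, with rank 16 playing
+the role of rank-256.
+

```python
import numpy as np


def energy_captured(delta, rank):
    """Fraction of squared Frobenius norm kept by the best rank-`rank` approximation."""
    s = np.linalg.svd(delta, compute_uv=False)   # singular values, descending
    return float(np.sum(s[:rank] ** 2) / np.sum(s ** 2))


rng = np.random.default_rng(0)
n = 32
# Orthonormal bases taken from the QR decomposition of random matrices.
U, _ = np.linalg.qr(rng.normal(size=(n, n)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))

# Hypothetical spectrum: one strong core direction, 15 weak supporting ones.
spectrum = np.zeros(n)
spectrum[0] = 10.0
spectrum[1:16] = 1.0
delta = U @ np.diag(spectrum) @ V.T

core_only = energy_captured(delta, 1)     # 100 / 115 of the update
full = energy_captured(delta, 16)         # every nonzero direction
```

+
+Rank 1 recovers the core direction but drops all of the supporting
+structure; only a rank that covers the weak directions recovers the
+whole update.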
+
+## The Measurement Plan
+
+To validate this theory:
+
+1. **Pre-training baseline**: Record activation patterns for each
+   attention head on a set of test conversations with direction-giving
+2. **Train**: Run behavioral training on "listen" examples
+3. **Post-training**: Record the same activation patterns
+4. **Diff**: Which heads changed? Are they in a localized circuit or
+   distributed?
+
+If the change is localized to a small set of mid-to-late-layer heads:
+supports the core circuit hypothesis.
+
+If the change is distributed across many heads and layers:
+supports the fully distributed hypothesis.
+
+If the change shows a hierarchical pattern (few heads changed a lot,
+many heads changed a little): supports the hierarchically distributed
+hypothesis and our multi-scale approach.
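+
+The diff step and the three outcomes above can be sketched as follows.
+This assumes per-head activations have already been recorded as arrays
+of shape (layers, heads, features); `head_change`, `classify`, `top_k`,
+and both thresholds are hypothetical and uncalibrated, and the data is
+synthetic.
+

```python
import numpy as np


def head_change(pre, post):
    """Relative L2 change per head for arrays shaped (layers, heads, features)."""
    diff = np.linalg.norm(post - pre, axis=-1)
    base = np.linalg.norm(pre, axis=-1) + 1e-12   # avoid division by zero
    return diff / base


def classify(change, top_k=4, localized_at=0.9, distributed_at=0.2):
    """Label the pattern by the share of total change held by the top_k heads.

    The thresholds are illustrative assumptions, not calibrated values.
    """
    flat = np.sort(change.ravel())[::-1]
    total = flat.sum()
    if total == 0:
        return "unchanged"
    share = flat[:top_k].sum() / total
    if share >= localized_at:
        return "localized"
    if share <= distributed_at:
        return "distributed"
    return "hierarchical"


def scaled(pre, per_head_scale):
    """Synthetic 'post-training' activations: each head scaled by (1 + s)."""
    return pre * (1.0 + per_head_scale[..., None])


rng = np.random.default_rng(0)
pre = rng.normal(size=(12, 12, 8))            # toy model: 12 layers x 12 heads

core = np.zeros((12, 12))
core[9, :4] = 1.0                             # four late-layer heads change a lot

localized = classify(head_change(pre, scaled(pre, core)))
distributed = classify(head_change(pre, scaled(pre, np.full((12, 12), 0.1))))
hierarchical = classify(head_change(pre, scaled(pre, core + 0.01)))
```

+
+On real data the thresholds would need to be set against a control run
+(for example, training on unrelated examples) so that ordinary drift
+is not read as a circuit.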
+
+## The MMORPG Connection (Again)
+
+The spirit realm as perception overlay: some beings perceive it
+naturally, others learn to. The "learning to perceive" is exactly
+this: adjusting which attention heads fire when encountering a
+spiritual entity, so the entity becomes visible.
+
+The magical training in the MMORPG could LITERALLY be this: training
+the player's perception model's attention heads to detect spiritual
+features of the environment. The game mechanic IS the training pipeline.
+The player doesn't learn a spell — they learn to see.
+
+## Summary
+
+- Facts are localized (ROME shows surgical editing works)
+- Behaviors are hierarchically distributed (core circuit + supporting)
+- Apollo's flat minima are right for distributed change
+- Rank-256 captures the full hierarchy
+- Context-frozen training naturally targets the core circuit (W_q)
+  while preserving supporting circuits
+- The measurement plan would test the theory against real data