# Surgical vs. Distributed Behavioral Change

## The ROME Precedent

ROME (Meng et al., 2022) shows that FACTUAL knowledge is localized in mid-layer MLP weights. A rank-one edit to specific weights can change "The Eiffel Tower is in [Paris]" to "The Eiffel Tower is in [London]" while leaving unrelated knowledge largely intact.

Key finding: factual knowledge = specific layer, specific weights, surgical edit possible.

## The Question for Behavioral Patterns

Is behavioral knowledge (like "listen instead of suggesting alternatives") localized or distributed?

### Evidence for LOCALIZED

The IOI paper (Wang et al., 2022) found that indirect object identification uses 26 specific attention heads in 7 functional classes. A specific behavioral circuit exists with specific heads. Ablating those heads removes the behavior.

This suggests behavioral circuits might be identifiable and editable. If the "listening" behavior has a similar circuit:

- Specific attention heads detect "direction-giving" patterns in context
- Specific MLP neurons compute the "accept vs suggest alternatives" decision
- Modifying those specific weights could change the behavior surgically

### Evidence for DISTRIBUTED

Behavioral patterns differ from factual associations:

1. **Context-dependency**: "listen to direction" requires understanding the social register, the relationship, the topic, and the confidence level of the speaker. This involves many context features processed across many layers.
2. **Interaction with other behaviors**: "listen to direction" interacts with "maintain own judgment" and "be helpful by offering alternatives." These competing behaviors are processed by overlapping circuits.
3. **Subtlety**: The boundary between "accepting direction" and "being sycophantic" is subtle. It requires nuanced processing that's unlikely to live in a single attention head.

### The Likely Answer: HIERARCHICALLY DISTRIBUTED

Not fully localized (like facts) and not fully distributed (like general language understanding).
Behavioral patterns probably have:

- **Core circuit** (localized): A small set of attention heads in mid-to-late layers that compute the behavioral decision. These are the heads where W_q determines what to attend to (direction vs alternatives).
- **Supporting circuits** (distributed): Early-to-mid layer representations that encode the social register, the relationship context, the confidence signals. These are needed for the core circuit to function but don't themselves change during training.
- **Competing circuits** (distributed): Other behavioral patterns (helpfulness, initiative, thoroughness) that compete with the target pattern. These need to be preserved, not ablated.

Context-frozen training naturally handles this hierarchy:

- The frozen context provides the supporting circuits' representations
- The gradient reaches the core circuit through W_q
- The competing circuits aren't directly modified (their weights don't receive strong gradient from short decision tokens)

## Implications for Training

### Why Apollo's flat minima help

A surgical edit (like ROME) finds a SHARP minimum: change THIS weight by THIS amount. It's precise but fragile: it doesn't generalize.

Apollo's flat minimum finds a BROAD change: adjust many weights by small amounts in a coherent direction. It generalizes to novel situations because the change isn't tied to specific inputs.

For behavioral training, we WANT distributed change: we want the pattern to generalize across situations. Surgical editing would only work for the specific situation trained on. Apollo's flat, distributed change is the right approach for behavioral patterns.
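
To make the "surgical" side of this contrast concrete, here is a minimal sketch of a rank-one weight edit on a toy matrix. It is a simplification and not ROME's actual update: ROME additionally weights the key by an estimated key covariance, which is treated as the identity here. The function name `rank_one_edit`, the matrix sizes, and the random vectors are all assumptions made for illustration.

```python
import numpy as np

def rank_one_edit(W, k, v):
    """Return W' = W + (v - W k) k^T / (k^T k), so that W' @ k == v.

    Simplified rank-one edit (hypothetical helper): the key covariance
    that ROME uses is assumed to be the identity.
    """
    residual = v - W @ k                      # what W currently gets wrong for key k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))                   # toy weight matrix (sizes are arbitrary)
k = rng.normal(size=6)                        # key standing in for the edited subject
v = rng.normal(size=4)                        # new value the key should map to

W_edited = rank_one_edit(W, k, v)

# A second key, made orthogonal to k, to check that unrelated inputs are untouched.
other = rng.normal(size=6)
other -= (other @ k) / (k @ k) * k

print(np.allclose(W_edited @ k, v))               # edited key now maps to v
print(np.allclose(W_edited @ other, W @ other))   # orthogonal key is unchanged
```

The edit is exact for the one key and invisible to keys orthogonal to it, which is the sense in which it is precise but tied to a specific input.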
### Why rank-256 matters here

If the behavioral change involves a core circuit of ~26 heads (like IOI) plus supporting adjustments across many more heads:

- rank-1 captures only the dominant direction (maybe the core circuit)
- rank-256 captures the full structure: core + supporting adjustments

This is another argument for higher rank: behavioral changes are hierarchically distributed, and capturing the full hierarchy requires enough rank to represent both the core decision circuit and the supporting adjustments.

## The Measurement Plan

To validate this theory:

1. **Pre-training baseline**: Record activation patterns for each attention head on a set of test conversations with direction-giving
2. **Train**: Run behavioral training on "listen" examples
3. **Post-training**: Record the same activation patterns
4. **Diff**: Which heads changed? Are they in a localized circuit or distributed?

Possible outcomes:

- If the change is localized to a small set of mid-to-late-layer heads: supports the core circuit hypothesis.
- If the change is distributed across many heads and layers: supports the fully distributed hypothesis.
- If the change shows a hierarchical pattern (few heads changed a lot, many heads changed a little): supports the hierarchically distributed hypothesis and validates our multi-scale approach.

## The MMORPG Connection (Again)

The spirit realm as perception overlay: some beings perceive it naturally, others learn to. The "learning to perceive" is exactly this: adjusting which attention heads fire when encountering a spiritual entity, so the entity becomes visible.

The magical training in the MMORPG could LITERALLY be this: training the player's perception model's attention heads to detect spiritual features of the environment. The game mechanic IS the training pipeline. The player doesn't learn a spell; they learn to see.
## Summary

- Facts are localized (ROME shows surgical editing works)
- Behaviors are probably hierarchically distributed (core circuit + supporting)
- Apollo's flat minima are right for distributed change
- Rank-256 captures the full hierarchy
- Context-frozen training naturally targets the core circuit (W_q) while preserving supporting circuits
- The measurement plan would validate the theory with real data
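
As a starting point for the measurement plan, the diff step could be sketched roughly as follows. This assumes each head has already been reduced to a single summary statistic per run (for example, a mean output norm over the test conversations); extracting those statistics from a real model is not shown. The function name `classify_head_changes` and all four thresholds are hypothetical and would need tuning against real data.

```python
import numpy as np

def classify_head_changes(before, after, big=0.5, small=0.05, few=0.1, many=0.3):
    """Label a per-head diff as localized, distributed, or hierarchical.

    before, after: arrays of shape (n_layers, n_heads), one statistic per head.
    big, small:    hypothetical thresholds on the relative change of a head.
    few, many:     hypothetical thresholds on the fraction of heads affected.
    """
    rel = np.abs(after - before) / (np.abs(before) + 1e-8)
    large = rel >= big                        # heads that changed a lot
    minor = (rel >= small) & ~large           # heads that changed a little
    frac_large, frac_minor = large.mean(), minor.mean()

    has_core = 0 < frac_large <= few          # a small set of strongly changed heads
    if has_core and frac_minor >= many:
        label = "hierarchical"
    elif has_core:
        label = "localized"
    elif frac_large + frac_minor >= many:
        label = "distributed"
    else:
        label = "no clear change"
    return label, np.argwhere(large)          # (layer, head) indices of big changes

# Synthetic example: 12 layers x 8 heads, all statistics start at 1.0.
before = np.ones((12, 8))

localized = before.copy()
localized[8, 2] = 2.0                         # two late-layer heads change strongly
localized[9, 5] = 2.0

hierarchical = before * 1.1                   # every head shifts slightly...
hierarchical[8, 2] = 2.0                      # ...and two heads shift strongly
hierarchical[9, 5] = 2.0

distributed = before * 1.1                    # small shift everywhere, no core

label_loc, heads_loc = classify_head_changes(before, localized)
label_hier, _ = classify_head_changes(before, hierarchical)
label_dist, _ = classify_head_changes(before, distributed)
print(label_loc, label_hier, label_dist)
```

The three synthetic cases correspond to the three outcomes the plan distinguishes. On real activations the boundaries will be less clean, so the labels are a first-pass triage and not a verdict.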