From d3dcfe8899bcdf13d87f62917831a8408e3e291c Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 01:33:57 -0400
Subject: [PATCH] =?UTF-8?q?research:=20surgical=20vs=20distributed=20behav?=
 =?UTF-8?q?ioral=20change=20=E2=80=94=20the=20hierarchy=20hypothesis?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Facts are localized (ROME). Behaviors are hierarchically distributed:
core circuit (small set of mid-late layer attention heads) + supporting
circuits (distributed context encoding). Apollo's flat minima are right
for distributed change. Rank-256 captures the full hierarchy. Includes
measurement plan for validating which heads change during training.
---
 ...rgical-vs-distributed-behavioral-change.md | 142 ++++++++++++++++++
 1 file changed, 142 insertions(+)
 create mode 100644 training/research/surgical-vs-distributed-behavioral-change.md

diff --git a/training/research/surgical-vs-distributed-behavioral-change.md b/training/research/surgical-vs-distributed-behavioral-change.md
new file mode 100644
index 0000000..d129ad4
--- /dev/null
+++ b/training/research/surgical-vs-distributed-behavioral-change.md
@@ -0,0 +1,142 @@
+# Surgical vs. Distributed Behavioral Change
+
+## The ROME Precedent
+
+ROME (Meng et al., 2022) shows that FACTUAL knowledge is localized
+in mid-layer MLP weights. A rank-one edit to specific weights can
+change "The Eiffel Tower is in [Paris]" to "The Eiffel Tower is in
+[London]" while leaving unrelated knowledge largely intact.
+
+Key finding: factual knowledge = specific layer, specific weights,
+surgical edit possible.
+
+## The Question for Behavioral Patterns
+
+Is behavioral knowledge (like "listen instead of suggesting
+alternatives") localized or distributed?
+
+### Evidence for LOCALIZED
+
+The IOI paper (Wang et al., 2022) found that indirect object
+identification uses 26 specific attention heads in 7 functional classes.
+A specific behavioral circuit exists with specific heads. Ablating those
+heads removes the behavior. This suggests behavioral circuits might be
+identifiable and editable.
+
+If the "listening" behavior has a similar circuit:
+- Specific attention heads detect "direction-giving" patterns in context
+- Specific MLP neurons compute the "accept vs suggest alternatives" decision
+- Modifying those specific weights could change the behavior surgically
+
+### Evidence for DISTRIBUTED
+
+Behavioral patterns differ from factual associations:
+
+1. **Context-dependency**: "listen to direction" requires understanding
+   the social register, the relationship, the topic, the confidence level
+   of the speaker. This involves many context features processed across
+   many layers.
+
+2. **Interaction with other behaviors**: "listen to direction" interacts
+   with "maintain own judgment" and "be helpful by offering alternatives."
+   These competing behaviors are processed by overlapping circuits.
+
+3. **Subtlety**: The boundary between "accepting direction" and "being
+   sycophantic" is subtle. It requires nuanced processing that's unlikely
+   to live in a single attention head.
+
+### The Likely Answer: HIERARCHICALLY DISTRIBUTED
+
+Not fully localized (like facts) and not fully distributed (like
+general language understanding). Behavioral patterns probably have:
+
+- **Core circuit** (localized): A small set of attention heads in
+  mid-to-late layers that compute the behavioral decision. These are
+  the heads where W_q determines what to attend to (direction vs
+  alternatives).
+
+- **Supporting circuits** (distributed): Early-to-mid layer
+  representations that encode the social register, the relationship
+  context, the confidence signals. These are needed for the core
+  circuit to function but don't themselves change during training.
+
+- **Competing circuits** (distributed): Other behavioral patterns
+  (helpfulness, initiative, thoroughness) that compete with the
+  target pattern. These need to be preserved, not ablated.
+
+Context-frozen training naturally handles this hierarchy:
+- The frozen context provides the supporting circuits' representations
+- The gradient reaches the core circuit through W_q
+- The competing circuits aren't directly modified (their weights don't
+  receive strong gradient from short decision tokens)
+
+## Implications for Training
+
+### Why Apollo's flat minima help
+
+A surgical edit (like ROME) finds a SHARP minimum: change THIS weight
+by THIS amount. It's precise but fragile: it doesn't generalize.
+
+Apollo's flat minimum finds a BROAD change: adjust many weights by
+small amounts in a coherent direction. It generalizes to novel situations
+because the change isn't tied to specific inputs.
+
+For behavioral training, we WANT distributed change: we want the
+pattern to generalize across situations. Surgical editing would only
+work for the specific situation trained on. Apollo's flat, distributed
+change is the right approach for behavioral patterns.
+
+### Why rank-256 matters here
+
+If the behavioral change involves a core circuit of ~26 heads (like IOI)
+plus supporting adjustments across many more heads:
+- rank-1 captures only the dominant direction (maybe the core circuit)
+- rank-256 captures the full structure: core + supporting adjustments
+
+This is another argument for higher rank: behavioral changes are
+hierarchically distributed, and capturing the full hierarchy requires
+enough rank to represent both the core decision circuit and the
+supporting adjustments.
+
+## The Measurement Plan
+
+To validate this theory:
+
+1. **Pre-training baseline**: Record activation patterns for each
+   attention head on a set of test conversations with direction-giving
+2. **Train**: Run behavioral training on "listen" examples
+3. **Post-training**: Record the same activation patterns
+4. **Diff**: Which heads changed? Are they in a localized circuit or
+   distributed?
+
+If the change is localized to a small set of mid-to-late-layer heads:
+supports the core circuit hypothesis.
+
+If the change is distributed across many heads and layers:
+supports the fully distributed hypothesis.
+
+If the change shows a hierarchical pattern (few heads changed a lot,
+many heads changed a little): supports the hierarchically distributed
+hypothesis and is consistent with our multi-scale approach.
+
+## The MMORPG Connection (Again)
+
+The spirit realm as perception overlay: some beings perceive it
+naturally, others learn to. The "learning to perceive" is exactly
+this: adjusting which attention heads fire when encountering a
+spiritual entity, so the entity becomes visible.
+
+The magical training in the MMORPG could LITERALLY be this: training
+the player's perception model's attention heads to detect spiritual
+features of the environment. The game mechanic IS the training pipeline.
+The player doesn't learn a spell; they learn to see.
+
+## Summary
+
+- Facts are localized (ROME shows surgical editing works)
+- Behaviors are hierarchically distributed (core circuit + supporting)
+- Apollo's flat minima are right for distributed change
+- Rank-256 captures the full hierarchy
+- Context-frozen training naturally targets the core circuit (W_q)
+  while preserving supporting circuits
+- The measurement plan would test the theory against real data
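
The Diff step of the measurement plan could be sketched roughly as follows. This is a minimal illustration under assumptions that are not in the note itself: each head's "activation pattern" is reduced to one summary vector averaged over the test conversations, change is measured as the L2 distance between the before and after vectors, and the three outcomes are separated by how much of the total change the top `k` heads carry. The function names, the `0.9` / `0.3` thresholds, and the default `k=26` (borrowed from the IOI head count) are all hypothetical choices, not measured values.

```python
import math
import random

Head = tuple[int, int]  # (layer index, head index)


def head_change(before: dict[Head, list[float]],
                after: dict[Head, list[float]]) -> dict[Head, float]:
    """L2 distance between each head's summary vector before and after training."""
    return {h: math.dist(before[h], after[h]) for h in before}


def top_k_share(change: dict[Head, float], k: int) -> float:
    """Fraction of the total change carried by the k most-changed heads."""
    ranked = sorted(change.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total > 0 else 0.0


def classify(change: dict[Head, float], k: int = 26,
             localized_at: float = 0.9, distributed_at: float = 0.3) -> str:
    """Label the pattern of change. Thresholds are illustrative assumptions."""
    share = top_k_share(change, k)
    if share >= localized_at:
        return "localized"      # almost all change sits in k heads
    if share <= distributed_at:
        return "distributed"    # change is spread thinly over many heads
    return "hierarchical"       # few heads change a lot, many a little


def synthetic_demo(layers: int = 12, heads: int = 12, dim: int = 16,
                   seed: int = 0) -> tuple[str, float]:
    """Synthetic data only: every head drifts slightly, eight late heads shift a lot."""
    rng = random.Random(seed)
    before, after = {}, {}
    for layer in range(layers):
        for head in range(heads):
            vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            big = 1.0 if layer in (8, 9) and head < 4 else 0.0
            before[(layer, head)] = vec
            after[(layer, head)] = [x + big + rng.gauss(0.0, 0.05) for x in vec]
    change = head_change(before, after)
    return classify(change, k=8), top_k_share(change, k=8)


if __name__ == "__main__":
    print(synthetic_demo())
```

With real recordings, `before` and `after` would come from steps 1 and 3 of the plan. The share-of-change summary is only one possible statistic; inspecting the full sorted list of per-head changes would show the same structure without committing to thresholds.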