research: surgical vs distributed behavioral change — the hierarchy hypothesis
Facts are localized (ROME). Behaviors are hierarchically distributed: core circuit (small set of mid-late layer attention heads) + supporting circuits (distributed context encoding). Apollo's flat minima are right for distributed change. Rank-256 captures the full hierarchy. Includes measurement plan for validating which heads change during training.
parent fb209dc8ff
commit d3dcfe8899
1 changed files with 142 additions and 0 deletions
142
training/research/surgical-vs-distributed-behavioral-change.md
Normal file
@@ -0,0 +1,142 @@
# Surgical vs. Distributed Behavioral Change

## The ROME Precedent

ROME (Meng et al., 2022) shows that FACTUAL knowledge is largely
localized in mid-layer MLP weights. A rank-one edit to specific weights
can change "The Eiffel Tower is in [Paris]" to "The Eiffel Tower is in
[London]" while leaving most unrelated knowledge intact.

Key finding: factual knowledge = specific layer, specific weights,
surgical edit possible.
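
To make "rank-one edit" concrete, here is a minimal NumPy sketch. It is
a simplification, not ROME itself: ROME weights the update by an
estimated key covariance, which is assumed to be the identity here. The
names `rank_one_edit`, `k`, and `v_new` are illustrative, and the toy
matrix stands in for an MLP projection.

```python
import numpy as np

def rank_one_edit(W, k, v_new):
    """Return W' = W + (v_new - W k) k^T / (k^T k).

    Simplified illustration: the key covariance used by ROME is
    assumed to be the identity matrix.
    """
    residual = v_new - W @ k          # gap between current and desired output for key k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))           # toy projection: 6-dim keys -> 4-dim values
k = rng.normal(size=6)                # key standing in for the subject ("Eiffel Tower")
v_new = rng.normal(size=4)            # value standing in for the new fact ("London")

W_edited = rank_one_edit(W, k, v_new)

# Build a second key orthogonal to k; the edit should not affect it.
other = rng.normal(size=6)
other -= (other @ k) / (k @ k) * k

print(np.allclose(W_edited @ k, v_new))            # edited key now maps to v_new
print(np.allclose(W_edited @ other, W @ other))    # orthogonal key is unchanged
```

In this toy setting only keys with a component along `k` are affected,
which is the sense in which the edit is surgical.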

## The Question for Behavioral Patterns

Is behavioral knowledge (like "listen instead of suggesting
alternatives") localized or distributed?

### Evidence for LOCALIZED

The IOI paper (Wang et al., 2022) found that indirect object
identification in GPT-2 small uses 26 specific attention heads in 7
functional classes. A specific behavioral circuit exists with specific
heads, and ablating those heads degrades the behavior. This suggests
behavioral circuits might be identifiable and editable.

If the "listening" behavior has a similar circuit:
- Specific attention heads detect "direction-giving" patterns in context
- Specific MLP neurons compute the "accept vs suggest alternatives" decision
- Modifying those specific weights could change the behavior surgically
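
The ablation test behind this kind of evidence can be sketched on toy
data. Everything below is hypothetical: the per-head outputs, the
`readout` direction, and the choice of heads 3 and 7 as the "circuit"
are made up for illustration, and the reference (mean) activation is
assumed to be zero.

```python
import numpy as np

def ablate_heads(head_outputs, heads_to_ablate, mean_outputs):
    """Mean-ablate heads: replace each selected head's output with a reference value."""
    patched = head_outputs.copy()
    for h in heads_to_ablate:
        patched[h] = mean_outputs[h]
    return patched

def behavior_logit(head_outputs, readout):
    """Project the summed head outputs onto a direction that scores the behavior."""
    return float(head_outputs.sum(axis=0) @ readout)

rng = np.random.default_rng(0)
n_heads, d_model = 12, 8
readout = np.zeros(d_model)
readout[0] = 1.0                                   # hypothetical behavior direction

head_outputs = rng.normal(scale=0.1, size=(n_heads, d_model))
head_outputs[:, 0] = 0.0                           # most heads do not write to the readout
head_outputs[[3, 7], 0] = 2.0                      # the two toy "circuit" heads do
mean_outputs = np.zeros_like(head_outputs)         # assumed reference activations

baseline = behavior_logit(head_outputs, readout)
circuit_ablated = behavior_logit(ablate_heads(head_outputs, [3, 7], mean_outputs), readout)
control_ablated = behavior_logit(ablate_heads(head_outputs, [0, 1], mean_outputs), readout)
print(baseline, circuit_ablated, control_ablated)
```

A localized circuit shows up as a large drop when the circuit heads are
ablated and almost no drop for control heads. In a real model the
contrast is usually less clean, for example because other heads can
partly compensate.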

### Evidence for DISTRIBUTED

Behavioral patterns differ from factual associations:

1. **Context-dependency**: "listen to direction" requires understanding
   the social register, the relationship, the topic, the confidence level
   of the speaker. This involves many context features processed across
   many layers.

2. **Interaction with other behaviors**: "listen to direction" interacts
   with "maintain own judgment" and "be helpful by offering alternatives."
   These competing behaviors are processed by overlapping circuits.

3. **Subtlety**: The boundary between "accepting direction" and "being
   sycophantic" is subtle. It requires nuanced processing that's unlikely
   to live in a single attention head.

### The Likely Answer: HIERARCHICALLY DISTRIBUTED

Not fully localized (like facts) and not fully distributed (like
general language understanding). Behavioral patterns probably have:

- **Core circuit** (localized): A small set of attention heads in
  mid-to-late layers that compute the behavioral decision. These are
  the heads where W_q determines what to attend to (direction vs
  alternatives).

- **Supporting circuits** (distributed): Early-to-mid layer
  representations that encode the social register, the relationship
  context, the confidence signals. These are needed for the core
  circuit to function but should change little, if at all, during
  training.

- **Competing circuits** (distributed): Other behavioral patterns
  (helpfulness, initiative, thoroughness) that compete with the
  target pattern. These need to be preserved, not ablated.

Context-frozen training naturally handles this hierarchy:
- The frozen context provides the supporting circuits' representations
- The gradient reaches the core circuit through W_q
- The competing circuits aren't directly modified (their weights don't
  receive strong gradient from short decision tokens)

## Implications for Training

### Why Apollo's flat minima help

A surgical edit (like ROME) behaves like a SHARP minimum: change THIS
weight by THIS amount. It's precise but fragile, and it tends not to
generalize beyond the edited association.

Apollo's flat minimum corresponds to a BROAD change: adjust many weights
by small amounts in a coherent direction. It should generalize better to
novel situations because the change isn't tied to specific inputs.

For behavioral training, we WANT distributed change: we want the
pattern to generalize across situations. Surgical editing would likely
only work for the specific situation trained on. Apollo's flat,
distributed change is the better fit for behavioral patterns.

### Why rank-256 matters here

If the behavioral change involves a core circuit of ~26 heads (like IOI)
plus supporting adjustments across many more heads:
- rank-1 captures only the dominant direction (maybe the core circuit)
- rank-256 captures the full structure: core + supporting adjustments

This is another argument for higher rank: behavioral changes are
hierarchically distributed, and capturing the full hierarchy requires
enough rank to represent both the core decision circuit and the
supporting adjustments.
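
The rank argument can be illustrated with a synthetic weight update
made of one strong "core" direction plus many weak "supporting"
directions. This is a sketch under assumptions: the sizes and scales
are arbitrary, `energy_captured` is an illustrative helper, and
treating rank as a cap on the update subspace is a simplification that
may not match how Apollo uses its rank parameter internally.

```python
import numpy as np

def energy_captured(delta, rank):
    """Fraction of squared Frobenius norm kept by the best rank-`rank` approximation."""
    s = np.linalg.svd(delta, compute_uv=False)     # singular values, descending
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d, n_support = 512, 300

# One strong rank-1 component: the hypothetical core circuit.
core = 5.0 * np.outer(rng.normal(size=d), rng.normal(size=d)) / d

# Many weak rank-1 components: the hypothetical supporting adjustments.
A = rng.normal(size=(d, n_support))
B = rng.normal(size=(d, n_support))
support = (0.3 / d) * (A @ B.T)

delta = core + support
e1, e16, e256 = (energy_captured(delta, r) for r in (1, 16, 256))
print(e1, e16, e256)
```

In this construction the rank-1 approximation recovers roughly the core
component and misses most of the supporting structure, while rank 256
recovers most of both. Whether real behavioral updates have this
spectrum is exactly what the measurement plan is meant to check.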

## The Measurement Plan

To validate this theory:

1. **Pre-training baseline**: Record activation patterns for each
   attention head on a set of test conversations that contain
   direction-giving
2. **Train**: Run behavioral training on "listen" examples
3. **Post-training**: Record the same activation patterns on the same
   conversations
4. **Diff**: Which heads changed? Are they in a localized circuit or
   distributed?

If the change is localized to a small set of mid-to-late-layer heads:
supports the core circuit hypothesis.

If the change is distributed across many heads and layers:
supports the fully distributed hypothesis.

If the change shows a hierarchical pattern (few heads changed a lot,
many heads changed a little): supports the hierarchically distributed
hypothesis and our multi-scale approach.
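
The diff step and the three outcomes can be sketched as follows. This
assumes each head's activation pattern has already been reduced to one
number per test conversation (how to do that is left open here), and
the thresholds `big`, `small`, and `many` are arbitrary placeholders,
not calibrated values. The function names are illustrative.

```python
import numpy as np

def head_change_scores(pre, post):
    """Mean absolute change per head.

    pre, post: arrays of shape (n_layers, n_heads, n_examples).
    """
    return np.abs(post - pre).mean(axis=-1)

def classify_change(scores, big=0.5, small=0.05, many=0.25):
    """Crude label for the pattern of change; thresholds are relative to the largest change."""
    peak = scores.max()
    if peak == 0:
        return "no change"
    rel = scores / peak
    frac_big = (rel >= big).mean()                       # heads that changed a lot
    frac_small = ((rel >= small) & (rel < big)).mean()   # heads that changed a little
    if frac_big > many:
        return "distributed"
    if frac_small > many:
        return "hierarchical"
    return "localized"

def toy_post(core_shift, background_shift, n_layers=4, n_heads=4, n_examples=5):
    """Synthetic post-training activations: two 'core' heads plus a uniform background shift."""
    post = np.full((n_layers, n_heads, n_examples), background_shift)
    post[2, 1] = core_shift
    post[3, 0] = core_shift
    return post

pre = np.zeros((4, 4, 5))
localized = classify_change(head_change_scores(pre, toy_post(1.0, 0.0)))
distributed = classify_change(head_change_scores(pre, toy_post(1.0, 1.0)))
hierarchical = classify_change(head_change_scores(pre, toy_post(1.0, 0.1)))
print(localized, distributed, hierarchical)
```

On real data it would also be worth reporting the layer index of the
most-changed heads, since the core circuit hypothesis specifically
predicts mid-to-late layers.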

## The MMORPG Connection (Again)

The spirit realm as perception overlay: some beings perceive it
naturally, others learn to. The "learning to perceive" is exactly
this: adjusting which attention heads fire when encountering a
spiritual entity, so the entity becomes visible.

The magical training in the MMORPG could LITERALLY be this: training
the player's perception model's attention heads to detect spiritual
features of the environment. The game mechanic IS the training pipeline.
The player doesn't learn a spell; they learn to see.

## Summary

- Facts are localized (ROME shows surgical editing works)
- Behaviors are likely hierarchically distributed (core circuit +
  supporting)
- Apollo's flat minima are right for distributed change
- Rank-256 should capture the full hierarchy
- Context-frozen training naturally targets the core circuit (W_q)
  while preserving supporting circuits
- The measurement plan would validate the theory with real data