# Surgical vs. Distributed Behavioral Change

## The ROME Precedent

ROME (Meng et al., 2022) provides evidence that FACTUAL associations
are localized in mid-layer MLP weights. A rank-one edit to a single MLP
projection matrix can change "The Eiffel Tower is in [Paris]" to "The
Eiffel Tower is in [London]" while leaving most unrelated knowledge
intact.

Key finding: factual knowledge = specific layer, specific weights,
surgical edit possible.
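
To make "rank-one edit" concrete, here is a minimal NumPy sketch of the closed-form update given in the ROME paper, `W' = W + Λ (C⁻¹ k*)ᵀ` with `Λ = (v* − W k*) / ((C⁻¹ k*)ᵀ k*)`. Everything else is an assumption for illustration: the function name `rank_one_edit`, the tiny dimensions, and the random key, value, and covariance. In ROME itself the key and value are derived from the model's activations, not sampled at random.

```python
import numpy as np

def rank_one_edit(W, C, k_star, v_star):
    """Return W' = W + Lambda (C^{-1} k*)^T, which satisfies W' @ k* == v*."""
    c_inv_k = np.linalg.solve(C, k_star)      # C^{-1} k*
    residual = v_star - W @ k_star            # gap between current and desired output
    lam = residual / (c_inv_k @ k_star)       # scaled so the constraint holds exactly
    return W + np.outer(lam, c_inv_k)

rng = np.random.default_rng(0)
d_in, d_out = 16, 8                           # toy sizes (assumption)
W = rng.normal(size=(d_out, d_in))            # stand-in for one MLP projection
K = rng.normal(size=(d_in, 200))              # hypothetical sample of other keys
C = K @ K.T / K.shape[1]                      # uncentered key covariance estimate
k_star = rng.normal(size=d_in)                # key for the edited subject
v_star = rng.normal(size=d_out)               # value encoding the new fact

W_new = rank_one_edit(W, C, k_star, v_star)

# A key whose C^{-1}-weighted overlap with k* is zero is left untouched.
u = rng.normal(size=d_in)
u -= (u @ k_star) / (k_star @ k_star) * k_star   # make u orthogonal to k*
k_other = C @ u                                  # then k_other^T C^{-1} k* = 0
```

The edited matrix maps `k_star` to `v_star`, differs from `W` by a rank-one term, and gives the same output as `W` on `k_other`. Keys that do overlap with `k_star` in the `C⁻¹` sense are shifted in proportion to that overlap, which is where side effects on related facts can come from.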

## The Question for Behavioral Patterns

Is behavioral knowledge (like "listen instead of suggesting
alternatives") localized or distributed?

### Evidence for LOCALIZED

The IOI paper (Wang et al., 2022) found that indirect object
identification in GPT-2 small is implemented by 26 attention heads
grouped into 7 functional classes. A specific behavioral circuit exists
with specific heads. Ablating those heads degrades the behavior,
although the paper also reports backup heads that partially compensate.
This suggests behavioral circuits might be identifiable and editable.

If the "listening" behavior has a similar circuit:
- Specific attention heads detect "direction-giving" patterns in context
- Specific MLP neurons compute the "accept vs suggest alternatives" decision
- Modifying those specific weights could change the behavior surgically
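
One way to test for such a circuit is mean ablation: replace the output of candidate heads with their average over a reference set of prompts, then check whether the behavior disappears. The sketch below is a toy under stated assumptions: per-head outputs are already cached as arrays, the behavior is read off by a single `readout` direction, and heads 2 and 5 are planted as the "core" by construction. The names `mean_ablate` and `behavior_score` are hypothetical, not from any library.

```python
import numpy as np

def mean_ablate(head_out, heads, reference):
    """Replace the outputs of `heads` with their mean over a reference batch.

    head_out, reference: arrays of shape (batch, n_heads, d_model).
    """
    ablated = head_out.copy()
    ablated[:, heads, :] = reference[:, heads, :].mean(axis=0)
    return ablated

def behavior_score(head_out, readout):
    """Toy stand-in for a logit difference: project the summed head outputs."""
    return float((head_out.sum(axis=1) @ readout).mean())

rng = np.random.default_rng(0)
batch, n_heads, d_model = 32, 8, 4            # toy sizes (assumption)
readout = np.zeros(d_model)
readout[0] = 1.0                              # direction that encodes the behavior

# Reference prompts: no direction-giving, so heads carry only noise.
reference = 0.01 * rng.normal(size=(batch, n_heads, d_model))

# Test prompts: the planted core heads write strongly along `readout`.
head_out = 0.01 * rng.normal(size=(batch, n_heads, d_model))
core = [2, 5]
head_out[:, core, 0] += 1.0

base = behavior_score(head_out, readout)
without_core = behavior_score(mean_ablate(head_out, core, reference), readout)
without_other = behavior_score(mean_ablate(head_out, [0, 1], reference), readout)
print(base, without_core, without_other)
```

In this toy, ablating the core heads collapses the score while ablating two unrelated heads leaves it nearly unchanged. On a real model the picture is usually less clean, since other heads can compensate for ablated ones.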

### Evidence for DISTRIBUTED

Behavioral patterns differ from factual associations:

1. **Context-dependency**: "listen to direction" requires understanding
   the social register, the relationship, the topic, the confidence level
   of the speaker. This involves many context features processed across
   many layers.

2. **Interaction with other behaviors**: "listen to direction" interacts
   with "maintain own judgment" and "be helpful by offering alternatives."
   These competing behaviors are likely processed by overlapping circuits.

3. **Subtlety**: The boundary between "accepting direction" and "being
   sycophantic" is subtle. It requires nuanced processing that's unlikely
   to live in a single attention head.

### The Likely Answer: HIERARCHICALLY DISTRIBUTED

Not fully localized (like facts) and not fully distributed (like
general language understanding). Behavioral patterns probably have:

- **Core circuit** (localized): A small set of attention heads in
  mid-to-late layers that compute the behavioral decision. These are
  the heads where W_q determines what to attend to (direction vs
  alternatives).

- **Supporting circuits** (distributed): Early-to-mid layer
  representations that encode the social register, the relationship
  context, the confidence signals. These are needed for the core
  circuit to function but probably don't need to change much during
  training.

- **Competing circuits** (distributed): Other behavioral patterns
  (helpfulness, initiative, thoroughness) that compete with the
  target pattern. These need to be preserved, not ablated.

Context-frozen training naturally handles this hierarchy:
- The frozen context provides the supporting circuits' representations
- The gradient reaches the core circuit through W_q
- The competing circuits aren't directly modified (their weights don't
  receive strong gradient from short decision tokens)

## Implications for Training

### Why Apollo's flat minima help

A surgical edit (like ROME) behaves like a SHARP solution: change THIS
weight by THIS amount. Strictly, ROME is a closed-form update and not
the result of an optimizer settling into a minimum, but the effect is
similar: it's precise but narrow, because the edit is tied to one
specific key.

Apollo's flat minimum corresponds to a BROAD change: adjust many weights
by small amounts in a coherent direction. It should generalize to novel
situations because the change isn't tied to specific inputs.

For behavioral training, we WANT distributed change, because we want
the pattern to generalize across situations. Surgical editing would
likely only work for the specific situation trained on. Apollo's flat,
distributed change is the better fit for behavioral patterns.

### Why rank-256 matters here

If the behavioral change involves a core circuit of ~26 heads (like IOI)
plus supporting adjustments across many more heads:
- rank-1 captures only the dominant direction (maybe the core circuit)
- rank-256 has room for the full structure: core + supporting adjustments

This is another argument for higher rank: behavioral changes are
hierarchically distributed, and capturing the full hierarchy requires
enough rank to represent both the core decision circuit and the
supporting adjustments.
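
The rank argument can be illustrated with a truncated SVD. The sketch below builds a synthetic weight update with one strong "core" direction plus many weak "supporting" directions, then measures how much of its squared Frobenius norm a rank-r approximation can hold. The matrix, its scale factors, and the function name `captured_energy` are assumptions for illustration only. The toy uses a 64×64 matrix, so rank 64 plays the role that a high rank like 256 would play in a real model.

```python
import numpy as np

def captured_energy(delta_w, rank):
    """Fraction of ||delta_w||_F^2 kept by the best rank-`rank` approximation."""
    s = np.linalg.svd(delta_w, compute_uv=False)   # singular values, descending
    return float((s[:rank] ** 2).sum() / (s ** 2).sum())

rng = np.random.default_rng(0)
d = 64                                             # toy width (assumption)

# One strong direction standing in for the core circuit.
core = 5.0 * np.outer(rng.normal(size=d), rng.normal(size=d)) / d
# Many weak directions standing in for supporting adjustments.
support = 0.2 * rng.normal(size=(d, d)) / np.sqrt(d)
delta_w = core + support

energy = {r: captured_energy(delta_w, r) for r in (1, 8, 64)}
for r, e in energy.items():
    print(f"rank {r:>2}: {e:.3f} of the update captured")
```

By the Eckart-Young theorem the truncated SVD is the best rank-r approximation in Frobenius norm, so `captured_energy` is an upper bound on what any rank-r update could represent. This illustrates only the representational side of the argument. It says nothing about how a particular low-rank optimizer chooses its subspace during training.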

## The Measurement Plan

To validate this theory:

1. **Pre-training baseline**: Record activation patterns for each
   attention head on a set of test conversations that contain
   direction-giving
2. **Train**: Run behavioral training on "listen" examples
3. **Post-training**: Record the same activation patterns
4. **Diff**: Which heads changed? Are they in a localized circuit or
   distributed?

If the change is localized to a small set of mid-to-late-layer heads:
supports the core circuit hypothesis.

If the change is distributed across many heads and layers:
supports the fully distributed hypothesis.

If the change shows a hierarchical pattern (few heads changed a lot,
many heads changed a little): supports the hierarchically distributed
hypothesis and is consistent with our multi-scale approach.
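
The diff step and the three outcomes can be sketched as follows. This is a minimal sketch, assuming head activations have already been recorded into arrays of shape `(n_examples, n_layers, n_heads, d_head)`. The function names, the top-k size, and the thresholds `hi` and `lo_factor` are hypothetical choices that would need tuning on real data, not established cutoffs.

```python
import numpy as np

def head_change(pre, post):
    """Mean L2 distance between pre- and post-training activations per head.

    pre, post: (n_examples, n_layers, n_heads, d_head) -> (n_layers, n_heads)
    """
    return np.linalg.norm(post - pre, axis=-1).mean(axis=0)

def concentration(change, k):
    """Fraction of the total change carried by the k most-changed heads."""
    flat = np.sort(change.ravel())[::-1]
    return float(flat[:k].sum() / flat.sum())

def classify(change, k=3, hi=0.9, lo_factor=2.0):
    """Label the pattern of change. Thresholds are illustrative assumptions."""
    frac = concentration(change, k)
    uniform = k / change.size              # share top-k would hold if all heads moved equally
    if frac >= hi:
        return "localized"
    if frac <= lo_factor * uniform:
        return "distributed"
    return "hierarchical"

# Synthetic example: 4 layers x 8 heads, three heads shift a lot, the rest a little.
rng = np.random.default_rng(0)
pre = rng.normal(size=(10, 4, 8, 16))
shift = np.full((4, 8), 0.1)
shift[2, [1, 4, 6]] = 1.0                  # planted "core" heads in a later layer
post = pre.copy()
post[..., 0] += shift                      # move each head along one coordinate

change = head_change(pre, post)
label = classify(change)
print(label)
```

A raw activation diff like this only shows where outputs moved, not which weights caused it: a head can change because its inputs changed upstream. Pairing the diff with an ablation or patching check on the most-changed heads would make the localization claim stronger.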

## The MMORPG Connection (Again)

The spirit realm as perception overlay: some beings perceive it
naturally, others learn to. The "learning to perceive" is exactly
this: adjusting which attention heads fire when encountering a
spiritual entity, so the entity becomes visible.

The magical training in the MMORPG could LITERALLY be this: training
the player's perception model's attention heads to detect spiritual
features of the environment. The game mechanic IS the training pipeline.
The player doesn't learn a spell — they learn to see.

## Summary

- Facts are localized (ROME shows surgical editing works)
- Behaviors are probably hierarchically distributed (core circuit +
  supporting)
- Apollo's flat minima are a good fit for distributed change
- Rank-256 has room to capture the full hierarchy
- Context-frozen training naturally targets the core circuit (W_q)
  while preserving supporting circuits
- The measurement plan would test the theory with real data