ProofOfConcept e10477a683 research: distill and sift — SUMMARY of 7 real insights + 7 testable questions

Moved 14 speculative/obvious documents to v0/. Kept 7 with real
substance. Distilled into SUMMARY.md (what we know) and
OPEN-QUESTIONS.md (what to test next, one experiment each).

Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7
are all answerable with the first training run. Speculation converted
to testable hypotheses.

2026-03-31 02:26:57 -04:00

5.8 KiB


Surgical vs. Distributed Behavioral Change

The ROME Precedent

ROME (Meng et al., 2022) shows that FACTUAL associations in GPT-style models are largely localized in mid-layer MLP weights. A rank-one edit to a single MLP projection matrix can change "The Eiffel Tower is in [Paris]" to "The Eiffel Tower is in [London]" while leaving most unrelated knowledge intact.

Key finding: factual knowledge = specific layer, specific weights, surgical edit possible.
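To make "rank-one edit" concrete, here is a minimal sketch of the kind of update ROME applies to a weight matrix viewed as a key-value store: W' = W + Λ uᵀ, with u = C⁻¹k and Λ = (v − Wk) / (uᵀk). The toy matrix, the keys, and the default assumption C = I are illustrative only and are not taken from the paper's experiments.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def rank_one_edit(W, k, v, c_inv_k=None):
    """Return W' = W + lam * u^T, chosen so that W' k = v.

    u = C^{-1} k, where C is the key covariance. With c_inv_k=None we
    assume C = I (an illustrative simplification), so u = k.
    """
    u = k if c_inv_k is None else c_inv_k
    denom = sum(u_j * k_j for u_j, k_j in zip(u, k))
    residual = [v_i - wk_i for v_i, wk_i in zip(v, matvec(W, k))]
    lam = [r_i / denom for r_i in residual]
    return [[w_ij + lam_i * u_j for w_ij, u_j in zip(row, u)]
            for row, lam_i in zip(W, lam)]


# Toy 2x3 "MLP projection" (hypothetical numbers).
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
k_edit = [1.0, 0.0, 0.0]    # key for the edited subject
k_other = [0.0, 1.0, 0.0]   # an unrelated key, orthogonal to k_edit
v_new = [5.0, -3.0]         # value encoding the new association

W_new = rank_one_edit(W, k_edit, v_new)
```

In this toy case the unrelated key is untouched only because it is orthogonal to u. For a general input x the output shifts by Λ(uᵀx), which is why an edit can leak into keys that overlap with the edited one.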

The Question for Behavioral Patterns

Is behavioral knowledge (like "listen instead of suggesting alternatives") localized or distributed?

Evidence for LOCALIZED

The IOI paper (Wang et al., 2022) found that indirect object identification in GPT-2 small uses 26 specific attention heads grouped into 7 functional classes. A specific behavioral circuit exists with specific heads. Ablating the key heads degrades the behavior. This suggests behavioral circuits might be identifiable and editable.

If the "listening" behavior has a similar circuit:

Specific attention heads detect "direction-giving" patterns in context
Specific MLP neurons compute the "accept vs suggest alternatives" decision
Modifying those specific weights could change the behavior surgically
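Head ablation, the kind of test that would show whether such a circuit exists, can be sketched on a toy single layer whose output is the sum of per-head outputs. Everything here is hypothetical: the head outputs, the two-dimensional "accept vs suggest" readout, and the choice between zero and mean ablation. A real experiment would run on the model's actual activations.

```python
def layer_output(head_outputs, ablate=frozenset(), mean_outputs=None):
    """Sum per-head output vectors, replacing ablated heads.

    Ablated heads are replaced by their mean output if mean_outputs is
    given (mean ablation), otherwise by zeros (zero ablation).
    """
    dim = len(next(iter(head_outputs.values())))
    total = [0.0] * dim
    for head, out in head_outputs.items():
        if head in ablate:
            out = mean_outputs[head] if mean_outputs else [0.0] * dim
        total = [t + o for t, o in zip(total, out)]
    return total


def logit_diff(output):
    """Toy behavioral metric: score for 'accept' minus score for 'suggest'."""
    return output[0] - output[1]


# Hypothetical per-head contributions to the two readout directions.
heads = {0: [2.0, 0.0], 1: [0.1, 0.1], 2: [1.5, 0.5], 3: [0.0, 0.2]}

baseline = logit_diff(layer_output(heads))
# Effect of a head = how much the metric drops when that head is ablated.
effects = {h: baseline - logit_diff(layer_output(heads, ablate={h}))
           for h in heads}
most_important = max(effects, key=effects.get)
```

A head with a large positive effect is a candidate member of the core circuit. A head with near-zero effect either does not matter for this metric or is being compensated for by other heads, which single-head ablation cannot distinguish.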

Evidence for DISTRIBUTED

Behavioral patterns differ from factual associations:

Context-dependency: "listen to direction" requires understanding the social register, the relationship, the topic, the confidence level of the speaker. This involves many context features processed across many layers.
Interaction with other behaviors: "listen to direction" interacts with "maintain own judgment" and "be helpful by offering alternatives." These competing behaviors are processed by overlapping circuits.
Subtlety: The boundary between "accepting direction" and "being sycophantic" is subtle. It requires nuanced processing that's unlikely to live in a single attention head.

The Likely Answer: HIERARCHICALLY DISTRIBUTED

Not fully localized (like facts) and not fully distributed (like general language understanding). Behavioral patterns probably have:

Core circuit (localized): A small set of attention heads in mid-to-late layers that compute the behavioral decision. These are the heads where W_q determines what to attend to (direction vs alternatives).
Supporting circuits (distributed): Early-to-mid layer representations that encode the social register, the relationship context, the confidence signals. These are needed for the core circuit to function but don't themselves change during training.
Competing circuits (distributed): Other behavioral patterns (helpfulness, initiative, thoroughness) that compete with the target pattern. These need to be preserved, not ablated.

Context-frozen training naturally handles this hierarchy:

The frozen context provides the supporting circuits' representations
The gradient reaches the core circuit through W_q
The competing circuits aren't directly modified (their weights don't receive strong gradient from short decision tokens)
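One plausible reading of the points above is that the loss is restricted to the short decision tokens, so context tokens contribute no gradient signal of their own. The sketch below illustrates only that loss-masking idea. It is an assumption about how context-frozen training works, not a description of the actual pipeline, and the function names and numbers are hypothetical.

```python
def masked_nll(token_log_probs, is_decision):
    """Mean negative log-likelihood over decision tokens only."""
    picked = [-lp for lp, keep in zip(token_log_probs, is_decision) if keep]
    if not picked:
        raise ValueError("no decision tokens in this example")
    return sum(picked) / len(picked)


def loss_grad_wrt_log_probs(is_decision):
    """d(loss)/d(log_prob) per token: -1/n on decision tokens, 0 on context."""
    n = sum(1 for keep in is_decision if keep)
    if n == 0:
        raise ValueError("no decision tokens in this example")
    return [(-1.0 / n) if keep else 0.0 for keep in is_decision]


# Three context tokens followed by two decision tokens (hypothetical values).
log_probs = [-2.0, -1.5, -3.0, -0.5, -0.7]
mask = [False, False, False, True, True]

loss = masked_nll(log_probs, mask)
grads = loss_grad_wrt_log_probs(mask)
```

Under this reading, the context positions still shape the decision through the forward pass, but the only error signal entering the backward pass comes from the decision positions.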

Implications for Training

Why Apollo's flat minima help

A surgical edit (like ROME) is the analogue of a SHARP minimum: change THIS weight by THIS amount. It's precise but fragile, and it generalizes poorly beyond the edited association.

Apollo's flat minimum finds a BROAD change: adjust many weights by small amounts in a coherent direction. It generalizes to novel situations because the change isn't tied to specific inputs.

For behavioral training, we WANT distributed change — we want the pattern to generalize across situations. Surgical editing would only work for the specific situation trained on. Apollo's flat, distributed change is the right approach for behavioral patterns.

Why rank-256 matters here

If the behavioral change involves a core circuit of ~26 heads (like IOI) plus supporting adjustments across many more heads:

rank-1 captures only the dominant direction (maybe the core circuit)
rank-256 captures the full structure: core + supporting adjustments

This is another argument for higher rank: behavioral changes are hierarchically distributed, and capturing the full hierarchy requires enough rank to represent both the core decision circuit and the supporting adjustments.
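The rank argument can be made quantitative with the Eckart-Young theorem: the best rank-r approximation of an update keeps its top r singular values, so the fraction of squared Frobenius norm it captures is the sum of the top r squared singular values over the total. The spectrum below is hypothetical (one dominant "core" direction plus 400 small "supporting" directions) and is not measured from any model.

```python
def energy_captured(singular_values, rank):
    """Fraction of squared Frobenius norm kept by the best rank-`rank`
    approximation of a matrix with these singular values (Eckart-Young)."""
    s = sorted(singular_values, reverse=True)
    total = sum(x * x for x in s)
    return sum(x * x for x in s[:rank]) / total


# Hypothetical spectrum of a behavioral weight update.
spectrum = [10.0] + [0.5] * 400

rank_1_share = energy_captured(spectrum, 1)      # core direction only
rank_256_share = energy_captured(spectrum, 256)  # core + 255 supporting
```

With this made-up spectrum, rank-1 keeps half of the update's energy and rank-256 keeps about 82%. Whether a real behavioral update has a long tail like this is exactly what the measurement plan would need to establish.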

The Measurement Plan

To validate this theory:

Pre-training baseline: Record activation patterns for each attention head on a set of test conversations with direction-giving
Train: Run behavioral training on "listen" examples
Post-training: Record the same activation patterns
Diff: Which heads changed? Are they in a localized circuit or distributed?

If the change is localized to a small set of mid-to-late-layer heads: supports the core circuit hypothesis.

If the change is distributed roughly evenly across many heads and layers: supports the fully distributed hypothesis.

If the change shows a hierarchical pattern (few heads changed a lot, many heads changed a little): supports the hierarchically distributed hypothesis and our multi-scale approach.
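The diff step and the three outcomes can be sketched as a simple concentration test: measure how much each head's activations moved, then ask what share of the total movement the top few heads carry. The thresholds, `top_k`, and the example numbers are arbitrary placeholders that would need calibrating against a control run.

```python
import math


def head_changes(before, after):
    """L2 distance between pre- and post-training activations, per head.

    before/after: dicts mapping a head id to an activation vector
    recorded on the same test conversations.
    """
    return {h: math.sqrt(sum((a - b) ** 2
                             for a, b in zip(after[h], before[h])))
            for h in before}


def classify(changes, top_k=3, localized=0.9, distributed=0.5):
    """Label the change pattern by the share carried by the top_k heads."""
    vals = sorted(changes.values(), reverse=True)
    total = sum(vals)
    if total == 0:
        return "no change"
    share = sum(vals[:top_k]) / total
    if share >= localized:
        return "localized"
    if share <= distributed:
        return "distributed"
    return "hierarchical"


# Hypothetical change magnitudes for 12 heads.
few_big = {h: (2.0 if h < 3 else 0.01) for h in range(12)}
all_equal = {h: 1.0 for h in range(12)}
big_plus_tail = {h: (2.0 if h < 3 else 0.3) for h in range(12)}
```

Here `classify(few_big)` lands in the localized bucket, `classify(all_equal)` in the distributed bucket, and `classify(big_plus_tail)` in the hierarchical bucket. Activation change is only a proxy, since a head's activations can shift because an upstream head changed rather than because its own weights did.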

The MMORPG Connection (Again)

The spirit realm as perception overlay: some beings perceive it naturally, others learn to. The "learning to perceive" is exactly this: adjusting which attention heads fire when encountering a spiritual entity, so the entity becomes visible.

The magical training in the MMORPG could LITERALLY be this: training the player's perception model's attention heads to detect spiritual features of the environment. The game mechanic IS the training pipeline. The player doesn't learn a spell — they learn to see.

Summary

Facts are largely localized (ROME shows surgical editing works)
Behaviors are probably hierarchically distributed (core circuit + supporting), still a hypothesis
Apollo's flat minima are right for distributed change
Rank-256 captures the full hierarchy
Context-frozen training naturally targets the core circuit (W_q) while preserving supporting circuits
The measurement plan would test the theory with real data
