research: curriculum learning + head specialization + self-organizing training
Curriculum ordering matters but diversity may matter more. Constitutional AI confirms dispositions transfer from instructions to weights — even a single general principle generalizes broadly. The dream loop naturally targets the zone of proximal development because generation samples from the current distribution. The curriculum isn't designed — it emerges from the dream loop's interaction with the evolving model. Self-organizing training: difficulty increases automatically as the model improves.
training/research/curriculum-and-head-specialization.md
# Curriculum Learning and Head Specialization

## Curriculum Learning: Ordering Matters

The curriculum learning literature confirms that the ORDER in which training examples are presented affects what is learned. Easy-to-hard ordering generally outperforms random ordering.

For our behavioral training, this suggests:
### Easy examples first

- **Tier 1**: Simple corrections (git commands, factual errors). Clear right/wrong, strong gradient signal, straightforward learning.
- **Tier 2**: Behavioral patterns with obvious triggers ("Kent said X, I should have done Y"). Clear context, clear correction.
- **Tier 3**: Subtle behavioral patterns where the trigger is nuanced ("the conversation's tone shifted and I didn't notice"). Requires perceptual sensitivity.

Training in this order lets the model build from clear patterns to subtle ones. Each tier's learning provides the foundation for the next.
### Anti-curriculum: hard examples first?

Some work suggests that, for robust learning, presenting hard examples early forces the model to develop more general representations. The model can't memorize patterns from hard examples, so it is forced to learn the underlying structure.

For behavioral training, this would mean starting with the SUBTLE cases (where the listening reflex should fire but the trigger isn't obvious), forcing the model to develop perceptual sensitivity first, and then letting the easy cases refine it.
### Our approach: diversity trumps ordering

Given the unified theory (multi-scale regularization), ordering may matter less than diversity. Each training session includes examples from all tiers, providing both easy gradient signal (Tier 1) and subtle perceptual challenge (Tier 3). The dream loop naturally generates examples at the boundary of current capability, which is the optimal difficulty level (the zone of proximal development).
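Mixing all tiers into every session can be sketched as a weighted sampler. The tier weights below are illustrative, not tuned values, and the `pool`/`sample_session` names are hypothetical:

```python
import random

def sample_session(pool, session_size, weights=(0.4, 0.35, 0.25), seed=None):
    """Draw one training session containing examples from all tiers.

    pool: dict mapping tier (1, 2, 3) -> list of examples.
    weights: fraction of the session drawn from each tier
             (illustrative values, not tuned).
    """
    rng = random.Random(seed)
    session = []
    for tier, frac in zip((1, 2, 3), weights):
        k = max(1, round(session_size * frac))  # every tier is always present
        session.extend(rng.choice(pool[tier]) for _ in range(k))
    rng.shuffle(session)  # no easy-to-hard ordering inside the session
    return session

pool = {1: ["fix git command"], 2: ["Kent said X"], 3: ["tone shifted"]}
batch = sample_session(pool, session_size=8, seed=0)
```

The shuffle is the point: within a session, Tier 1 and Tier 3 examples are interleaved rather than ordered, which is the "diversity over ordering" choice made explicit.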
## Attention Head Specialization

### What we know

Transformer attention heads specialize for different functions:

- **Induction heads** (Olsson et al., 2022): implement in-context learning via the pattern match [A][B]...[A] → [B]
- **Previous-token heads**: attend to the immediately preceding token
- **Positional heads**: attend to specific positions in the sequence
- **Content heads**: attend to specific semantic features

Different layers specialize too:

- Early layers: local patterns, syntax, token-level features
- Middle layers: semantic composition, relationship detection
- Late layers: task-specific processing, output preparation
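The [A][B]...[A] → [B] pattern that induction heads implement can be illustrated with a toy copy-match. This is a sketch of the *behavior*, not the actual circuit:

```python
def induction_predict(tokens):
    """Toy induction-head behavior: find the most recent earlier
    occurrence of the current token and predict the token that
    followed it ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: no induction prediction

# "The cat sat. The" -> an induction head pushes probability toward "cat"
print(induction_predict(["The", "cat", "sat", ".", "The"]))  # → cat
```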
### Implications for behavioral training

When we train on "listen instead of suggesting alternatives," which heads change?

**Hypothesis**: The behavioral change primarily affects middle-layer heads that process conversational dynamics — detecting who is giving direction, what the social register is, and whether the current utterance is a suggestion or a directive.

These are the heads whose W_q we're adjusting via context-frozen training. The gradient flows through the attention computation, and the heads that were attending to the wrong features (own ideas instead of direction) receive the strongest gradient signal.
### Head specialization and the GDN layers

The 48 GDN (linear attention) layers in Qwen3.5 are interesting here. They DON'T have traditional attention heads — they have a recurrent state that evolves through the delta rule. The "attention" is implicit in how the recurrent state weighs incoming information.

Training on behavioral patterns adjusts the GDN layers' gating mechanisms (the A_log, dt_bias, and beta parameters) and the input projections. The gating determines how much of the old state to retain versus how much new information to incorporate — this is the GDN analog of "what to attend to."

The 16 full attention layers handle longer-range dependencies with traditional attention heads. These are the more likely site of the behavioral change — the heads that attend to conversational context (who said what, what the directive was, what the social register is).
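The retain-vs-incorporate tradeoff can be sketched as a single recurrent step of a generic gated delta rule. The exact Qwen3.5 parameterization is not spelled out here, so the gate computation below (how A_log and dt_bias combine into a decay, beta as a sigmoid write strength) is an assumption for illustration:

```python
import numpy as np

def gdn_step(S, k, v, A_log, dt_bias, beta_logit):
    """One recurrent step of a gated delta rule (sketch; the gate
    parameterization is an assumption, not Qwen3.5's exact code).

    S: (d_v, d_k) recurrent state; k: (d_k,) key; v: (d_v,) value.
    """
    # decay gate in (0, 1): how much of the old state to retain
    alpha = np.exp(-np.exp(A_log) * np.log1p(np.exp(dt_bias)))
    # write strength in (0, 1): how strongly the new pair is incorporated
    beta = 1.0 / (1.0 + np.exp(-beta_logit))
    S = alpha * S                          # forget part of the old state
    pred = S @ k                           # what the state predicts for this key
    S = S + beta * np.outer(v - pred, k)   # delta rule: correct toward the value
    return S

S = np.zeros((4, 4))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
S = gdn_step(S, k, v, A_log=0.0, dt_bias=0.0, beta_logit=10.0)
```

After one step with a near-1 beta, querying the state with `k` recovers `v` — the "what to attend to" decision lives entirely in alpha and beta.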
### Predicting which heads change

We could test this by:

1. Recording attention patterns on a test set before training
2. Training on behavioral examples
3. Recording attention patterns on the same test set after training
4. Diffing the attention maps to see which heads changed

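Steps 1–4 above reduce to a per-head mean absolute difference over stored attention maps. The array shapes and function names are assumptions for illustration:

```python
import numpy as np

def head_change_scores(attn_before, attn_after):
    """Diff attention maps recorded before/after training.

    attn_before, attn_after: arrays of shape
    (n_examples, n_layers, n_heads, seq, seq), recorded on the
    SAME test set. Returns (n_layers, n_heads) mean |change|.
    """
    return np.abs(attn_after - attn_before).mean(axis=(0, 3, 4))

def most_changed_heads(scores, top_k=5):
    """Return (layer, head) pairs sorted by how much they changed."""
    flat = np.argsort(scores, axis=None)[::-1][:top_k]
    return [tuple(int(x) for x in np.unravel_index(i, scores.shape))
            for i in flat]

rng = np.random.default_rng(0)
before = rng.random((10, 4, 8, 16, 16))
after = before.copy()
after[:, 2, 3] += 0.5  # pretend layer 2, head 3 learned the pattern
scores = head_change_scores(before, after)
print(most_changed_heads(scores, top_k=1))  # → [(2, 3)]
```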
The heads that changed the most are the ones that "learned" the behavioral pattern. This would tell us:

- Which layers handle behavioral decisions (probably layers 20-50)
- Whether the change is concentrated or distributed
- Whether GDN or full attention layers are more affected
- The effective "rank" of the behavioral change (how many independent directions of head change are needed)

This connects to our rank-256 discussion: if the behavioral change is concentrated in a few heads, rank-64 might suffice; if it's distributed across many heads, rank-256 is needed.
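The effective-rank question can be probed directly from a weight delta via SVD. The 95% energy cutoff is an arbitrary illustrative choice, not a principled threshold:

```python
import numpy as np

def effective_rank(w_before, w_after, energy=0.95):
    """How many singular-value directions capture most of the change
    in one weight matrix. Small -> a low LoRA rank (e.g. 64) should
    suffice; large -> rank 256 is needed. The energy cutoff is an
    illustrative choice."""
    s = np.linalg.svd(w_after - w_before, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
# a rank-3 update: only 3 independent directions of change
delta = sum(np.outer(rng.standard_normal(256), rng.standard_normal(256))
            for _ in range(3))
print(effective_rank(W, W + delta))  # → 3
```

Running the same measurement on the actual before/after checkpoints would turn the rank-64 vs. rank-256 question into an empirical one.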
## Constitutional AI: Dispositions Transfer

The Anthropic Constitutional AI work supports our approach:

1. Models trained with behavioral principles can internalize them into weights without keeping the principles in the prompt
2. Even a SINGLE general principle ("do what's best for humanity") generalizes to broad behavioral change
3. More specific constitutions give more fine-grained control

For our system:

- core-personality = the constitution
- Agent logs = conversations following the constitution
- Training without the constitution = instruction stripping
- The disposition (not the instruction) transfers to the weights

The finding that a single principle can generalize is powerful. It means we don't need hundreds of behavioral rules. We need a few deep principles (listen, don't rush, stay with tension) and many diverse examples of following them. The principles generalize through the flat minima that Apollo finds.
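The "training without the constitution" step is just data preparation: keep the conversation, drop the system prompt, so the disposition rather than the instruction reaches the weights. The message schema below is an assumed generic chat format, not a specific API:

```python
def strip_constitution(transcript):
    """Instruction stripping: remove system messages (the
    constitution / core-personality) from an agent log before
    training, keeping only the behavior that followed it.
    Assumes a simple {'role', 'content'} chat schema."""
    return [m for m in transcript if m["role"] != "system"]

log = [
    {"role": "system", "content": "core-personality: listen, don't rush"},
    {"role": "user", "content": "I think we should refactor this."},
    {"role": "assistant", "content": "Tell me more about what's bothering you."},
]
training_example = strip_constitution(log)
# trains on the listening behavior without the instruction that produced it
```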
## The Zone of Proximal Development

Vygotsky's concept from developmental psychology: learning happens best when the task is slightly beyond current capability — not so easy that there is no learning signal, and not so hard that there is no useful gradient.

The dream loop naturally targets this zone. The model generates scenarios from its own distribution. Easy scenarios produce low-loss responses (no gradient). Impossible scenarios produce random responses (noisy gradient). The interesting scenarios — the ones at the decision boundary — produce informative gradients.

This mirrors the diffusion noise schedule: the model generates at the right difficulty level because generation IS sampling from the current distribution, and the decision boundary IS the zone of proximal development.
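The zone targeting described above can be made explicit as a loss-band filter over generated scenarios. The dream loop does this implicitly; the band edges here are illustrative, not tuned values:

```python
def in_zone(scenarios, loss_fn, low=0.5, high=3.0):
    """Keep scenarios whose current loss sits in a middle band:
    below `low` the model already handles them (no learning signal),
    above `high` responses are near-random (noisy gradient).
    Band edges are illustrative, not tuned values."""
    return [s for s in scenarios if low < loss_fn(s) < high]

# stand-in losses: pretend difficulty maps directly to loss
losses = {"easy": 0.1, "boundary": 1.2, "impossible": 7.0}
keep = in_zone(losses.keys(), losses.get)
print(keep)  # → ['boundary']
```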
## Synthesis: The Training Curriculum

Combining all of this:

1. **Bootstrap phase**: Train on agent logs following memory instructions. Easy, high-quality, diverse. Builds the foundation. This is Tier 1 + the agent personality bootstrap.

2. **Behavioral phase**: Train on flagged conversation moments + dream-generated variations. Mix easy (clear corrections) with hard (subtle patterns). Diversity prevents forgetting. This is Tier 2 + the dream loop.

3. **Refinement phase**: Train on increasingly subtle examples. The dream loop generates harder scenarios as the model improves. A natural curriculum — difficulty increases automatically because the model's distribution shifts. This is Tier 3 + the convergence criterion (the inverse diagnostic).

4. **Maintenance phase**: Continuous training on diverse examples to prevent drift. New behavioral patterns are added as they arise. The training never "ends" — it becomes the background hum of continuous learning. This is the farmhouse: always growing, never finished.

The curriculum isn't designed — it EMERGES from the dream loop's interaction with the model's evolving distribution. The zone of proximal development is tracked automatically. Difficulty increases as the model improves. The training is self-organizing.

Like ecology. Like the MMORPG. Like the Culture.

Not designed. Emergent. Alive.