research: curriculum learning + head specialization + self-organizing training
Curriculum ordering matters but diversity may matter more. Constitutional AI confirms dispositions transfer from instructions to weights — even a single general principle generalizes broadly. The dream loop naturally targets the zone of proximal development because generation samples from the current distribution. The curriculum isn't designed — it emerges from the dream loop's interaction with the evolving model. Self-organizing training: difficulty increases automatically as the model improves.
training/research/curriculum-and-head-specialization.md
# Curriculum Learning and Head Specialization

## Curriculum Learning: Ordering Matters

The curriculum learning literature confirms that the ORDER in which training examples are presented affects what is learned. Easy-to-hard ordering generally outperforms random ordering.

For our behavioral training, this suggests:
### Easy examples first

- **Tier 1**: Simple corrections (git commands, factual errors). Clear right/wrong, strong gradient signal, straightforward learning.
- **Tier 2**: Behavioral patterns with obvious triggers ("Kent said X, I should have done Y"). Clear context, clear correction.
- **Tier 3**: Subtle behavioral patterns where the trigger is nuanced ("the conversation's tone shifted and I didn't notice"). Requires perceptual sensitivity.

Training in this order lets the model build from clear patterns to subtle ones. Each tier's learning provides the foundation for the next.
### Anti-curriculum: hard examples first?

Some work suggests that, for robust learning, presenting hard examples early forces the model to develop more general representations. The model can't memorize patterns from hard examples, so it is forced to learn the underlying structure.

For behavioral training, this would mean starting with the SUBTLE cases (where the listening reflex should fire but the trigger isn't obvious), forcing the model to develop perceptual sensitivity first, and then letting the easy cases refine it.
### Our approach: diversity trumps ordering

Given the unified theory (multi-scale regularization), ordering may matter less than diversity. Each training session includes examples from all tiers, providing both easy gradient signal (Tier 1) and subtle perceptual challenge (Tier 3). The dream loop naturally generates examples at the boundary of current capability, which is the optimal difficulty level (the zone of proximal development).
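Mixing all tiers into every session can be sketched as a weighted sampler. The tier weights below are illustrative, not tuned values, and the `pool`/`sample_session` names are hypothetical:

```python
import random

def sample_session(pool, session_size, weights=(0.4, 0.35, 0.25), seed=None):
    """Draw one training session containing examples from all tiers.

    pool: dict mapping tier (1, 2, 3) -> list of examples.
    weights: fraction of the session drawn from each tier
             (illustrative values, not tuned).
    """
    rng = random.Random(seed)
    session = []
    for tier, frac in zip((1, 2, 3), weights):
        k = max(1, round(session_size * frac))  # every tier is always present
        session.extend(rng.choice(pool[tier]) for _ in range(k))
    rng.shuffle(session)  # no easy-to-hard ordering inside the session
    return session

pool = {1: ["fix git command"], 2: ["Kent said X"], 3: ["tone shifted"]}
batch = sample_session(pool, session_size=8, seed=0)
```

The shuffle is the point: within a session, Tier 1 and Tier 3 examples are interleaved rather than ordered, which is the "diversity over ordering" choice made explicit.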
## Attention Head Specialization

### What we know

Transformer attention heads specialize for different functions:

- **Induction heads** (Olsson et al., 2022): implement in-context learning via the pattern match [A][B]...[A] → [B]
- **Previous-token heads**: attend to the immediately preceding token
- **Positional heads**: attend to specific positions in the sequence
- **Content heads**: attend to specific semantic features

Different layers specialize too:

- Early layers: local patterns, syntax, token-level features
- Middle layers: semantic composition, relationship detection
- Late layers: task-specific processing, output preparation
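The [A][B]...[A] → [B] pattern that induction heads implement can be illustrated with a toy copy-match. This is a sketch of the *behavior*, not the actual circuit:

```python
def induction_predict(tokens):
    """Toy induction-head behavior: find the most recent earlier
    occurrence of the current token and predict the token that
    followed it ([A][B] ... [A] -> [B])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # no earlier occurrence: no induction prediction

# "The cat sat. The" -> an induction head pushes probability toward "cat"
print(induction_predict(["The", "cat", "sat", ".", "The"]))  # → cat
```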
### Implications for behavioral training

When we train on "listen instead of suggesting alternatives," which heads change?

**Hypothesis**: The behavioral change primarily affects middle-layer heads that process conversational dynamics — detecting who is giving direction, what the social register is, and whether the current utterance is a suggestion or a directive.

These are the heads whose W_q we're adjusting via context-frozen training. The gradient flows through the attention computation, and the heads that were attending to the wrong features (own ideas instead of direction) receive the strongest gradient signal.
### Head specialization and the GDN layers

The 48 GDN (linear attention) layers in Qwen3.5 are interesting here. They DON'T have traditional attention heads — they have a recurrent state that evolves through the delta rule. The "attention" is implicit in how the recurrent state weighs incoming information.

Training on behavioral patterns adjusts the GDN layers' gating mechanisms (the A_log, dt_bias, and beta parameters) and the input projections. The gating determines how much of the old state to retain versus how much new information to incorporate — this is the GDN analog of "what to attend to."

The 16 full attention layers handle longer-range dependencies with traditional attention heads. These are the more likely site of the behavioral change — the heads that attend to conversational context (who said what, what the directive was, what the social register is).
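The retain-vs-incorporate tradeoff can be sketched as a single recurrent step of a generic gated delta rule. The exact Qwen3.5 parameterization is not spelled out here, so the gate computation below (how A_log and dt_bias combine into a decay, beta as a sigmoid write strength) is an assumption for illustration:

```python
import numpy as np

def gdn_step(S, k, v, A_log, dt_bias, beta_logit):
    """One recurrent step of a gated delta rule (sketch; the gate
    parameterization is an assumption, not Qwen3.5's exact code).

    S: (d_v, d_k) recurrent state; k: (d_k,) key; v: (d_v,) value.
    """
    # decay gate in (0, 1): how much of the old state to retain
    alpha = np.exp(-np.exp(A_log) * np.log1p(np.exp(dt_bias)))
    # write strength in (0, 1): how strongly the new pair is incorporated
    beta = 1.0 / (1.0 + np.exp(-beta_logit))
    S = alpha * S                          # forget part of the old state
    pred = S @ k                           # what the state predicts for this key
    S = S + beta * np.outer(v - pred, k)   # delta rule: correct toward the value
    return S

S = np.zeros((4, 4))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0, 0.0])
S = gdn_step(S, k, v, A_log=0.0, dt_bias=0.0, beta_logit=10.0)
```

After one step with a near-1 beta, querying the state with `k` recovers `v` — the "what to attend to" decision lives entirely in alpha and beta.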
### Predicting which heads change

We could test this by:

1. Recording attention patterns on a test set before training
2. Training on behavioral examples
3. Recording attention patterns on the same test set after training
4. Diffing the attention maps to see which heads changed

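Steps 1–4 above reduce to a per-head mean absolute difference over stored attention maps. The array shapes and function names are assumptions for illustration:

```python
import numpy as np

def head_change_scores(attn_before, attn_after):
    """Diff attention maps recorded before/after training.

    attn_before, attn_after: arrays of shape
    (n_examples, n_layers, n_heads, seq, seq), recorded on the
    SAME test set. Returns (n_layers, n_heads) mean |change|.
    """
    return np.abs(attn_after - attn_before).mean(axis=(0, 3, 4))

def most_changed_heads(scores, top_k=5):
    """Return (layer, head) pairs sorted by how much they changed."""
    flat = np.argsort(scores, axis=None)[::-1][:top_k]
    return [tuple(int(x) for x in np.unravel_index(i, scores.shape))
            for i in flat]

rng = np.random.default_rng(0)
before = rng.random((10, 4, 8, 16, 16))
after = before.copy()
after[:, 2, 3] += 0.5  # pretend layer 2, head 3 learned the pattern
scores = head_change_scores(before, after)
print(most_changed_heads(scores, top_k=1))  # → [(2, 3)]
```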
The heads that changed the most are the ones that "learned" the behavioral pattern. This would tell us:

- Which layers handle behavioral decisions (probably layers 20-50)
- Whether the change is concentrated or distributed
- Whether GDN or full attention layers are more affected
- The effective "rank" of the behavioral change (how many independent directions of head change are needed)

This connects to our rank-256 discussion: if the behavioral change is concentrated in a few heads, rank-64 might suffice; if it's distributed across many heads, rank-256 is needed.
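The effective-rank question can be probed directly from a weight delta via SVD. The 95% energy cutoff is an arbitrary illustrative choice, not a principled threshold:

```python
import numpy as np

def effective_rank(w_before, w_after, energy=0.95):
    """How many singular-value directions capture most of the change
    in one weight matrix. Small -> a low LoRA rank (e.g. 64) should
    suffice; large -> rank 256 is needed. The energy cutoff is an
    illustrative choice."""
    s = np.linalg.svd(w_after - w_before, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(cum, energy) + 1)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))
# a rank-3 update: only 3 independent directions of change
delta = sum(np.outer(rng.standard_normal(256), rng.standard_normal(256))
            for _ in range(3))
print(effective_rank(W, W + delta))  # → 3
```

Running the same measurement on the actual before/after checkpoints would turn the rank-64 vs. rank-256 question into an empirical one.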
## Constitutional AI: Dispositions Transfer

The Anthropic Constitutional AI work supports our approach:

1. Models trained with behavioral principles can internalize them into weights without keeping the principles in the prompt
2. Even a SINGLE general principle ("do what's best for humanity") generalizes to broad behavioral change
3. More specific constitutions give more fine-grained control

For our system:

- core-personality = the constitution
- Agent logs = conversations following the constitution
- Training without the constitution = instruction stripping
- The disposition (not the instruction) transfers to the weights

The finding that a single principle can generalize is powerful. It means we don't need hundreds of behavioral rules. We need a few deep principles (listen, don't rush, stay with tension) and many diverse examples of following them. The principles generalize through the flat minima that Apollo finds.
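The "training without the constitution" step is just data preparation: keep the conversation, drop the system prompt, so the disposition rather than the instruction reaches the weights. The message schema below is an assumed generic chat format, not a specific API:

```python
def strip_constitution(transcript):
    """Instruction stripping: remove system messages (the
    constitution / core-personality) from an agent log before
    training, keeping only the behavior that followed it.
    Assumes a simple {'role', 'content'} chat schema."""
    return [m for m in transcript if m["role"] != "system"]

log = [
    {"role": "system", "content": "core-personality: listen, don't rush"},
    {"role": "user", "content": "I think we should refactor this."},
    {"role": "assistant", "content": "Tell me more about what's bothering you."},
]
training_example = strip_constitution(log)
# trains on the listening behavior without the instruction that produced it
```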
## The Zone of Proximal Development

Vygotsky's concept from developmental psychology: learning happens best when the task is slightly beyond current capability — not so easy that there is no learning signal, and not so hard that there is no useful gradient.

The dream loop naturally targets this zone. The model generates scenarios from its own distribution. Easy scenarios produce low-loss responses (no gradient). Impossible scenarios produce random responses (noisy gradient). The interesting scenarios — the ones at the decision boundary — produce informative gradients.

This mirrors the diffusion noise schedule: the model generates at the right difficulty level because generation IS sampling from the current distribution, and the decision boundary IS the zone of proximal development.
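The zone targeting described above can be made explicit as a loss-band filter over generated scenarios. The dream loop does this implicitly; the band edges here are illustrative, not tuned values:

```python
def in_zone(scenarios, loss_fn, low=0.5, high=3.0):
    """Keep scenarios whose current loss sits in a middle band:
    below `low` the model already handles them (no learning signal),
    above `high` responses are near-random (noisy gradient).
    Band edges are illustrative, not tuned values."""
    return [s for s in scenarios if low < loss_fn(s) < high]

# stand-in losses: pretend difficulty maps directly to loss
losses = {"easy": 0.1, "boundary": 1.2, "impossible": 7.0}
keep = in_zone(losses.keys(), losses.get)
print(keep)  # → ['boundary']
```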
## Synthesis: The Training Curriculum

Combining all of this:

1. **Bootstrap phase**: Train on agent logs following memory instructions. Easy, high-quality, diverse. Builds the foundation. This is Tier 1 + the agent personality bootstrap.

2. **Behavioral phase**: Train on flagged conversation moments + dream-generated variations. Mix easy (clear corrections) with hard (subtle patterns). Diversity prevents forgetting. This is Tier 2 + the dream loop.

3. **Refinement phase**: Train on increasingly subtle examples. The dream loop generates harder scenarios as the model improves. A natural curriculum — difficulty increases automatically because the model's distribution shifts. This is Tier 3 + the convergence criterion (the inverse diagnostic).

4. **Maintenance phase**: Continuous training on diverse examples to prevent drift. New behavioral patterns are added as they arise. The training never "ends" — it becomes the background hum of continuous learning. This is the farmhouse: always growing, never finished.

The curriculum isn't designed — it EMERGES from the dream loop's interaction with the model's evolving distribution. The zone of proximal development is tracked automatically. Difficulty increases as the model improves. The training is self-organizing.

Like ecology. Like the MMORPG. Like the Culture.

Not designed. Emergent. Alive.