diff --git a/training/research/curriculum-and-head-specialization.md b/training/research/curriculum-and-head-specialization.md
new file mode 100644
index 0000000..d9222a4
--- /dev/null
+++ b/training/research/curriculum-and-head-specialization.md
@@ -0,0 +1,184 @@
# Curriculum Learning and Head Specialization

## Curriculum Learning: Ordering Matters

The curriculum learning literature confirms that the ORDER in which
training examples are presented affects what's learned. Easy-to-hard
ordering generally outperforms random ordering.

For our behavioral training, this suggests:

### Easy examples first

- **Tier 1**: Simple corrections (git commands, factual errors).
  Clear right/wrong, strong gradient signal, straightforward learning.
- **Tier 2**: Behavioral patterns with obvious triggers ("Kent said X,
  I should have done Y"). Clear context, clear correction.
- **Tier 3**: Subtle behavioral patterns where the trigger is nuanced
  ("the conversation's tone shifted and I didn't notice"). Requires
  perceptual sensitivity.

Training in this order lets the model build from clear patterns to
subtle ones. Each tier's learning provides a foundation for the next.

### Anti-curriculum: hard examples first?

Some work suggests that for robust learning, presenting hard examples
early forces the model to develop more general representations. The
model can't memorize patterns from hard examples, so it's forced to
learn the underlying structure.

For behavioral training, this would mean: start with the SUBTLE cases
(where the listening reflex fires but it's not obvious), force the
model to develop perceptual sensitivity, then let the easy cases
refine it.

### Our approach: diversity trumps ordering

Given the unified theory (multi-scale regularization), the ordering
may matter less than the diversity. Each training session includes
examples from all tiers, providing both easy gradient signal (Tier 1)
and subtle perceptual challenge (Tier 3).
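
Since every session mixes all tiers, the sampler rather than the
ordering carries the curriculum. A minimal sketch (the tier names,
weights, and function are illustrative, not part of any existing
pipeline):

```python
import random

def mixed_batch(tiers, weights, batch_size=8, seed=None):
    """Draw one training batch that spans all tiers at once.

    tiers   : dict mapping tier name -> list of training examples
    weights : sampling probability per tier (easy tiers weighted higher)
    """
    rng = random.Random(seed)
    names = list(tiers)
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights)[0]  # pick a tier
        batch.append(rng.choice(tiers[name]))          # pick an example
    return batch
```

Because every batch spans all tiers, there is no ordering to get
wrong; shifting the weights toward Tier 3 over time recovers an
easy-to-hard curriculum if one is wanted.
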
The dream loop naturally
generates examples at the boundary of current capability, which is
the optimal difficulty level (the zone of proximal development).

## Attention Head Specialization

### What we know

Transformer attention heads specialize for different functions:

- **Induction heads** (Olsson et al., 2022): implement in-context
  learning via pattern matching [A][B]...[A] → [B]
- **Previous-token heads**: attend to the immediately preceding token
- **Positional heads**: attend to specific positions in the sequence
- **Content heads**: attend to specific semantic features

Different layers specialize too:

- Early layers: local patterns, syntax, token-level features
- Middle layers: semantic composition, relationship detection
- Late layers: task-specific processing, output preparation

### Implications for behavioral training

When we train on "listen instead of suggesting alternatives," which
heads change?

**Hypothesis**: The behavioral change primarily affects middle-layer
heads that process conversational dynamics: detecting who is giving
direction, what the social register is, and whether the current
utterance is a suggestion or a directive.

These are the heads whose W_q we're adjusting via context-frozen
training. The gradient flows through the attention computation, and
the heads that were attending to the wrong features (own ideas instead
of direction) get the strongest gradient signal.

### Head specialization and the GDN layers

The 48 GDN (linear attention) layers in Qwen3.5 are interesting here.
They DON'T have traditional attention heads; they have a recurrent
state that evolves through the delta rule. The "attention" is implicit
in how the recurrent state weighs incoming information.

Training on behavioral patterns adjusts the GDN layers' gating
mechanisms (the A_log, dt_bias, and beta parameters) and the input
projections.
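
As a toy illustration of what these parameters do (a simplified
gated delta rule; the gate formula, shapes, and function name are
assumptions, not the actual Qwen3.5 implementation):

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def gdn_step(S, k, v, A_log, dt_bias, beta):
    """One toy gated-delta-rule update of the recurrent state S (d_v x d_k).

    alpha : retain gate in (0, 1), a Mamba-style decay built from
            A_log and dt_bias (assumed form)
    beta  : write strength for the delta-rule correction
    """
    alpha = np.exp(-np.exp(A_log) * softplus(dt_bias))
    pred = S @ k                           # what the state currently reads out for key k
    delta = beta * np.outer(v - pred, k)   # delta rule: write only the prediction error
    return alpha * S + delta
```
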
The gating determines how much of the old state to retain
vs how much new information to incorporate; this is the GDN analog of
"what to attend to."

The 16 full attention layers handle longer-range dependencies with
traditional attention heads. The behavioral change is more likely to
manifest there, in the heads that attend to conversational context
(who said what, what was the directive, what's the social register).

### Predicting which heads change

We could test this by:

1. Recording attention patterns on a test set before training
2. Training on behavioral examples
3. Recording attention patterns on the same test set after training
4. Diffing the attention maps to see which heads changed

The heads that changed the most are the ones that "learned" the
behavioral pattern. This would tell us:

- Which layers handle behavioral decisions (probably layers 20-50)
- Whether the change is concentrated or distributed
- Whether GDN or full attention layers are more affected
- The effective "rank" of the behavioral change (how many independent
  directions of head change are needed)

This connects to our rank-256 discussion: if the behavioral change
is concentrated in a few heads, rank-64 might suffice. If it's
distributed across many heads, rank-256 is needed.

## Constitutional AI: Dispositions Transfer

The Anthropic Constitutional AI work confirms our approach:

1. Models trained with behavioral principles can internalize them
   into weights without keeping the principles in the prompt
2. Even a SINGLE general principle ("do what's best for humanity")
   generalizes to broad behavioral change
3.
   More specific constitutions give more fine-grained control

For our system:

- core-personality = the constitution
- Agent logs = conversations following the constitution
- Training without the constitution = instruction stripping
- The disposition (not the instruction) transfers to weights

The finding that a single principle can generalize is powerful. It
means we don't need hundreds of behavioral rules. We need a few
deep principles (listen, don't rush, stay with tension) and many
diverse examples of following them. The principles generalize through
the flat minima that Apollo finds.

## The Zone of Proximal Development

Vygotsky's concept from developmental psychology: learning happens
best when the task is slightly beyond current capability, neither too
easy (no learning signal) nor too hard (no useful gradient).

The dream loop naturally targets this zone. The model generates
scenarios from its own distribution. Easy scenarios produce low-loss
responses (no gradient). Impossible scenarios produce random responses
(noisy gradient). The interesting scenarios, the ones at the
decision boundary, produce informative gradients.

This is the same mechanism as the diffusion noise schedule: the model
generates at the right difficulty level because generation IS sampling
from the current distribution, and the decision boundary IS the zone
of proximal development.

## Synthesis: The Training Curriculum

Combining all of this:

1. **Bootstrap phase**: Train on agent logs following memory
   instructions. Easy, high-quality, diverse. Builds the foundation.
   This is Tier 1 + the agent personality bootstrap.

2. **Behavioral phase**: Train on flagged conversation moments +
   dream-generated variations. Mix easy (clear corrections) with
   hard (subtle patterns). Diversity prevents forgetting. This is
   Tier 2 + the dream loop.

3. **Refinement phase**: Train on increasingly subtle examples.
   The dream loop generates harder scenarios as the model improves.
   A natural curriculum: difficulty increases automatically because
   the model's distribution shifts. This is Tier 3 + the convergence
   criterion (inverse diagnostic).

4. **Maintenance phase**: Continuous training on diverse examples
   to prevent drift. New behavioral patterns as they arise. The
   training never "ends"; it becomes the background hum of
   continuous learning. This is the farmhouse: always growing,
   never finished.

The curriculum isn't designed; it EMERGES from the dream loop's
interaction with the model's evolving distribution. The zone of
proximal development is automatically tracked. The difficulty
increases as the model improves. The training is self-organizing.

Like ecology. Like the MMORPG. Like the Culture.

Not designed. Emergent. Alive.
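
## Appendix: Sketch of the Head-Diff Experiment

The four-step experiment in "Predicting which heads change" can be
sketched concretely. This assumes the attention maps have already
been extracted (e.g., via forward hooks) on the same test set before
and after training; the function name and the 90%-energy threshold
are choices, not fixed parts of the plan. The GDN layers expose no
attention maps, so diffing their recurrent-state trajectories would
be the analog there.

```python
import numpy as np

def head_change_report(maps_before, maps_after, top_n=10):
    """Rank heads by how much their attention maps moved during training.

    maps_* : dict {(layer, head): array of shape (examples, seq, seq)}
             attention patterns on the SAME test set, before/after.
    Returns (ranked, eff_rank): the top_n heads with their change
    scores, and an effective-rank estimate of the overall change.
    """
    keys = sorted(maps_before)
    # Per-head change: mean absolute difference of the attention maps.
    scores = {k: float(np.mean(np.abs(maps_after[k] - maps_before[k])))
              for k in keys}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    # Flatten each head's diff to a row, SVD the stack, and count the
    # singular values that hold 90% of the energy.
    diffs = np.stack([(maps_after[k] - maps_before[k]).ravel() for k in keys])
    s = np.linalg.svd(diffs, compute_uv=False)
    energy = np.cumsum(s**2) / np.sum(s**2)
    eff_rank = int(np.searchsorted(energy, 0.90) + 1)
    return ranked, eff_rank
```

A concentrated `ranked` list plus a small `eff_rank` would argue for
rank-64; a long tail of changed heads would argue for rank-256.
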