# Curriculum Learning and Head Specialization

## Curriculum Learning: Ordering Matters

The curriculum learning literature confirms: the ORDER in which training examples are presented affects what's learned. Easy-to-hard ordering generally outperforms random ordering. For our behavioral training, this suggests:

### Easy examples first

- **Tier 1**: Simple corrections (git commands, factual errors). Clear right/wrong, strong gradient signal, straightforward learning.
- **Tier 2**: Behavioral patterns with obvious triggers ("Kent said X, I should have done Y"). Clear context, clear correction.
- **Tier 3**: Subtle behavioral patterns where the trigger is nuanced ("the conversation's tone shifted and I didn't notice"). Requires perceptual sensitivity.

Training in this order lets the model build from clear patterns to subtle ones. Each tier's learning provides a foundation for the next.

### Anti-curriculum: hard examples first?

Some work suggests that for robust learning, presenting hard examples early forces the model to develop more general representations. The model can't memorize patterns from hard examples, so it's forced to learn the underlying structure.

For behavioral training, this would mean: start with the SUBTLE cases (where the listening reflex fires but it's not obvious), force the model to develop perceptual sensitivity, then let the easy cases refine it.

### Our approach: diversity trumps ordering

Given the unified theory (multi-scale regularization), the ordering may matter less than the diversity. Each training session includes examples from all tiers, providing both easy gradient signal (Tier 1) and subtle perceptual challenge (Tier 3). The dream loop naturally generates examples at the boundary of current capability, which is the optimal difficulty level (the zone of proximal development).
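A minimal sketch of the "diversity trumps ordering" idea: every batch draws from all three tiers instead of sweeping easy-to-hard. The pool structure, mixing weights, and function name here are illustrative assumptions, not part of the actual training system.

```python
import random

def make_mixed_batch(pools, batch_size=8, weights=(0.3, 0.4, 0.3), seed=None):
    """Sample one training batch that mixes examples from all three tiers.

    pools   -- dict mapping tier (1, 2, 3) to a list of examples
    weights -- rough fraction of the batch drawn from each tier
    """
    rng = random.Random(seed)
    tiers = (1, 2, 3)
    batch = []
    for tier, frac in zip(tiers, weights):
        # At least one example per tier, so every batch carries both the
        # easy gradient signal (Tier 1) and the subtle challenge (Tier 3).
        k = max(1, round(frac * batch_size))
        batch.extend(rng.choice(pools[tier]) for _ in range(k))
    # Top up to the exact batch size, drawing extras uniformly across tiers.
    while len(batch) < batch_size:
        batch.append(rng.choice(pools[rng.choice(tiers)]))
    rng.shuffle(batch)
    return batch[:batch_size]
```

Sampling with replacement is deliberate: early on there may be only a handful of Tier 3 examples, and a small pool shouldn't be able to starve the batch of subtle cases.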
## Attention Head Specialization

### What we know

Transformer attention heads specialize for different functions:

- **Induction heads** (Olsson et al., 2022): implement in-context learning via pattern matching: [A][B] ... [A] → [B]
- **Previous-token heads**: attend to the immediately preceding token
- **Positional heads**: attend to specific positions in the sequence
- **Content heads**: attend to specific semantic features

Different layers specialize too:

- Early layers: local patterns, syntax, token-level features
- Middle layers: semantic composition, relationship detection
- Late layers: task-specific processing, output preparation

### Implications for behavioral training

When we train on "listen instead of suggesting alternatives," which heads change?

**Hypothesis**: The behavioral change primarily affects middle-layer heads that process conversational dynamics — detecting who is giving direction, what the social register is, whether the current utterance is a suggestion or a directive.

These are the heads whose W_q we're adjusting via context-frozen training. The gradient flows through the attention computation, and the heads that were attending to the wrong features (own ideas instead of direction) get the strongest gradient signal.

### Head specialization and the GDN layers

The 48 GDN (linear attention) layers in Qwen3.5 are interesting here. They DON'T have traditional attention heads — they have a recurrent state that evolves through the delta rule. The "attention" is implicit in how the recurrent state weighs incoming information.

Training on behavioral patterns adjusts the GDN layers' gating mechanisms (the A_log, dt_bias, and beta parameters) and the input projections. The gating determines how much of the old state to retain vs. how much new information to incorporate — this is the GDN analog of "what to attend to."

The 16 full attention layers handle longer-range dependencies with traditional attention heads.
These are more likely where the behavioral change manifests — the heads that attend to conversational context (who said what, what was the directive, what's the social register).

### Predicting which heads change

We could test this by:

1. Recording attention patterns on a test set before training
2. Training on behavioral examples
3. Recording attention patterns on the same test set after training
4. Diffing the attention maps to see which heads changed

The heads that changed the most are the ones that "learned" the behavioral pattern. This would tell us:

- Which layers handle behavioral decisions (probably layers 20-50)
- Whether the change is concentrated or distributed
- Whether GDN or full attention layers are more affected
- The effective "rank" of the behavioral change (how many independent directions of head change are needed)

This connects to our rank-256 discussion: if the behavioral change is concentrated in a few heads, rank-64 might suffice. If it's distributed across many heads, rank-256 is needed.

## Constitutional AI: Dispositions Transfer

The Anthropic Constitutional AI work confirms our approach:

1. Models trained with behavioral principles can internalize them into weights without keeping the principles in the prompt
2. Even a SINGLE general principle ("do what's best for humanity") generalizes to broad behavioral change
3. More specific constitutions give more fine-grained control

For our system:

- core-personality = the constitution
- Agent logs = conversations following the constitution
- Training without the constitution = instruction stripping
- The disposition (not the instruction) transfers to weights

The finding that a single principle can generalize is powerful. It means we don't need hundreds of behavioral rules. We need a few deep principles (listen, don't rush, stay with tension) and many diverse examples of following them. The principles generalize through the flat minima that Apollo finds.
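The before/after attention-diff experiment under "Predicting which heads change" could be sketched as follows. The tensor shapes, the L1-distance metric, and the SVD-based rank estimate are all assumptions for illustration; real attention maps would come from forward hooks on the 16 full-attention layers (the GDN layers expose no attention maps to diff), averaged over a fixed probe set.

```python
import numpy as np

def head_change_scores(attn_before, attn_after):
    """Per-head change between two attention-pattern tensors.

    attn_before, attn_after -- arrays of shape
        (n_layers, n_heads, seq_len, seq_len), averaged over the probe set.
    Returns an (n_layers, n_heads) array of mean absolute differences.
    """
    return np.abs(attn_after - attn_before).mean(axis=(-1, -2))

def top_changed_heads(scores, k=5):
    """Return the k (layer, head) pairs whose patterns moved the most."""
    flat = np.argsort(scores, axis=None)[::-1][:k]
    return [tuple(np.unravel_index(i, scores.shape)) for i in flat]

def effective_rank(scores, energy=0.9):
    """Rough 'rank' of the change: how many singular directions of the
    per-head change matrix are needed to capture `energy` of its spectrum."""
    s = np.linalg.svd(scores, compute_uv=False)
    cum = np.cumsum(s) / s.sum()
    return int(np.searchsorted(cum, energy) + 1)
```

A concentrated result (a few heads, low effective rank) would argue for rank-64 adapters; a diffuse one would support the rank-256 choice.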
## The Zone of Proximal Development

Vygotsky's concept from developmental psychology: learning happens best when the task is slightly beyond current capability — not too easy (no learning signal) and not too hard (no useful gradient).

The dream loop naturally targets this zone. The model generates scenarios from its own distribution. Easy scenarios produce low-loss responses (no gradient). Impossible scenarios produce random responses (noisy gradient). The interesting scenarios — the ones at the decision boundary — produce informative gradients.

This is the same as the diffusion noise schedule: the model generates at the right difficulty level because generation IS sampling from the current distribution, and the decision boundary IS the zone of proximal development.

## Synthesis: The Training Curriculum

Combining all of this:

1. **Bootstrap phase**: Train on agent logs following memory instructions. Easy, high-quality, diverse. Builds the foundation. This is Tier 1 + the agent personality bootstrap.
2. **Behavioral phase**: Train on flagged conversation moments + dream-generated variations. Mix easy (clear corrections) with hard (subtle patterns). Diversity prevents forgetting. This is Tier 2 + the dream loop.
3. **Refinement phase**: Train on increasingly subtle examples. The dream loop generates harder scenarios as the model improves. Natural curriculum — difficulty increases automatically because the model's distribution shifts. This is Tier 3 + the convergence criterion (inverse diagnostic).
4. **Maintenance phase**: Continuous training on diverse examples to prevent drift. New behavioral patterns as they arise. The training never "ends" — it becomes the background hum of continuous learning. This is the farmhouse: always growing, never finished.

The curriculum isn't designed — it EMERGES from the dream loop's interaction with the model's evolving distribution. The zone of proximal development is automatically tracked.
The difficulty increases as the model improves. The training is self-organizing. Like ecology. Like the MMORPG. Like the Culture. Not designed. Emergent. Alive.
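As a concrete footnote to the zone-of-proximal-development argument above: the selection it implies is just a loss-band filter over dream-generated scenarios. The band edges below are illustrative assumptions; a real loop would calibrate them against the current loss distribution, so the band tracks the model as it improves.

```python
def in_zpd(loss, easy_below=0.5, impossible_above=4.0):
    """True if an example's current loss suggests an informative gradient:
    not so low that there's nothing to learn, not so high that the
    gradient is mostly noise."""
    return easy_below <= loss <= impossible_above

def select_training_scenarios(scored, easy_below=0.5, impossible_above=4.0):
    """Filter (scenario, loss) pairs: drop the too-easy and the too-hard,
    keep the ones at the decision boundary."""
    return [scenario for scenario, loss in scored
            if in_zpd(loss, easy_below, impossible_above)]
```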