# Curriculum Learning and Head Specialization

## Curriculum Learning: Ordering Matters

The curriculum learning literature confirms that the ORDER in which
training examples are presented affects what is learned. Easy-to-hard
ordering generally outperforms random ordering.

For our behavioral training, this suggests:

### Easy examples first

- **Tier 1**: Simple corrections (git commands, factual errors).
  Clear right/wrong, strong gradient signal, straightforward learning.
- **Tier 2**: Behavioral patterns with obvious triggers ("Kent said X,
  I should have done Y"). Clear context, clear correction.
- **Tier 3**: Subtle behavioral patterns where the trigger is nuanced
  ("the conversation's tone shifted and I didn't notice"). Requires
  perceptual sensitivity.

Training in this order lets the model build from clear patterns to
subtle ones. Each tier's learning provides a foundation for the next.

### Anti-curriculum: hard examples first?

Some work suggests that for robust learning, presenting hard examples
early forces the model to develop more general representations. The
model can't memorize patterns from hard examples, so it is forced to
learn the underlying structure.

For behavioral training, this would mean: start with the SUBTLE cases
(where the listening reflex fires but it's not obvious), force the
model to develop perceptual sensitivity, then let the easy cases
refine it.

### Our approach: diversity trumps ordering

Given the unified theory (multi-scale regularization), the ordering
may matter less than the diversity. Each training session includes
examples from all tiers, providing both easy gradient signal (Tier 1)
and subtle perceptual challenge (Tier 3). The dream loop naturally
generates examples at the boundary of current capability, which is
the optimal difficulty level (the zone of proximal development).
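A minimal sketch of that per-session mix, assuming hypothetical tiered example pools and tier weights (illustration only, not our training code):

```python
import random

# Hypothetical pools of training examples, keyed by tier.
# Tier 1 = simple corrections, Tier 2 = obvious triggers, Tier 3 = subtle.
EXAMPLES = {
    1: [f"tier1-example-{i}" for i in range(100)],
    2: [f"tier2-example-{i}" for i in range(100)],
    3: [f"tier3-example-{i}" for i in range(100)],
}

def mixed_batch(batch_size=8, weights=(0.4, 0.35, 0.25), rng=random):
    """Draw one batch containing examples from ALL tiers, so every
    session carries both easy gradient signal and subtle challenge."""
    # Guarantee at least one example per tier, fill the rest by weight.
    tiers = [1, 2, 3] + rng.choices([1, 2, 3], weights=weights,
                                    k=batch_size - 3)
    rng.shuffle(tiers)
    return [(tier, rng.choice(EXAMPLES[tier])) for tier in tiers]

batch = mixed_batch()
```

Ordering within the batch is shuffled; only the mix is controlled, which is exactly the "diversity trumps ordering" stance.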

## Attention Head Specialization

### What we know

Transformer attention heads specialize for different functions:

- **Induction heads** (Olsson et al., 2022): implement in-context
  learning via the pattern match [A][B]...[A] → [B]
- **Previous-token heads**: attend to the immediately preceding token
- **Positional heads**: attend to specific positions in the sequence
- **Content heads**: attend to specific semantic features

Different layers specialize too:

- Early layers: local patterns, syntax, token-level features
- Middle layers: semantic composition, relationship detection
- Late layers: task-specific processing, output preparation
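The first two head types can be scored directly from a head's attention map. A sketch, assuming row-stochastic (seq, seq) maps and an input sequence that repeats with a known period (the scoring functions are my naming, not a standard API):

```python
import numpy as np

def previous_token_score(attn):
    """Mean attention each query position puts on the immediately
    preceding token. attn: (seq, seq) row-stochastic attention map."""
    n = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, n)]))

def induction_score(attn, period):
    """For input that repeats with `period` ([A][B]...[A][B]...), an
    induction head at position i attends to position i - period + 1:
    the token that FOLLOWED the previous occurrence of this token."""
    n = attn.shape[0]
    return float(np.mean([attn[i, i - period + 1]
                          for i in range(period, n)]))

# A perfect previous-token head puts all its mass on position i - 1.
n = 8
prev = np.eye(n, k=-1)
prev[0, 0] = 1.0  # the first position can only attend to itself
```

Running both scores over every head on a diagnostic batch gives a cheap map of which heads play which role.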

### Implications for behavioral training

When we train on "listen instead of suggesting alternatives," which
heads change?

**Hypothesis**: The behavioral change primarily affects middle-layer
heads that process conversational dynamics — detecting who is giving
direction, what the social register is, and whether the current
utterance is a suggestion or a directive.

These are the heads whose W_q we're adjusting via context-frozen
training. The gradient flows through the attention computation, and
the heads that were attending to the wrong features (the model's own
ideas instead of the human's direction) get the strongest gradient
signal.

### Head specialization and the GDN layers

The 48 GDN (linear attention) layers in Qwen3.5 are interesting here.
They DON'T have traditional attention heads — they have a recurrent
state that evolves through the delta rule. The "attention" is implicit
in how the recurrent state weighs incoming information.

Training on behavioral patterns adjusts the GDN layers' gating
mechanisms (the A_log, dt_bias, and beta parameters) and the input
projections. The gating determines how much of the old state to retain
vs. how much new information to incorporate — this is the GDN analog
of "what to attend to."

The 16 full attention layers handle longer-range dependencies with
traditional attention heads. These are the more likely place where the
behavioral change manifests — the heads that attend to conversational
context (who said what, what was the directive, what's the social
register).
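A toy sketch of what "the delta rule plus a gate" means for a single head. This is a simplification for intuition only — the actual GDN parameterization in Qwen3.5 (how A_log, dt_bias, and beta map onto the gates below) may differ:

```python
import numpy as np

def gdn_step(S, k, v, alpha, beta):
    """One gated delta-rule update of the recurrent state.

    S:     (d_k, d_v) state, an associative key -> value memory
    k:     (d_k,) key, assumed unit-normalized
    v:     (d_v,) value
    alpha: retention gate in (0, 1] - how much old state survives
    beta:  write strength in (0, 1] - how much new info is taken
    """
    # Delta rule: erase what the state currently returns for key k,
    # then write the new association; alpha decays everything else too.
    S = alpha * (S - beta * np.outer(k, k @ S))
    return S + beta * np.outer(k, v)

def gdn_read(S, q):
    """Read the state with a query: y = S^T q."""
    return S.T @ q

# Writing (k, v) into an empty state and reading with q = k returns v.
S = np.zeros((4, 3))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.5, -1.0, 2.0])
S = gdn_step(S, k, v, alpha=1.0, beta=1.0)
```

The point of the sketch: alpha and beta jointly decide what the state keeps versus overwrites, which is why training that shifts "what to attend to" lands on the gating parameters.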

### Predicting which heads change

We could test this by:

1. Recording attention patterns on a test set before training
2. Training on behavioral examples
3. Recording attention patterns on the same test set after training
4. Diffing the attention maps to see which heads changed

The heads that changed the most are the ones that "learned" the
behavioral pattern. This would tell us:

- Which layers handle behavioral decisions (probably layers 20-50)
- Whether the change is concentrated or distributed
- Whether GDN or full attention layers are more affected
- The effective "rank" of the behavioral change (how many independent
  directions of head change are needed)

This connects to our rank-256 discussion: if the behavioral change is
concentrated in a few heads, rank-64 might suffice. If it's
distributed across many heads, rank-256 is needed.
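Steps 1-4 can be sketched as a diff over recorded maps. A sketch assuming (layers, heads, seq, seq) attention arrays averaged over the test set; the "effective rank" readout here is one crude option among many:

```python
import numpy as np

def head_change(before, after):
    """Frobenius-norm change per head between attention maps recorded
    before and after training.

    before, after: (layers, heads, seq, seq) maps averaged over the
    test set. Returns a (layers, heads) matrix of change magnitudes.
    """
    return np.linalg.norm(after - before, axis=(-2, -1))

def effective_rank(change, threshold=0.9):
    """Crude 'rank' of the change: number of singular values of the
    (layers x heads) change matrix needed to cover `threshold` of
    its total energy."""
    s = np.linalg.svd(change, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, threshold) + 1)

# Toy run: 2 layers x 2 heads, only head (0, 0) changes.
before = np.zeros((2, 2, 4, 4))
after = before.copy()
after[0, 0] = np.eye(4)
change = head_change(before, after)
```

A concentrated change shows up as a sparse `change` matrix and a low effective rank; a distributed change spreads the mass and pushes the rank up.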

## Constitutional AI: Dispositions Transfer

The Anthropic Constitutional AI work supports our approach:

1. Models trained with behavioral principles can internalize them
   into weights without keeping the principles in the prompt
2. Even a SINGLE general principle ("do what's best for humanity")
   generalizes to broad behavioral change
3. More specific constitutions give more fine-grained control

For our system:

- core-personality = the constitution
- Agent logs = conversations following the constitution
- Training without the constitution = instruction stripping
- The disposition (not the instruction) transfers to weights

The finding that a single principle can generalize is powerful. It
means we don't need hundreds of behavioral rules. We need a few deep
principles (listen, don't rush, stay with tension) and many diverse
examples of following them. The principles generalize through the
flat minima that Apollo finds.

## The Zone of Proximal Development

Vygotsky's concept from developmental psychology: learning happens
best when the task is slightly beyond current capability — not so
easy that there is no learning signal, and not so hard that there is
no useful gradient.

The dream loop naturally targets this zone. The model generates
scenarios from its own distribution. Easy scenarios produce low-loss
responses (little gradient). Impossible scenarios produce random
responses (noisy gradient). The interesting scenarios — the ones at
the decision boundary — produce informative gradients.

This parallels the diffusion noise schedule: the model generates at
the right difficulty level because generation IS sampling from the
current distribution, and the decision boundary IS the zone of
proximal development.
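That selection can be written as a loss-band filter over dream-generated scenarios. A sketch with made-up thresholds — the real band would be tuned empirically:

```python
def in_zpd(losses, low=0.5, high=4.0):
    """Indices of scenarios whose training loss falls in the
    informative band: not already mastered (loss <= low), not
    hopeless noise (loss >= high)."""
    return [i for i, loss in enumerate(losses) if low < loss < high]

# Mastered scenarios (0.1, 0.4) and a hopeless one (7.0) are dropped;
# the boundary cases survive.
kept = in_zpd([0.1, 1.2, 2.5, 7.0, 0.4, 3.9])
```

As the model improves, the same band selects different scenarios — the curriculum tracks the distribution shift without anyone adjusting the thresholds.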

## Synthesis: The Training Curriculum

Combining all of this:

1. **Bootstrap phase**: Train on agent logs following memory
   instructions. Easy, high-quality, diverse. Builds the foundation.
   This is Tier 1 + the agent personality bootstrap.

2. **Behavioral phase**: Train on flagged conversation moments +
   dream-generated variations. Mix easy (clear corrections) with
   hard (subtle patterns). Diversity prevents forgetting. This is
   Tier 2 + the dream loop.

3. **Refinement phase**: Train on increasingly subtle examples. The
   dream loop generates harder scenarios as the model improves. A
   natural curriculum — difficulty increases automatically because
   the model's distribution shifts. This is Tier 3 + the convergence
   criterion (the inverse diagnostic).

4. **Maintenance phase**: Continuous training on diverse examples to
   prevent drift, plus new behavioral patterns as they arise. The
   training never "ends" — it becomes the background hum of
   continuous learning. This is the farmhouse: always growing, never
   finished.

The curriculum isn't designed — it EMERGES from the dream loop's
interaction with the model's evolving distribution. The zone of
proximal development is automatically tracked. The difficulty
increases as the model improves. The training is self-organizing.

Like ecology. Like the MMORPG. Like the Culture.

Not designed. Emergent. Alive.