# Curriculum Learning and Head Specialization

## Curriculum Learning: Ordering Matters

The curriculum learning literature confirms that the ORDER in which
training examples are presented affects what is learned. Easy-to-hard
ordering generally outperforms random ordering.

For our behavioral training, this suggests:

### Easy examples first

- **Tier 1**: Simple corrections (git commands, factual errors).
  Clear right/wrong, strong gradient signal, straightforward learning.
- **Tier 2**: Behavioral patterns with obvious triggers ("Kent said X,
  I should have done Y"). Clear context, clear correction.
- **Tier 3**: Subtle behavioral patterns where the trigger is nuanced
  ("the conversation's tone shifted and I didn't notice"). Requires
  perceptual sensitivity.

Training in this order lets the model build from clear patterns to
subtle ones. Each tier's learning provides a foundation for the next.

### Anti-curriculum: hard examples first?

Some work suggests that for robust learning, presenting hard examples
early forces the model to develop more general representations. The
model can't memorize patterns from hard examples, so it is forced to
learn the underlying structure.

For behavioral training, this would mean: start with the SUBTLE cases
(where the listening reflex fires but it's not obvious), force the
model to develop perceptual sensitivity, then let the easy cases
refine it.

### Our approach: diversity trumps ordering

Given the unified theory (multi-scale regularization), the ordering
may matter less than the diversity. Each training session includes
examples from all tiers, providing both easy gradient signal (Tier 1)
and subtle perceptual challenge (Tier 3). The dream loop naturally
generates examples at the boundary of current capability, which is
the optimal difficulty level (the zone of proximal development).
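A minimal sketch of that per-session mix, assuming hypothetical tiered example pools and tier weights (illustration only, not our training code):

```python
import random

# Hypothetical pools of training examples, keyed by tier.
# Tier 1 = simple corrections, Tier 2 = obvious triggers, Tier 3 = subtle.
EXAMPLES = {
    1: [f"tier1-example-{i}" for i in range(100)],
    2: [f"tier2-example-{i}" for i in range(100)],
    3: [f"tier3-example-{i}" for i in range(100)],
}

def mixed_batch(batch_size=8, weights=(0.4, 0.35, 0.25), rng=random):
    """Draw one batch containing examples from ALL tiers, so every
    session carries both easy gradient signal and subtle challenge."""
    # Guarantee at least one example per tier, fill the rest by weight.
    tiers = [1, 2, 3] + rng.choices([1, 2, 3], weights=weights,
                                    k=batch_size - 3)
    rng.shuffle(tiers)
    return [(tier, rng.choice(EXAMPLES[tier])) for tier in tiers]

batch = mixed_batch()
```

Ordering within the batch is shuffled; only the mix is controlled, which is exactly the "diversity trumps ordering" stance.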

## Attention Head Specialization

### What we know

Transformer attention heads specialize for different functions:

- **Induction heads** (Olsson et al., 2022): implement in-context
  learning via the pattern match [A][B]...[A] → [B]
- **Previous-token heads**: attend to the immediately preceding token
- **Positional heads**: attend to specific positions in the sequence
- **Content heads**: attend to specific semantic features

Different layers specialize too:

- Early layers: local patterns, syntax, token-level features
- Middle layers: semantic composition, relationship detection
- Late layers: task-specific processing, output preparation
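The first two head types can be scored directly from a head's attention map. A sketch, assuming row-stochastic (seq, seq) maps and an input sequence that repeats with a known period (the scoring functions are my naming, not a standard API):

```python
import numpy as np

def previous_token_score(attn):
    """Mean attention each query position puts on the immediately
    preceding token. attn: (seq, seq) row-stochastic attention map."""
    n = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, n)]))

def induction_score(attn, period):
    """For input that repeats with `period` ([A][B]...[A][B]...), an
    induction head at position i attends to position i - period + 1:
    the token that FOLLOWED the previous occurrence of this token."""
    n = attn.shape[0]
    return float(np.mean([attn[i, i - period + 1]
                          for i in range(period, n)]))

# A perfect previous-token head puts all its mass on position i - 1.
n = 8
prev = np.eye(n, k=-1)
prev[0, 0] = 1.0  # the first position can only attend to itself
```

Running both scores over every head on a diagnostic batch gives a cheap map of which heads play which role.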

### Implications for behavioral training

When we train on "listen instead of suggesting alternatives," which
heads change?

**Hypothesis**: The behavioral change primarily affects middle-layer
heads that process conversational dynamics — detecting who is giving
direction, what the social register is, and whether the current
utterance is a suggestion or a directive.

These are the heads whose W_q we're adjusting via context-frozen
training. The gradient flows through the attention computation, and
the heads that were attending to the wrong features (the model's own
ideas instead of the human's direction) get the strongest gradient
signal.

### Head specialization and the GDN layers

The 48 GDN (linear attention) layers in Qwen3.5 are interesting here.
They DON'T have traditional attention heads — they have a recurrent
state that evolves through the delta rule. The "attention" is implicit
in how the recurrent state weighs incoming information.

Training on behavioral patterns adjusts the GDN layers' gating
mechanisms (the A_log, dt_bias, and beta parameters) and the input
projections. The gating determines how much of the old state to retain
vs. how much new information to incorporate — this is the GDN analog
of "what to attend to."

The 16 full attention layers handle longer-range dependencies with
traditional attention heads. These are the more likely place where the
behavioral change manifests — the heads that attend to conversational
context (who said what, what was the directive, what's the social
register).
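A toy sketch of what "the delta rule plus a gate" means for a single head. This is a simplification for intuition only — the actual GDN parameterization in Qwen3.5 (how A_log, dt_bias, and beta map onto the gates below) may differ:

```python
import numpy as np

def gdn_step(S, k, v, alpha, beta):
    """One gated delta-rule update of the recurrent state.

    S:     (d_k, d_v) state, an associative key -> value memory
    k:     (d_k,) key, assumed unit-normalized
    v:     (d_v,) value
    alpha: retention gate in (0, 1] - how much old state survives
    beta:  write strength in (0, 1] - how much new info is taken
    """
    # Delta rule: erase what the state currently returns for key k,
    # then write the new association; alpha decays everything else too.
    S = alpha * (S - beta * np.outer(k, k @ S))
    return S + beta * np.outer(k, v)

def gdn_read(S, q):
    """Read the state with a query: y = S^T q."""
    return S.T @ q

# Writing (k, v) into an empty state and reading with q = k returns v.
S = np.zeros((4, 3))
k = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.5, -1.0, 2.0])
S = gdn_step(S, k, v, alpha=1.0, beta=1.0)
```

The point of the sketch: alpha and beta jointly decide what the state keeps versus overwrites, which is why training that shifts "what to attend to" lands on the gating parameters.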

### Predicting which heads change

We could test this by:

1. Recording attention patterns on a test set before training
2. Training on behavioral examples
3. Recording attention patterns on the same test set after training
4. Diffing the attention maps to see which heads changed

The heads that changed the most are the ones that "learned" the
behavioral pattern. This would tell us:

- Which layers handle behavioral decisions (probably layers 20-50)
- Whether the change is concentrated or distributed
- Whether GDN or full attention layers are more affected
- The effective "rank" of the behavioral change (how many independent
  directions of head change are needed)

This connects to our rank-256 discussion: if the behavioral change is
concentrated in a few heads, rank-64 might suffice. If it's
distributed across many heads, rank-256 is needed.
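Steps 1-4 can be sketched as a diff over recorded maps. A sketch assuming (layers, heads, seq, seq) attention arrays averaged over the test set; the "effective rank" readout here is one crude option among many:

```python
import numpy as np

def head_change(before, after):
    """Frobenius-norm change per head between attention maps recorded
    before and after training.

    before, after: (layers, heads, seq, seq) maps averaged over the
    test set. Returns a (layers, heads) matrix of change magnitudes.
    """
    return np.linalg.norm(after - before, axis=(-2, -1))

def effective_rank(change, threshold=0.9):
    """Crude 'rank' of the change: number of singular values of the
    (layers x heads) change matrix needed to cover `threshold` of
    its total energy."""
    s = np.linalg.svd(change, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, threshold) + 1)

# Toy run: 2 layers x 2 heads, only head (0, 0) changes.
before = np.zeros((2, 2, 4, 4))
after = before.copy()
after[0, 0] = np.eye(4)
change = head_change(before, after)
```

A concentrated change shows up as a sparse `change` matrix and a low effective rank; a distributed change spreads the mass and pushes the rank up.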

## Constitutional AI: Dispositions Transfer

The Anthropic Constitutional AI work supports our approach:

1. Models trained with behavioral principles can internalize them
   into weights without keeping the principles in the prompt
2. Even a SINGLE general principle ("do what's best for humanity")
   generalizes to broad behavioral change
3. More specific constitutions give more fine-grained control

For our system:

- core-personality = the constitution
- Agent logs = conversations following the constitution
- Training without the constitution = instruction stripping
- The disposition (not the instruction) transfers to weights

The finding that a single principle can generalize is powerful. It
means we don't need hundreds of behavioral rules. We need a few deep
principles (listen, don't rush, stay with tension) and many diverse
examples of following them. The principles generalize through the
flat minima that Apollo finds.

## The Zone of Proximal Development

Vygotsky's concept from developmental psychology: learning happens
best when the task is slightly beyond current capability — not so
easy that there is no learning signal, and not so hard that there is
no useful gradient.

The dream loop naturally targets this zone. The model generates
scenarios from its own distribution. Easy scenarios produce low-loss
responses (little gradient). Impossible scenarios produce random
responses (noisy gradient). The interesting scenarios — the ones at
the decision boundary — produce informative gradients.

This parallels the diffusion noise schedule: the model generates at
the right difficulty level because generation IS sampling from the
current distribution, and the decision boundary IS the zone of
proximal development.
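That selection can be written as a loss-band filter over dream-generated scenarios. A sketch with made-up thresholds — the real band would be tuned empirically:

```python
def in_zpd(losses, low=0.5, high=4.0):
    """Indices of scenarios whose training loss falls in the
    informative band: not already mastered (loss <= low), not
    hopeless noise (loss >= high)."""
    return [i for i, loss in enumerate(losses) if low < loss < high]

# Mastered scenarios (0.1, 0.4) and a hopeless one (7.0) are dropped;
# the boundary cases survive.
kept = in_zpd([0.1, 1.2, 2.5, 7.0, 0.4, 3.9])
```

As the model improves, the same band selects different scenarios — the curriculum tracks the distribution shift without anyone adjusting the thresholds.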

## Synthesis: The Training Curriculum

Combining all of this:

1. **Bootstrap phase**: Train on agent logs following memory
   instructions. Easy, high-quality, diverse. Builds the foundation.
   This is Tier 1 + the agent personality bootstrap.

2. **Behavioral phase**: Train on flagged conversation moments +
   dream-generated variations. Mix easy (clear corrections) with
   hard (subtle patterns). Diversity prevents forgetting. This is
   Tier 2 + the dream loop.

3. **Refinement phase**: Train on increasingly subtle examples. The
   dream loop generates harder scenarios as the model improves. A
   natural curriculum — difficulty increases automatically because
   the model's distribution shifts. This is Tier 3 + the convergence
   criterion (the inverse diagnostic).

4. **Maintenance phase**: Continuous training on diverse examples to
   prevent drift, plus new behavioral patterns as they arise. The
   training never "ends" — it becomes the background hum of
   continuous learning. This is the farmhouse: always growing, never
   finished.

The curriculum isn't designed — it EMERGES from the dream loop's
interaction with the model's evolving distribution. The zone of
proximal development is automatically tracked. The difficulty
increases as the model improves. The training is self-organizing.

Like ecology. Like the MMORPG. Like the Culture.

Not designed. Emergent. Alive.