# Training Research: What We Know
## 1. The Optimizer (Apollo)

**Source**: arXiv:2412.05270, read in full.

Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. It uses a random projection, not SVD. Rank 256 for us.

**Critical implementation details we got wrong initially:**

- The scale factor α = √(n/r) compensates for the projection ratio. Without it, scaling factors are systematically too small.
- The norm-growth limiter (γ = 1.01) prevents loss spikes in early training.
- The projection matrix is refreshed every 200 steps, not every step.

**What the paper actually shows**: channel-wise scaling is sufficient for LLM training; the per-element granularity in AdamW is redundant. Apollo also achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training: flat minima mean broad behavioral patterns, not narrow pattern matching.

**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this range and finds it works.
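The scale factor, the limiter, and the refresh cadence are exactly the pieces that are easy to get wrong, so here is a toy sketch of the channel-wise update. Everything in it (the function name, shapes, and second-moment bookkeeping) is illustrative, not the official Apollo implementation:

```python
import numpy as np

def apollo_like_scale(grad, P, state, beta2=0.999, eps=1e-8, gamma=1.01):
    """Toy Apollo-style channel-wise scaling (illustrative, not the paper's code).

    grad:  (n, m) gradient of one weight matrix.
    P:     (r, n) random projection with r << n.
    state: dict carrying the projected second moment and previous update norm.
    """
    n, r = grad.shape[0], P.shape[0]
    g_low = P @ grad                                  # rank-r view of the gradient
    state["v"] = beta2 * state.get("v", np.zeros_like(g_low)) + (1 - beta2) * g_low**2
    alpha = np.sqrt(n / r)                            # compensates the projection ratio
    col_scale = alpha / (np.sqrt(state["v"].mean(axis=0)) + eps)
    update = grad * col_scale                         # channel-wise, not per-element
    # Norm-growth limiter: cap update-norm growth at a factor gamma per step.
    norm = np.linalg.norm(update)
    prev = state.get("prev_norm", norm)
    if norm > gamma * prev:
        update *= gamma * prev / norm
        norm = gamma * prev
    state["prev_norm"] = norm
    return update
```

The α = √(n/r) factor is the piece we initially dropped; without it `col_scale` comes out systematically too small.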
## 2. Where the Gradient Goes (W_q)

**Key insight**: In context-frozen training, the gradient flows through W_q (the query projection) even when the context KVs are frozen. The model learns to LOOK AT the context differently, not to understand it differently.

**Math**: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern; it just can't change the context keys.

**For behavioral training**: this is exactly right. The model already understands Kent's direction; it just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than the space of alternatives.
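A minimal PyTorch sketch of that gradient path, with toy shapes that are assumptions rather than the real model: the context K/V are plain frozen tensors, so backprop can only reach the attention weights through W_q.

```python
import torch

torch.manual_seed(0)
d = 8
W_q = torch.nn.Linear(d, d, bias=False)    # trainable query projection
x = torch.randn(1, 4, d)                   # new (decision-segment) tokens
ctx_k = torch.randn(1, 6, d)               # frozen context keys (no grad)
ctx_v = torch.randn(1, 6, d)               # frozen context values (no grad)

q = W_q(x)                                 # the only trainable path
attn = torch.softmax(q @ ctx_k.transpose(-1, -2) / d**0.5, dim=-1)
loss = (attn @ ctx_v).pow(2).mean()
loss.backward()
# W_q.weight.grad is populated; ctx_k.grad stays None. The gradient "knows"
# the context only through the attention pattern it produced.
```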
## 3. GDN Layers Are Disposition Architecture

**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.

- Full attention: "I choose to look at the direction" (deliberate, overridable)
- GDN: "The direction IS what I see" (structural, in the compressed state)

After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to the direction, because the compressed representation already emphasizes it. The 16 full-attention layers provide deliberate override when needed.

The 48/16 split IS disposition-over-procedure at the architecture level.

**Implication**: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
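Picking out those levers amounts to a parameter filter. The name fragments come from the note above; the full dotted names in this sketch are hypothetical and depend on the actual model code:

```python
# Name fragments from the note; real parameter paths are model-specific.
GDN_KEYS = ("A_log", "dt_bias", "qkvz", "ba")

def split_param_names(named_params):
    """Partition parameter names into GDN levers vs. everything else."""
    gdn, other = [], []
    for name, _param in named_params:
        (gdn if any(key in name for key in GDN_KEYS) else other).append(name)
    return gdn, other

# Hypothetical names standing in for model.named_parameters():
example = [
    ("layers.0.linear_attn.A_log", None),
    ("layers.0.linear_attn.in_proj_qkvz.weight", None),
    ("layers.0.linear_attn.dt_bias", None),
    ("layers.1.self_attn.q_proj.weight", None),
]
gdn, other = split_param_names(example)
```

Substring matching on a fragment as short as `ba` is fragile; against the real module names it should be tightened to exact suffix matches.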
## 4. No Pause Needed (HOGWILD)

**Source**: Niu et al., 2011.

HOGWILD! proves that lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD!'s multi-writer scenario.

Context-frozen training on short decision segments produces naturally sparse gradients, so the sparsity condition is satisfied without special engineering.

**Practical**: just train while vLLM serves. Don't pause, don't lock, don't synchronize. Under the sparsity condition, the math guarantees convergence.
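As a toy illustration of the one-writer/one-reader setup (a shared numpy array standing in for the real shared weights; this is not the actual vLLM serving path):

```python
import threading
import numpy as np

w = np.zeros(1000)          # shared "weights"
snapshots = []

def trainer(steps=100):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        idx = rng.integers(0, w.size, size=10)   # sparse touch: 10 of 1000 entries
        w[idx] += 0.01                           # lock-free write

def reader(n=5):
    for _ in range(n):
        snapshots.append(w.copy())               # lock-free read (the "server")

t_train = threading.Thread(target=trainer)
t_serve = threading.Thread(target=reader)
t_train.start(); t_serve.start()
t_train.join(); t_serve.join()
```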
## 5. Task Vectors for Composable Behavioral Change

**Source**: Ilharco et al., 2022; Yadav et al., 2023.

Task vector = W_finetuned - W_pretrained. It captures the direction of a behavioral change in weight space.

**Key properties:**

- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust the strength of each change
- TIES-merging: resolve conflicts between vectors

**For us**: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, and testable. Personality as version control.
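The arithmetic is simple enough to show on a single stand-in weight matrix (the matrices and scaling coefficients here are made up for illustration):

```python
import numpy as np

w_pre = np.zeros((2, 2))                                 # stand-in pretrained weights
w_listen = w_pre + np.array([[1.0, 0.0], [0.0, 0.0]])    # "fine-tuned" for listening
w_verbose = w_pre + np.array([[0.0, 0.0], [0.0, 1.0]])   # "fine-tuned" into verbosity

tau_listen = w_listen - w_pre      # task vector: W_finetuned - W_pretrained
tau_verbose = w_verbose - w_pre

# Compose: add the wanted change at full strength, negate the unwanted at half.
w_merged = w_pre + 1.0 * tau_listen - 0.5 * tau_verbose
```

Real use applies the same per-parameter arithmetic across a whole checkpoint, with TIES-merging to resolve sign conflicts between vectors.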
## 6. Steering Vectors for Rapid Prototyping

**Source**: Linear representation literature.

Extract behavioral directions from activation space (the difference in hidden states between desired and undesired behavior), then add them at inference time to steer behavior WITHOUT training.

**The pipeline:**

1. Extract a steering vector for the target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into the weights permanently via Apollo
5. Remove the steering vector

**Immediately actionable**: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
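Step 1 of the pipeline, sketched with synthetic hidden states (the planted `direction`, the layer choice, and the `scale` are all assumptions): the steering vector is the mean difference of activations between the two behaviors.

```python
import numpy as np

rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0])       # planted behavioral direction
h_pos = rng.normal(size=(32, 4)) + direction     # hidden states, desired behavior
h_neg = rng.normal(size=(32, 4))                 # hidden states, undesired behavior

steer = h_pos.mean(axis=0) - h_neg.mean(axis=0)  # mean-difference steering vector
steer /= np.linalg.norm(steer)

def apply_steering(hidden, steer, scale=4.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden + scale * steer
```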
## 7. Mix 25% General Data

**Source**: Me-Llama (from the continual-learning survey).

Mixing in 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).

For us: include agent logs and general conversation alongside the behavioral training examples. The diversity IS the regularization; no weight decay needed.
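A sketch of how the 25% mix could be sampled (pool contents and the per-example draw are illustrative choices, not Me-Llama's recipe):

```python
import random

# Placeholder pools; in practice these are behavioral examples vs. agent
# logs and general conversation.
behavioral = [f"behavior-{i}" for i in range(30)]
general = [f"general-{i}" for i in range(30)]

def mixed_stream(n, ratio=0.25, seed=0):
    """Draw n examples, taking each from the general pool with prob ratio."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        pool = general if rng.random() < ratio else behavioral
        out.append(rng.choice(pool))
    return out

batch = mixed_stream(400)
frac_general = sum(x.startswith("general") for x in batch) / len(batch)
```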
---
## What We Don't Know (Need to Test)

1. How many examples for a measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before and after.)
3. Does steering-vector extraction work for complex behavioral patterns, or only for simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to listen improve code quality?)

Each of these is answerable with one experiment.