# Training Research: What We Know

## 1. The Optimizer (Apollo)

**Source**: arXiv:2412.05270, read in full.

Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.

**Critical implementation details we got wrong initially:**
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, scaling factors are systematically too small.
- Norm-growth limiter (γ = 1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.

**What the paper actually shows**: channel-wise scaling is sufficient for LLM training; the per-element granularity in AdamW is redundant. Apollo also achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training: flat minima mean broad behavioral patterns, not narrow pattern matching.

**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this range and finds it works.

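The three implementation details above can be sketched as a single update step. This is a minimal sketch of the channel-wise scaling idea, not the paper's exact algorithm: `apollo_step` and its state layout are hypothetical names, and the second-moment bookkeeping is deliberately simplified.

```python
import numpy as np

def apollo_step(W, G, state, lr=1e-4, rank=256, beta2=0.999,
                gamma=1.01, eps=1e-8, step=1, refresh_every=200):
    """One simplified Apollo-style update. Channel-wise scaling factors are
    estimated in a low-rank space reached by a random projection (not SVD),
    then applied to the full-rank gradient."""
    n, m = G.shape
    r = min(rank, n)
    # Refresh the random projection every `refresh_every` steps, not every step.
    if "P" not in state or step % refresh_every == 1:
        state["P"] = np.random.default_rng(step).standard_normal((r, n)) / np.sqrt(r)
        state.setdefault("v", np.zeros(m))      # per-channel second moment
    R = state["P"] @ G                          # projected gradient, shape (r, m)
    state["v"] = beta2 * state["v"] + (1 - beta2) * np.sum(R * R, axis=0)
    v_hat = state["v"] / (1 - beta2 ** step)    # bias correction
    # alpha = sqrt(n / r) compensates for the projection ratio; without it
    # the channel scales are systematically too small.
    update = G * (np.sqrt(n / r) / (np.sqrt(v_hat) + eps))
    # Norm-growth limiter: the update norm may grow at most gamma-fold per step.
    norm = np.linalg.norm(update)
    limit = gamma * state.get("prev_norm", norm)
    if norm > limit:
        update *= limit / norm
        norm = limit
    state["prev_norm"] = norm
    return W - lr * update, state
```

A few iterations on a toy quadratic confirm the step direction and the state handling; the real optimizer applies this per weight matrix.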
## 2. Where the Gradient Goes (W_q)

**Key insight**: In context-frozen training, the gradient flows through W_q (the query projection) even when the context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.

**Math**: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern; it just can't change the context keys.

**For behavioral training**: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than scan for alternatives.

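The math above can be checked numerically. Below is a toy single-query attention head in numpy, with `K` and `V` standing in for the frozen context; all names are illustrative, not from any real codebase. The analytic gradient shows the dependence described: dL/dq is a weighted combination of the frozen keys, so the frozen context shapes the W_q update even though the keys themselves never change.

```python
import numpy as np

def attn_loss(Wq, x, K, V):
    """Toy single-query attention. K, V are the frozen context; only Wq trains."""
    q = Wq @ x                                  # query for the new token
    scores = K @ q / np.sqrt(len(q))            # one score per frozen context key
    a = np.exp(scores - scores.max())
    a = a / a.sum()                             # attention weights alpha_j
    return float(a @ V @ np.ones(V.shape[1]))   # scalar loss: sum of attended output

def grad_Wq(Wq, x, K, V):
    """Analytic dL/dWq. The frozen keys enter through the softmax Jacobian."""
    d = Wq.shape[0]
    q = Wq @ x
    scores = K @ q / np.sqrt(d)
    a = np.exp(scores - scores.max())
    a = a / a.sum()
    g_a = V @ np.ones(V.shape[1])               # dL/dalpha_j
    g_s = a * (g_a - a @ g_a)                   # softmax backward
    g_q = (K.T @ g_s) / np.sqrt(d)              # dL/dq: combination of frozen keys
    return np.outer(g_q, x)                     # dL/dWq = (dL/dq) x^T
```

Verifying `grad_Wq` against central finite differences of `attn_loss` confirms the derivation term by term.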
## 3. GDN Layers Are Disposition Architecture

**Novel insight**: The 48 GDN layers (75% of Qwen3.5's 64 layers) make behavioral training DEEPER than full attention would.

- Full attention: "I choose to look at direction" (deliberate, overridable)
- GDN: "Direction IS what I see" (structural, in the compressed state)

After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.

The 48/16 split IS disposition-over-procedure at the architecture level.

**Implication**: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.

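Targeting those levers comes down to a parameter-name filter. The sketch below assumes a `model.layers.N.linear_attn.*` path layout, which is hypothetical; only the substrings (A_log, dt_bias, qkvz, ba projections) come from the text above.

```python
def is_gdn_lever(param_name: str) -> bool:
    """True if the parameter is one of the GDN levers named in the text.
    The "model.layers.N.linear_attn" layout used below is an assumption."""
    levers = ("A_log", "dt_bias", "qkvz_proj", "ba_proj")
    return any(lever in param_name for lever in levers)

# Hypothetical parameter names, for illustration only.
names = [
    "model.layers.0.linear_attn.A_log",
    "model.layers.0.linear_attn.dt_bias",
    "model.layers.0.linear_attn.qkvz_proj.weight",
    "model.layers.0.linear_attn.ba_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.layers.1.mlp.gate_proj.weight",
]
trainable = [n for n in names if is_gdn_lever(n)]
```

In a real run this filter would gate which parameter groups get a learning rate at all.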
## 4. No Pause Needed (HOGWILD)

**Source**: Niu et al., 2011.

HOGWILD proves lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.

Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.

**Practical**: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The convergence guarantee carries over as long as the updates stay sparse.

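The one-writer, one-reader pattern looks like this in miniature. A toy sketch, not the actual vLLM integration: one thread applies sparse additive updates to a shared array while another reads it concurrently, with no locks anywhere.

```python
import threading
import numpy as np

# One training thread writes the shared weights; one serving thread reads them.
# No locks, no pauses: with sparse additive updates, every intermediate state
# the reader sees is a valid, slightly stale model -- the HOGWILD argument.
weights = np.zeros(1024)
done = threading.Event()
observed = []

def trainer(steps=200):
    rng = np.random.default_rng(0)
    for _ in range(steps):
        idx = rng.choice(weights.size, size=8, replace=False)  # sparse update
        weights[idx] += 0.01
    done.set()

def server():
    while not done.is_set():
        observed.append(float(weights.sum()))  # serve from whatever state exists

t, s = threading.Thread(target=trainer), threading.Thread(target=server)
t.start(); s.start()
t.join(); s.join()
```

No update is ever lost and the reader never sees garbage; it only ever sees a model that is a few sparse updates behind.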
## 5. Task Vectors for Composable Behavioral Change

**Source**: Ilharco et al., 2022; Yadav et al., 2023.

Task vector = W_finetuned - W_pretrained. It captures the direction of a behavioral change in weight space.

**Key properties:**
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors

**For us**: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.

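The first three properties are plain arithmetic over state dicts. A minimal sketch (function names are ours; TIES-style conflict resolution is omitted):

```python
import numpy as np

def task_vector(w_ft, w_pre):
    """tau = W_finetuned - W_pretrained, computed per parameter tensor."""
    return {k: w_ft[k] - w_pre[k] for k in w_pre}

def apply_task_vectors(w_base, vectors, scales):
    """W = W_base + sum_i scale_i * tau_i.
    scale > 0 adds a behavior, scale < 0 negates it, magnitude tunes strength."""
    out = {k: v.astype(float).copy() for k, v in w_base.items()}
    for tau, s in zip(vectors, scales):
        for k in out:
            out[k] += s * tau[k]
    return out
```

Composing two behaviors and then subtracting one back out is exactly the "independently tunable, removable" property.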
## 6. Steering Vectors for Rapid Prototyping

**Source**: Linear representation literature.

Extract behavioral directions from activation space (the difference in hidden states between desired and undesired behavior). Add the direction at inference time to steer behavior WITHOUT training.

**The pipeline:**
1. Extract a steering vector for the target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into the weights permanently via Apollo
5. Remove the steering vector

**Immediately actionable**: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.

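Step 1 in its simplest form is a mean difference of activations; step 2 is a single vector addition to a hidden state. A sketch of that baseline extraction (the mean-difference method, on synthetic activations; real extraction would hook a chosen layer of the model):

```python
import numpy as np

def steering_vector(pos_acts, neg_acts):
    """Mean-difference steering vector: average hidden state under the desired
    behavior minus average hidden state under the undesired behavior."""
    return np.mean(pos_acts, axis=0) - np.mean(neg_acts, axis=0)

def steer(hidden, v, strength=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden + strength * v
```

With a planted direction the extraction recovers it, which is the sanity check to run before trusting the vector on real activations.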
## 7. Mix 25% General Data

**Source**: Me-Llama (from continual learning survey).

Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).

For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.

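The mixing itself is a one-function batch sampler. A sketch with hypothetical names, assuming example lists rather than a streaming dataset:

```python
import random

def mixed_batches(behavioral, general, batch_size=8, general_frac=0.25, seed=0):
    """Yield batches where ~25% of examples come from general-domain data."""
    rng = random.Random(seed)
    n_general = max(1, round(batch_size * general_frac))
    while True:
        batch = (rng.sample(behavioral, batch_size - n_general)
                 + rng.sample(general, n_general))
        rng.shuffle(batch)      # don't let the model see the seam
        yield batch
```

With batch_size=8 that is 6 behavioral and 2 general examples per batch, every batch.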
---

## What We Don't Know (Need to Test)

1. How many examples for measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after.)
3. Does steering vector extraction work for complex behavioral patterns, or only for simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to listen improve code quality?)

Each of these is answerable with one experiment.