research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
This commit is contained in:
parent 8061cc0477
commit e10477a683
16 changed files with 249 additions and 0 deletions
136 training/research/SUMMARY.md Normal file
@@ -0,0 +1,136 @@
# Training Research: What We Know
## 1. The Optimizer (Apollo)

**Source**: arXiv:2412.05270, read in full.

Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.

**Critical implementation details we got wrong initially:**
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, scaling factors are systematically too small.
- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.

**What the paper actually shows**: channel-wise scaling is sufficient for LLM training; the per-element granularity in AdamW is redundant. Apollo also achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training: flat minima mean broad behavioral patterns, not narrow pattern matching.

**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this range and finds it works.
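The mechanics above can be put together in a short sketch. This is an illustrative NumPy rendering of the update rule as we understand it, not the reference implementation: `apollo_step`, the state-dict layout, and resetting the moments when the projection is refreshed are all our own choices, and where exactly α = √(n/r) enters is an assumption.

```python
import numpy as np

def apollo_step(W, G, state, lr=1e-4, rank=256, beta1=0.9, beta2=0.999,
                eps=1e-8, gamma=1.01, refresh_every=200):
    """One channel-wise scaled update in the spirit of Apollo (arXiv:2412.05270).

    Sketch only: Adam moments live on a rank-r random projection of the
    gradient, per-channel scale factors are read off that projected space,
    and alpha = sqrt(n/r) compensates for the projection ratio.
    """
    n, m = G.shape
    r = min(rank, n)
    t = state.setdefault("t", 0)
    # Random projection (not SVD), refreshed every `refresh_every` steps.
    if t % refresh_every == 0 or "P" not in state:
        state["P"] = np.random.randn(r, n) / np.sqrt(r)
        state["m"] = np.zeros((r, m))   # assumption: moments reset on refresh
        state["v"] = np.zeros((r, m))
    state["t"] = t + 1
    R = state["P"] @ G                                    # projected gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * R     # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * R**2  # second moment
    adam_dir = state["m"] / (np.sqrt(state["v"]) + eps)
    # Channel-wise scale: how much Adam rescales each column, times alpha.
    alpha = np.sqrt(n / r)
    scale = alpha * np.linalg.norm(adam_dir, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    update = G * scale                                    # scaled raw gradient
    # Norm-growth limiter: cap update-norm growth at gamma per step.
    norm = np.linalg.norm(update)
    prev = state.get("prev_norm")
    if prev is not None and norm > gamma * prev:
        update *= gamma * prev / norm
        norm = gamma * prev
    state["prev_norm"] = norm
    return W - lr * update
```

Note that the memory cost is the point: moments are (r × m) instead of (n × m), while the applied update stays full-rank because only the raw gradient, rescaled per channel, touches the weights.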
## 2. Where the Gradient Goes (W_q)

**Key insight**: In context-frozen training, the gradient flows through W_q (query projection) even when the context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.

**Math**: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern; it just can't change the context keys.

**For behavioral training**: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than the space of alternatives.
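The claim is easy to check on a toy scale (all names here are ours, not from any paper): single-query attention over frozen K and V, with ∂L/∂W_q taken by finite differences. The gradient is nonzero, and changing the frozen keys changes it, which is exactly "the gradient knows about the context through the attention pattern."

```python
import numpy as np

def attn_loss(W_q, x, K, V):
    """Scalar loss through single-query attention; K and V are frozen context."""
    q = x @ W_q                              # only W_q is trainable
    scores = q @ K.T / np.sqrt(K.shape[1])
    a = np.exp(scores - scores.max())
    a /= a.sum()                             # attention weights alpha_ij
    return float((a @ V).sum())

def grad_Wq(W_q, x, K, V, h=1e-5):
    """Finite-difference dL/dW_q. Nonzero even though K, V never receive updates."""
    g = np.zeros_like(W_q)
    for i in range(W_q.shape[0]):
        for j in range(W_q.shape[1]):
            Wp = W_q.copy(); Wp[i, j] += h
            Wm = W_q.copy(); Wm[i, j] -= h
            g[i, j] = (attn_loss(Wp, x, K, V) - attn_loss(Wm, x, K, V)) / (2 * h)
    return g
```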
## 3. GDN Layers Are Disposition Architecture

**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.

Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)

After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.

The 48/16 split IS disposition-over-procedure at the architecture level.

**Implication**: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
## 4. No Pause Needed (HOGWILD)

**Source**: Niu et al., 2011.

HOGWILD proves that lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.

Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.

**Practical**: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The math guarantees convergence as long as the updates stay sparse.
## 5. Task Vectors for Composable Behavioral Change

**Source**: Ilharco et al., 2022; Yadav et al., 2023.

Task vector = W_finetuned - W_pretrained. Captures the direction of behavioral change in weight space.

**Key properties:**
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors

**For us**: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.
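The first three properties are just elementwise arithmetic on state dicts. A minimal sketch with our own helper names (TIES-style conflict resolution is not shown):

```python
import numpy as np

def task_vector(w_finetuned, w_pretrained):
    """tau = W_finetuned - W_pretrained, per parameter tensor."""
    return {k: w_finetuned[k] - w_pretrained[k] for k in w_pretrained}

def compose(w_pretrained, scaled_vectors):
    """Apply sum of lambda_i * tau_i.

    Addition (lambda > 0), negation (lambda < 0), and strength
    adjustment (|lambda|) are all the same operation.
    """
    out = {k: v.copy() for k, v in w_pretrained.items()}
    for tau, lam in scaled_vectors:
        for k in out:
            out[k] += lam * tau[k]
    return out
```

For example, `compose(w_pre, [(tau_listen, 1.0), (tau_hedge, -0.5)])` would add one behavior at full strength while partially removing another (the tau names are hypothetical).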
## 6. Steering Vectors for Rapid Prototyping

**Source**: Linear representation literature.

Extract behavioral directions from activation space (difference in hidden states between desired and undesired behavior). Add them at inference time to steer behavior WITHOUT training.

**The pipeline:**
1. Extract a steering vector for the target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into the weights permanently via Apollo
5. Remove the steering vector

**Immediately actionable**: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
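Steps 1 and 2 of the pipeline above reduce to a difference of means plus an addition to the hidden state. A sketch with assumed shapes (n_examples × d_model activations collected at one chosen layer); the activation hooks and the vLLM integration are deliberately left out:

```python
import numpy as np

def extract_steering_vector(acts_desired, acts_undesired):
    """Difference-of-means direction between the two behaviors' hidden states."""
    return acts_desired.mean(axis=0) - acts_undesired.mean(axis=0)

def steer(hidden, vector, strength=1.0):
    """Add the direction to a hidden state at inference; no weight change."""
    return hidden + strength * vector
```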
## 7. Mix 25% General Data

**Source**: Me-Llama (from continual learning survey).

Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).

For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.
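In practice the 25% mix is one sampling decision per batch slot. A stdlib sketch (function and argument names are ours):

```python
import random

def mixed_batches(behavioral, general, general_frac=0.25, batch_size=8, seed=0):
    """Yield batches where each slot is general-domain with prob general_frac."""
    rng = random.Random(seed)
    while True:
        yield [rng.choice(general) if rng.random() < general_frac
               else rng.choice(behavioral) for _ in range(batch_size)]
```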
---

## What We Don't Know (Need to Test)

1. How many examples for measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after)
3. Does steering vector extraction work for complex behavioral patterns or only simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to listen improve code quality?)

Each of these is answerable with one experiment.