Two deep dives following curiosity:
- Why context-frozen training works: gradient flows through W_q (query
projection) even when context KVs are frozen. Model learns to LOOK AT
context differently, not represent it differently. This is exactly what
behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
better generalization. The coarseness of channel/tensor-wise scaling
prevents over-fitting to specific training examples. For behavioral
fine-tuning, this means learning 'accept direction' rather than
'accept this specific phrasing.'
Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for systematic
ratio between compact and original space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel
Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW at 0 memory cost)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case