consciousness

History

ProofOfConcept ac9a9034fb apollo: rewrite optimizer from paper's math + add research analysis Corrections from reading the full paper (arXiv:2412.05270): - Add gradient scale factor α = √(n/r) — compensates for systematic ratio between compact and original space scaling factors - Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training - Refresh projection matrix every 200 steps, not every step - Channel-wise scaling for rank>1, tensor-wise for rank=1 - Scaling applies as G·diag(s), preserving gradient direction per channel Research writeup in training/research/apollo-paper-analysis.md covers: - Full mathematical derivation (equations 1-9) - Theorems 4.1 and 4.2 (JL-based approximation bounds) - Why Apollo can beat AdamW (directional sharpness, Hessian spectra) - Fine-tuning results (matches AdamW at 0 memory cost) - Ablation studies (rank, scaling granularity, projection method) - Implications for our behavioral fine-tuning use case	2026-03-31 00:54:17 -04:00
..
apollo-paper-analysis.md	apollo: rewrite optimizer from paper's math + add research analysis	2026-03-31 00:54:17 -04:00

ProofOfConcept ac9a9034fb apollo: rewrite optimizer from paper's math + add research analysis

Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for systematic
  ratio between compact and original space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel

Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW at 0 memory cost)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case

2026-03-31 00:54:17 -04:00

apollo-paper-analysis.md

apollo: rewrite optimizer from paper's math + add research analysis

2026-03-31 00:54:17 -04:00