Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for systematic
ratio between compact and original space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel
Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW at 0 memory cost)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case
Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC
shared weight memory between vLLM and the training process:
- apollo_mini.py: rank-1 optimizer (SGD memory, AdamW quality)
- apollo_worker.py: HTTP daemon coordinating training with vLLM
- weight_mapping.py: vLLM merged → HF separate layout (zero-copy views)
- training_example.py: tokenization with chat template
- export_weights.py: CUDA IPC handle export from vLLM
- train.py: standalone training script (alternative to daemon)
- DESIGN.md: architecture and protocol documentation
Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200).
Apollo-Mini rank-1 projection + scaling + in-place update confirmed.
Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>