Commit graph

11 commits

Author SHA1 Message Date
ProofOfConcept
ac9a9034fb apollo: rewrite optimizer from paper's math + add research analysis
Corrections from reading the full paper (arXiv:2412.05270):
- Add gradient scale factor α = √(n/r) — compensates for systematic
  ratio between compact and original space scaling factors
- Add norm-growth limiter (γ=1.01) — prevents loss spikes in early training
- Refresh projection matrix every 200 steps, not every step
- Channel-wise scaling for rank>1, tensor-wise for rank=1
- Scaling applies as G·diag(s), preserving gradient direction per channel

Research writeup in training/research/apollo-paper-analysis.md covers:
- Full mathematical derivation (equations 1-9)
- Theorems 4.1 and 4.2 (JL-based approximation bounds)
- Why Apollo can beat AdamW (directional sharpness, Hessian spectra)
- Fine-tuning results (matches AdamW at 0 memory cost)
- Ablation studies (rank, scaling granularity, projection method)
- Implications for our behavioral fine-tuning use case
2026-03-31 00:54:17 -04:00
ProofOfConcept
60e61555c7 DESIGN.md: complete rewrite reflecting validated architecture
HOGWILD (no pause), rank-256, channel scaling, CUDA IPC validated
(851/851 params, forward+backward confirmed), dream-loop-as-trainer,
Anthropic instruction stripping method, diversity as regularization,
in-place checkpoint sync, three-tier training pipeline.
2026-03-31 00:42:53 -04:00
ProofOfConcept
2ecf4e21ff weight_mapping: strip language_model prefix to match HF text model names 2026-03-30 23:11:03 -04:00
ProofOfConcept
6fb9735def weight_mapping: fix name prefix, add attention QKV dims 2026-03-30 23:09:08 -04:00
ProofOfConcept
d0883e101b checkpoint: sync live weights back into model safetensors in-place
mmap each safetensors file, diff block-by-block against live GPU
weights, memcpy only changed blocks. No separate checkpoint files —
the model directory IS the checkpoint. Every 10 min via cron.
2026-03-30 22:55:23 -04:00
ProofOfConcept
c1245ab139 apollo-checkpoint: efficient diff-based GPU weight checkpointing
Rust tool that mmaps previous checkpoint, diffs against live GPU weights
(via CUDA IPC handles), and only writes changed blocks. For small
behavioral training steps, turns 54GB write into ~500MB.

Also includes vllm_export_hook.py with direct source patch approach —
exports IPC handles from vLLM's worker subprocess after model load.

Run every 10 minutes via cron to protect against vLLM crashes.
Daily rsync to moria for long-term storage.
2026-03-30 22:53:17 -04:00
ProofOfConcept
5f41898bb8 vllm launcher with apollo hook 2026-03-30 22:24:02 -04:00
ProofOfConcept
0402a9333c vllm weight export hook: monkey-patches model runner to save IPC handles on load 2026-03-30 22:20:04 -04:00
ProofOfConcept
8e7b4a22db apollo: default rank 256 — 0.25% compute cost, captures gradient structure across 100+ examples 2026-03-30 22:16:34 -04:00
ProofOfConcept
e1cd4fb0ab apollo: make rank configurable (default 1 = Mini, higher ranks for experimentation) 2026-03-30 22:06:31 -04:00
ProofOfConcept
c5d7d8cb5d apollo-mini training system: initial implementation
Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC
shared weight memory between vLLM and the training process:

- apollo_mini.py: rank-1 optimizer (SGD memory, AdamW quality)
- apollo_worker.py: HTTP daemon coordinating training with vLLM
- weight_mapping.py: vLLM merged → HF separate layout (zero-copy views)
- training_example.py: tokenization with chat template
- export_weights.py: CUDA IPC handle export from vLLM
- train.py: standalone training script (alternative to daemon)
- DESIGN.md: architecture and protocol documentation

Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200).
Apollo-Mini rank-1 projection + scaling + in-place update confirmed.

Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>
2026-03-30 22:02:37 -04:00