Commit graph

5 commits

Author SHA1 Message Date
Kent Overstreet
68a2df2185 training: use rank 64, define as single constant
- DEFAULT_RANK = 64 in train_router.py
- All references use the constant, not magic numbers
- ~2.5GB optimizer state instead of ~10GB

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-16 02:04:26 -04:00
Kent Overstreet
039473d31f training: persist Apollo optimizer state across /train calls
Optimizer state (momentum, variance estimates) now persists between
training sessions:

- Saved to /tmp/apollo_optimizer_state.pt during checkpoint sync
- Restored on next /train call if available
- Preserves training continuity for incremental learning

Previously each /train call started with fresh optimizer state,
losing accumulated gradient history.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-16 02:04:26 -04:00
Kent Overstreet
7e7e9a4b69 training: integrate /train into vLLM process (no separate daemon)
Remove standalone worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
- /train served on same port as /completions, /score
- Lazy-loads HF model with vLLM weight views on first request
- HOGWILD training: no pause, weights updated in-place

The previous architecture had a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. This was wrong -
training should run in-process, sharing GPU memory directly.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-16 02:04:26 -04:00
Kent Overstreet
2f08149fab /finetune: expose all Apollo optimizer settings
lr, rank, betas, eps, weight_decay, warmup_steps,
scale, proj_refresh, norm_growth_limit — all optional
with sensible defaults.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-15 23:19:22 -04:00
Kent Overstreet
a73bcf5ae3 training: restructure as vLLM plugin package
- Convert to installable package with entry points for vLLM auto-discovery
- Add checkpoint_sync.py: Python replacement for Rust checkpoint binary
  - Block-level diffing of safetensors files (4KB blocks)
  - vLLM→HF weight name conversion built-in
  - Scheduled 10min after training jobs (batched)
- API change: /train now takes raw token IDs (context_ids + continuation_ids)
  - No tokenizer on training side, client owns tokenization
- Remove superseded code: standalone scripts, Rust binary, tokenizer helpers

Install: pip install -e ./training
Then vLLM auto-loads via entry point.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-15 23:16:53 -04:00