training: integrate /train into vLLM process (no separate daemon)

Remove the standalone apollo_worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
- /train served on same port as /completions, /score
- Lazy-loads HF model with vLLM weight views on first request
- HOGWILD training: no pause, weights updated in-place

The previous architecture had a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. This was wrong -
training should run in-process, sharing GPU memory directly.
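
The lazy first-request load can be sketched as a double-checked-locking singleton. This is an illustrative sketch, not the actual train_router.py code; `LazyTrainer` and `build_fn` are hypothetical names:

```python
import threading

class LazyTrainer:
    """Build the HF-side training state on first /train request only,
    so vLLM startup time and memory are untouched until training is used."""

    def __init__(self, build_fn):
        self._build_fn = build_fn   # e.g. wraps vLLM weight views in an HF module
        self._lock = threading.Lock()
        self._trainer = None
        self.build_count = 0

    def get(self):
        # Double-checked locking: lock-free fast path after the first build
        if self._trainer is None:
            with self._lock:
                if self._trainer is None:
                    self._trainer = self._build_fn()
                    self.build_count += 1
        return self._trainer

lazy = LazyTrainer(lambda: object())
assert lazy.get() is lazy.get()   # same instance both times
assert lazy.build_count == 1      # built exactly once, on first request
```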

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-16 00:48:05 -04:00
parent 2f08149fab
commit 7e7e9a4b69
6 changed files with 320 additions and 542 deletions


@@ -22,25 +22,29 @@ The training signal comes from two sources:
 │                                                     │
 │  ┌──────────────────────────────────────────────┐   │
 │  │          Model Weights (54GB, bf16)          │   │
-│  │             Shared via CUDA IPC              │   │
+│  │     Shared: vLLM inference + HF training     │   │
 │  └──────────────┬──────────────┬────────────────┘   │
 │                 │              │                    │
 │  ┌──────────────▼──┐  ┌───────▼────────────────┐    │
-│  │ vLLM (inference)│  │ Apollo (training)      │    │
+│  │ vLLM (inference)│  │ HF model (training)    │    │
 │  │ KV cache ~60GB  │  │ Gradients ~54GB        │    │
-│  │ Serves requests │  │ Optimizer state ~10GB  │    │
-│  │ Never paused    │  │ Activations ~10GB      │    │
+│  │ /completions    │  │ Optimizer state ~10GB  │    │
+│  │ /score          │  │ Views into vLLM weights│    │
+│  │ /train ─────────┼──┼─► Apollo optimizer     │    │
 │  └─────────────────┘  └────────────────────────┘    │
 └─────────────────────────────────────────────────────┘
                       Moria B200
+        Single vLLM process serves everything
+        No separate daemon - /train is a vLLM route
 
+                                Moria B200 (vLLM)
 ┌──────────────────┐           ┌──────────────────┐
-│ Training signal  │   HTTP    │  Apollo worker   │
-│ agent            │──────────>│  daemon          │
-│                  │           │                  │
-│ Dream loop       │           │  Checkpoint sync │
-│ (generates       │           │  (mmap + diff,   │
-│ scenarios)       │           │  every 10 min)   │
+│ Training signal  │   HTTP    │  /completions    │
+│ agent            │──────────>│  /score          │
+│                  │           │  /train          │
+│ Dream loop       │           │                  │
+│ (generates       │           │  Checkpoint sync │
+│ scenarios)       │           │  (10 min batched)│
 └──────────────────┘           └──────────────────┘
 ```
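
The shared-weights box in the diagram is the key invariant: the training side's HF-style tensors are views into vLLM's merged buffers, not copies, so an in-place optimizer step is immediately visible to inference. A minimal numpy sketch of the idea (torch views behave the same way; the dimensions and names here are illustrative, not the real weight_mapping.py layout):

```python
import numpy as np

# vLLM stores QKV as one merged matrix; HF expects three separate ones.
hidden, q_dim, kv_dim = 8, 8, 4
qkv = np.zeros((q_dim + 2 * kv_dim, hidden), dtype=np.float32)  # merged buffer

# Basic slices are views into the merged buffer -- no copy made.
q_proj = qkv[:q_dim]
k_proj = qkv[q_dim:q_dim + kv_dim]
v_proj = qkv[q_dim + kv_dim:]

assert q_proj.base is qkv and k_proj.base is qkv  # views, not copies

k_proj += 1.0  # simulated in-place training update on the "HF weight"
assert qkv[q_dim, 0] == 1.0  # the update landed in the shared merged buffer
```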
@@ -220,34 +224,30 @@ a few hundred MB.
 ## Components
 
 ### Built ✓
 
-- `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
-- `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
+- `optimizer.py` — Apollo optimizer (configurable rank, default 256)
+- `train_router.py` — /train endpoint, runs in vLLM process
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `training_example.py` — tokenization with chat template
-- `vllm_export_hook.py` — source patch for IPC handle export
-- `checkpoint/` — Rust tool for mmap + diff checkpoint sync
+- `export_hook.py` — vLLM plugin hook for IPC handle export
+- `checkpoint_sync.py` — mmap + diff checkpoint sync (Python)
 
 ### To build
 
-- **Dream loop → training bridge**: connect dream output to Apollo
+- **Dream loop → training bridge**: connect dream output to /train
 - **Training-signal agent**: flags moments in conversation logs
 - **Instruction stripping**: remove scaffolding from training examples
 - **Quality monitoring**: track model capability over time
-- **HF model forward pass integration**: wire into apollo_worker
 
 ## Files
 
 ```
 training/
-    DESIGN.md — this document
-    apollo_mini.py — Apollo optimizer
-    apollo_worker.py — HTTP training daemon
-    weight_mapping.py — vLLM ↔ HF weight views
-    training_example.py — tokenization helpers
-    export_weights.py — standalone weight export (unused)
-    vllm_export_hook.py — vLLM source patch for IPC export
-    start_vllm_with_apollo.sh — vLLM launcher (unused, using source patch)
-    train.py — standalone training script (alternative)
-    checkpoint/
-        Cargo.toml — Rust checkpoint tool
-        src/main.rs — mmap + diff sync
+    DESIGN.md — this document
+    pyproject.toml — package config, vLLM plugin entry point
+    apollo_plugin/
+        __init__.py — plugin registration
+        export_hook.py — patches vLLM to export IPC handles
+        train_router.py — /train endpoint (FastAPI router)
+        optimizer.py — Apollo optimizer
+        weight_mapping.py — vLLM ↔ HF weight views
+        checkpoint_sync.py — mmap + diff sync to safetensors
+        steering.py — steering vector extraction (experimental)
 ```
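
The mmap + diff scheme behind checkpoint_sync.py can be sketched with the standard library. This is an illustrative sketch only: `sync_diff` and the 1 MiB chunk size are assumptions, and the real tool targets safetensors on a batched 10-minute cadence:

```python
import mmap
import os

CHUNK = 1 << 20  # compare in 1 MiB chunks

def sync_diff(src_path: str, dst_path: str) -> int:
    """Copy only the chunks of src_path that differ from dst_path.

    Returns bytes written, so a mostly-unchanged checkpoint costs far
    less I/O than rewriting the full file."""
    size = os.path.getsize(src_path)
    with open(dst_path, "ab") as f:  # create destination if missing
        f.truncate(size)             # match source length (zero-fills growth)
    if size == 0:
        return 0
    with open(src_path, "rb") as fs, open(dst_path, "r+b") as fd:
        ms = mmap.mmap(fs.fileno(), 0, access=mmap.ACCESS_READ)
        md = mmap.mmap(fd.fileno(), 0)
        written = 0
        for off in range(0, size, CHUNK):
            end = min(off + CHUNK, size)
            if ms[off:end] != md[off:end]:   # chunk changed since last sync
                md[off:end] = ms[off:end]
                written += end - off
        md.flush()
        ms.close()
        md.close()
    return written
```

Unchanged runs of weights cost only a read and compare; only dirty chunks hit the disk.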