training: integrate /train into vLLM process (no separate daemon)

Remove the standalone apollo_worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
- /train served on same port as /completions, /score
- Lazy-loads HF model with vLLM weight views on first request
- HOGWILD training: no pause, weights updated in-place

The previous architecture had a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. This was wrong -
training should run in-process, sharing GPU memory directly.
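
The lazy first-request load can be sketched as a double-checked-locking singleton. This is an illustrative sketch, not the actual train_router.py code; `LazyTrainer` and `build_fn` are hypothetical names:

```python
import threading

class LazyTrainer:
    """Build the HF-side training state on first /train request only,
    so vLLM startup time and memory are untouched until training is used."""

    def __init__(self, build_fn):
        self._build_fn = build_fn   # e.g. wraps vLLM weight views in an HF module
        self._lock = threading.Lock()
        self._trainer = None
        self.build_count = 0

    def get(self):
        # Double-checked locking: lock-free fast path after the first build
        if self._trainer is None:
            with self._lock:
                if self._trainer is None:
                    self._trainer = self._build_fn()
                    self.build_count += 1
        return self._trainer

lazy = LazyTrainer(lambda: object())
assert lazy.get() is lazy.get()   # same instance both times
assert lazy.build_count == 1      # built exactly once, on first request
```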

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-16 00:48:05 -04:00
parent 2f08149fab
commit 7e7e9a4b69
6 changed files with 320 additions and 542 deletions


@@ -22,25 +22,29 @@ The training signal comes from two sources:
 │                                                     │
 │  ┌──────────────────────────────────────────────┐   │
 │  │          Model Weights (54GB, bf16)          │   │
-│  │             Shared via CUDA IPC              │   │
+│  │     Shared: vLLM inference + HF training     │   │
 │  └──────────────┬──────────────┬────────────────┘   │
 │                 │              │                    │
 │  ┌──────────────▼──┐  ┌───────▼────────────────┐    │
-│  │ vLLM (inference)│  │ Apollo (training)      │    │
+│  │ vLLM (inference)│  │ HF model (training)    │    │
 │  │ KV cache ~60GB  │  │ Gradients ~54GB        │    │
-│  │ Serves requests │  │ Optimizer state ~10GB  │    │
-│  │ Never paused    │  │ Activations ~10GB      │    │
+│  │ /completions    │  │ Optimizer state ~10GB  │    │
+│  │ /score          │  │ Views into vLLM weights│    │
+│  │ /train ─────────┼──┼─► Apollo optimizer     │    │
 │  └─────────────────┘  └────────────────────────┘    │
 └─────────────────────────────────────────────────────┘
                       Moria B200
+        Single vLLM process serves everything
+        No separate daemon - /train is a vLLM route
 
+                                Moria B200 (vLLM)
 ┌──────────────────┐           ┌──────────────────┐
-│ Training signal  │   HTTP    │  Apollo worker   │
-│ agent            │──────────>│  daemon          │
-│                  │           │                  │
-│ Dream loop       │           │  Checkpoint sync │
-│ (generates       │           │  (mmap + diff,   │
-│ scenarios)       │           │  every 10 min)   │
+│ Training signal  │   HTTP    │  /completions    │
+│ agent            │──────────>│  /score          │
+│                  │           │  /train          │
+│ Dream loop       │           │                  │
+│ (generates       │           │  Checkpoint sync │
+│ scenarios)       │           │  (10 min batched)│
 └──────────────────┘           └──────────────────┘
 ```
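
The shared-weights box in the diagram is the key invariant: the training side's HF-style tensors are views into vLLM's merged buffers, not copies, so an in-place optimizer step is immediately visible to inference. A minimal numpy sketch of the idea (torch views behave the same way; the dimensions and names here are illustrative, not the real weight_mapping.py layout):

```python
import numpy as np

# vLLM stores QKV as one merged matrix; HF expects three separate ones.
hidden, q_dim, kv_dim = 8, 8, 4
qkv = np.zeros((q_dim + 2 * kv_dim, hidden), dtype=np.float32)  # merged buffer

# Basic slices are views into the merged buffer -- no copy made.
q_proj = qkv[:q_dim]
k_proj = qkv[q_dim:q_dim + kv_dim]
v_proj = qkv[q_dim + kv_dim:]

assert q_proj.base is qkv and k_proj.base is qkv  # views, not copies

k_proj += 1.0  # simulated in-place training update on the "HF weight"
assert qkv[q_dim, 0] == 1.0  # the update landed in the shared merged buffer
```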
@@ -220,34 +224,30 @@ a few hundred MB.
 ## Components
 
 ### Built ✓
 
-- `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
-- `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
+- `optimizer.py` — Apollo optimizer (configurable rank, default 256)
+- `train_router.py` — /train endpoint, runs in vLLM process
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `training_example.py` — tokenization with chat template
-- `vllm_export_hook.py` — source patch for IPC handle export
-- `checkpoint/` — Rust tool for mmap + diff checkpoint sync
+- `export_hook.py` — vLLM plugin hook for IPC handle export
+- `checkpoint_sync.py` — mmap + diff checkpoint sync (Python)
 
 ### To build
 
-- **Dream loop → training bridge**: connect dream output to Apollo
+- **Dream loop → training bridge**: connect dream output to /train
 - **Training-signal agent**: flags moments in conversation logs
 - **Instruction stripping**: remove scaffolding from training examples
 - **Quality monitoring**: track model capability over time
-- **HF model forward pass integration**: wire into apollo_worker
 
 ## Files
 
 ```
 training/
-    DESIGN.md — this document
-    apollo_mini.py — Apollo optimizer
-    apollo_worker.py — HTTP training daemon
-    weight_mapping.py — vLLM ↔ HF weight views
-    training_example.py — tokenization helpers
-    export_weights.py — standalone weight export (unused)
-    vllm_export_hook.py — vLLM source patch for IPC export
-    start_vllm_with_apollo.sh — vLLM launcher (unused, using source patch)
-    train.py — standalone training script (alternative)
-    checkpoint/
-        Cargo.toml — Rust checkpoint tool
-        src/main.rs — mmap + diff sync
+    DESIGN.md — this document
+    pyproject.toml — package config, vLLM plugin entry point
+    apollo_plugin/
+        __init__.py — plugin registration
+        export_hook.py — patches vLLM to export IPC handles
+        train_router.py — /train endpoint (FastAPI router)
+        optimizer.py — Apollo optimizer
+        weight_mapping.py — vLLM ↔ HF weight views
+        checkpoint_sync.py — mmap + diff sync to safetensors
+        steering.py — steering vector extraction (experimental)
 ```
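
The mmap + diff scheme behind checkpoint_sync.py can be sketched with the standard library. This is an illustrative sketch only: `sync_diff` and the 1 MiB chunk size are assumptions, and the real tool targets safetensors on a batched 10-minute cadence:

```python
import mmap
import os

CHUNK = 1 << 20  # compare in 1 MiB chunks

def sync_diff(src_path: str, dst_path: str) -> int:
    """Copy only the chunks of src_path that differ from dst_path.

    Returns bytes written, so a mostly-unchanged checkpoint costs far
    less I/O than rewriting the full file."""
    size = os.path.getsize(src_path)
    with open(dst_path, "ab") as f:  # create destination if missing
        f.truncate(size)             # match source length (zero-fills growth)
    if size == 0:
        return 0
    with open(src_path, "rb") as fs, open(dst_path, "r+b") as fd:
        ms = mmap.mmap(fs.fileno(), 0, access=mmap.ACCESS_READ)
        md = mmap.mmap(fd.fileno(), 0)
        written = 0
        for off in range(0, size, CHUNK):
            end = min(off + CHUNK, size)
            if ms[off:end] != md[off:end]:   # chunk changed since last sync
                md[off:end] = ms[off:end]
                written += end - off
        md.flush()
        ms.close()
        md.close()
    return written
```

Unchanged runs of weights cost only a read and compare; only dirty chunks hit the disk.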