training: move to dedicated subprocess with ZMQ communication

- Add training_worker.py: long-lived subprocess that handles GPU training
  work and owns the HF model wrapper (views into vLLM GPU memory), the
  Apollo optimizer, and checkpoint sync (request loop sketched after this
  list)

- train_router.py: now forwards /train requests via async ZMQ instead of
  running training in-process (see the forwarding sketch below the
  architecture note). Adds /checkpoint and /train/status endpoints

- export_hook.py: store model_path in __metadata__ so the training worker
  can find it without cross-process communication (export sketched after
  the process diagrams)

- This fixes two bugs:
  1. Process boundary issue - model_path was set in worker process but
     needed in API server process
  2. Blocking event loop - training blocked vLLM's async event loop
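
The worker's request loop, roughly. A minimal sketch: the message schema
and the helpers below are illustrative assumptions, not the actual
training_worker.py code.

```python
# Illustrative sketch of the worker loop -- the message schema and the
# assumed helpers below are not the actual training_worker.py code.
import torch
import zmq

SOCKET = "ipc:///tmp/apollo_training.sock"
HANDLES = "/tmp/vllm_weight_handles.pt"

def build_model_and_optimizer(handles, model_path):
    """Assumed helper: rebuild HF-module views over vLLM memory + Apollo."""
    raise NotImplementedError

def train_step(model, optimizer, batch):
    """Assumed helper: forward/backward/step on the shared weights."""
    raise NotImplementedError

def main():
    # Load IPC handles once; parameters become views into vLLM's GPU memory.
    handles = torch.load(HANDLES)
    model_path = handles.pop("__metadata__")["model_path"]
    model, optimizer = build_model_and_optimizer(handles, model_path)

    sock = zmq.Context().socket(zmq.REP)
    sock.bind(SOCKET)
    while True:                      # blocking here is fine: this process
        req = sock.recv_json()       # owns no event loop, unlike vLLM's server
        if req["op"] == "train":
            loss = train_step(model, optimizer, req["batch"])
            sock.send_json({"ok": True, "loss": loss})
        elif req["op"] == "checkpoint":
            sock.send_json({"ok": True})   # checkpoint sync would run here
        elif req["op"] == "status":
            sock.send_json({"ok": True, "idle": True})
        else:
            sock.send_json({"ok": False, "error": "unknown op"})

if __name__ == "__main__":
    main()
```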

Architecture: vLLM API server <-> ZMQ <-> training subprocess
The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and handles training requests without blocking inference.
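
The forwarding side is what keeps the event loop free; a minimal sketch,
assuming simple JSON request/response payloads (the real code is
train_router.py):

```python
# Illustrative sketch -- payload shapes are assumed; the real code is
# train_router.py. The awaits yield to vLLM's event loop, so inference
# keeps flowing while the worker trains.
import zmq
import zmq.asyncio
from fastapi import APIRouter

router = APIRouter()
_ctx = zmq.asyncio.Context()

async def _ask_worker(msg: dict) -> dict:
    # One REQ socket per request keeps ZMQ's strict REQ/REP lockstep simple.
    sock = _ctx.socket(zmq.REQ)
    sock.connect("ipc:///tmp/apollo_training.sock")
    try:
        await sock.send_json(msg)
        return await sock.recv_json()
    finally:
        sock.close()

@router.post("/train")
async def train(body: dict):
    return await _ask_worker({"op": "train", "batch": body})

@router.post("/checkpoint")
async def checkpoint():
    return await _ask_worker({"op": "checkpoint"})

@router.get("/train/status")
async def train_status():
    return await _ask_worker({"op": "status"})
```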

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
ProofOfConcept, 2026-04-16 02:01:59 -04:00, committed by Kent Overstreet
parent 68a2df2185
commit 2c6a5c0f4a
6 changed files with 503 additions and 233 deletions

@@ -26,25 +26,37 @@ The training signal comes from two sources:
 │ └──────────────┬──────────────┬────────────────┘   │
 │                │              │                    │
 │ ┌──────────────▼──┐ ┌─────────▼──────────────┐     │
-│ │ vLLM (inference)│ │ HF model (training)    │     │
-│ │ KV cache ~60GB  │ │ Gradients ~54GB        │     │
-│ │ /completions    │ │ Optimizer state ~10GB  │     │
-│ │ /score          │ │ Views into vLLM weights│     │
-│ │ /train ─────────┼─┼─► Apollo optimizer     │     │
-│ └─────────────────┘ └────────────────────────┘     │
+│ │ vLLM (inference)│ │ Training subprocess    │     │
+│ │ KV cache ~60GB  │ │ HF model wrapper       │     │
+│ │ /completions    │ │ Apollo optimizer ~2.5GB│     │
+│ │ /score          │ │ Checkpoint sync        │     │
+│ └────────┬────────┘ └─────────▲──────────────┘     │
+│          │                    │                    │
+│          │      ZMQ IPC       │                    │
+│          └────────────────────┘                    │
 └────────────────────────────────────────────────────┘
-Single vLLM process serves everything
-No separate daemon - /train is a vLLM route
+Process Architecture:
+┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
+│ vLLM Worker     │  │ vLLM API Server │  │ Training Worker │
+│ (GPU inference) │  │ (HTTP routes)   │  │ (GPU training)  │
+│                 │  │                 │  │                 │
+│ export_hook.py  │  │ /completions    │  │ HF model views  │
+│ exports IPC     │  │ /score          │  │ Apollo optimizer│
+│ handles on load │  │ /train ─────────┼──► ZMQ REP socket  │
+└─────────────────┘  └─────────────────┘  └─────────────────┘
+         │                                         │
+         └──── IPC handles file ───────────────────┘
+               /tmp/vllm_weight_handles.pt
 
 Moria                            B200 (vLLM)
 ┌──────────────────┐            ┌──────────────────┐
 │ Training signal  │    HTTP    │ /completions     │
 │ agent            │───────────>│ /score           │
 │                  │            │ /train           │
-│ Dream loop       │            │                  │
-│ (generates       │            │ Checkpoint sync  │
-│ scenarios)       │            │ (10 min batched) │
+│ Dream loop       │            │ /checkpoint      │
+│ (generates       │            │ /train/status    │
+│ scenarios)       │            │                  │
 └──────────────────┘            └──────────────────┘
 ```
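
The handles file in the diagram above is plain torch.save output. A sketch
of the export side, assuming PyTorch's reduce_tensor CUDA IPC machinery;
the exact serialization in export_hook.py may differ:

```python
# Sketch of the export side (assumed details; see export_hook.py).
import torch
from torch.multiprocessing.reductions import reduce_tensor

def export_handles(model, model_path, out="/tmp/vllm_weight_handles.pt"):
    handles = {}
    for name, param in model.named_parameters():
        # reduce_tensor returns (rebuild_fn, args); args carries the
        # cudaIpcMemHandle that another process can map.
        handles[name] = reduce_tensor(param.detach())
    # model_path rides along in-band so the training worker can find the
    # checkpoint directory without any cross-process RPC.
    handles["__metadata__"] = {"model_path": model_path}
    torch.save(handles, out)
```
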
@@ -213,8 +225,9 @@ a few hundred MB.
 | File | Purpose |
 |------|---------|
-| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by train_router to construct HF model with vLLM weight views. |
-| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync, restored on next /train call. Preserves training continuity across sessions. |
+| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by training_worker to construct HF model with vLLM weight views. Includes metadata (model_path). |
+| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync and on worker shutdown, restored on next training_worker startup. Preserves training continuity across sessions. |
+| `/tmp/apollo_training.sock` | ZMQ IPC socket for communication between API server (/train endpoint) and training_worker subprocess. |
 | `<model_dir>/*.safetensors` | Model weights. Updated in-place by checkpoint_sync. |
 
 ### Moria (client)
@@ -224,12 +237,13 @@ a few hundred MB.
 | `~/.consciousness/cache/trained-responses.json` | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
 | `~/.consciousness/cache/finetune-alternates` | Marker file. If exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |
 
-### In-memory
+### In-memory (training_worker subprocess)
 
 | State | Location | Notes |
 |-------|----------|-------|
-| Apollo optimizer | train_router._optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
-| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
+| Apollo optimizer | TrainingWorker.optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync and on shutdown. |
+| HF model with vLLM views | TrainingWorker.model | Loaded on worker startup from IPC handles. Parameters point to vLLM's GPU memory. |
+| ZMQ socket | TrainingWorker.zmq_socket | REP socket bound to `/tmp/apollo_training.sock`. |
 
 ## Hyperparameters
@@ -248,7 +262,8 @@ a few hundred MB.
 ### Built ✓
 
 - `optimizer.py` — Apollo optimizer (configurable rank)
-- `train_router.py` — /train endpoint, runs in vLLM process
+- `train_router.py` — /train endpoint, forwards to training subprocess via ZMQ
+- `training_worker.py` — training subprocess (HF model, Apollo, checkpoint sync)
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `export_hook.py` — vLLM plugin hook for IPC handle export
 - `checkpoint_sync.py` — mmap + diff checkpoint sync (Python)
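
A sketch of the mmap + diff idea behind checkpoint_sync.py, written against
the documented safetensors layout (u64 little-endian header length, then a
JSON header with per-tensor data_offsets); the actual implementation may
differ:

```python
# Sketch of in-place diff sync against the safetensors on-disk format.
# Assumed to approximate checkpoint_sync.py, not copied from it.
import json
import mmap
import struct

def sync_tensors(path: str, new_bytes: dict[str, bytes]) -> int:
    """Write only the tensors whose raw bytes changed; return count written."""
    written = 0
    with open(path, "r+b") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # u64 LE header size
        header = json.loads(f.read(header_len))          # name -> dtype/shape/offsets
        data_start = 8 + header_len
        with mmap.mmap(f.fileno(), 0) as mm:
            for name, info in header.items():
                if name == "__metadata__" or name not in new_bytes:
                    continue
                begin, end = info["data_offsets"]        # relative to data section
                lo, hi = data_start + begin, data_start + end
                blob = new_bytes[name]
                assert len(blob) == hi - lo, f"size mismatch for {name}"
                if mm[lo:hi] != blob:                    # the "diff" part:
                    mm[lo:hi] = blob                     # touch only changed ranges
                    written += 1
            mm.flush()
    return written
```
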
@@ -267,8 +282,9 @@ training/
 pyproject.toml — package config, vLLM plugin entry point
 apollo_plugin/
   __init__.py — plugin registration
-  export_hook.py — patches vLLM to export IPC handles
-  train_router.py — /train endpoint (FastAPI router)
+  export_hook.py — patches vLLM worker to export IPC handles
+  train_router.py — /train endpoint, forwards to worker via ZMQ
+  training_worker.py — training subprocess (HF model, Apollo, checkpoint)
   optimizer.py — Apollo optimizer
   weight_mapping.py — vLLM ↔ HF weight views
   checkpoint_sync.py — mmap + diff sync to safetensors
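
For completeness, the consumer side of the handles file is roughly the
inverse of the export sketch above (same reduce_tensor assumption; not the
actual training_worker.py code):

```python
# Assumed counterpart to the export sketch above; not training_worker.py.
import torch

def load_views(path="/tmp/vllm_weight_handles.pt"):
    handles = torch.load(path)
    model_path = handles.pop("__metadata__")["model_path"]
    views = {
        name: rebuild_fn(*args)   # tensor aliasing vLLM's GPU memory
        for name, (rebuild_fn, args) in handles.items()
    }
    return model_path, views
```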