training: move to dedicated subprocess with ZMQ communication
- Add training_worker.py: long-lived subprocess that handles GPU training
work, owns HF model wrapper (views into vLLM GPU memory), Apollo
optimizer, and checkpoint sync
- train_router.py: now forwards /train requests via async ZMQ instead of
running training in-process. Adds /checkpoint and /train/status endpoints
- export_hook.py: store model_path in __metadata__ so training worker can
find it without cross-process communication
- This fixes two bugs:
1. Process boundary issue - model_path was set in worker process but
needed in API server process
2. Blocking event loop - training blocked vLLM's async event loop
Architecture: vLLM API server <-> ZMQ <-> training subprocess
The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and handles training requests without blocking inference.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent 68a2df2185
commit 2c6a5c0f4a
6 changed files with 503 additions and 233 deletions
@@ -26,25 +26,37 @@ The training signal comes from two sources:
 │ └──────────────┬──────────────┬────────────────┘ │
 │                │              │                  │
 │ ┌──────────────▼──┐  ┌────────▼────────────────┐ │
-│ │ vLLM (inference)│  │ HF model (training)     │ │
-│ │ KV cache ~60GB  │  │ Gradients ~54GB         │ │
-│ │ /completions    │  │ Optimizer state ~10GB   │ │
-│ │ /score          │  │ Views into vLLM weights │ │
-│ │ /train ─────────┼──┼─► Apollo optimizer      │ │
-│ └─────────────────┘  └─────────────────────────┘ │
+│ │ vLLM (inference)│  │ Training subprocess     │ │
+│ │ KV cache ~60GB  │  │ HF model wrapper        │ │
+│ │ /completions    │  │ Apollo optimizer ~2.5GB │ │
+│ │ /score          │  │ Checkpoint sync         │ │
+│ └────────┬────────┘  └───────────▲─────────────┘ │
+│          │                       │               │
+│          │ ZMQ IPC               │               │
+│          └───────────────────────┘               │
 └──────────────────────────────────────────────────┘

-Single vLLM process serves everything
-No separate daemon - /train is a vLLM route
+Process Architecture:
+
+┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
+│ vLLM Worker     │   │ vLLM API Server │   │ Training Worker │
+│ (GPU inference) │   │ (HTTP routes)   │   │ (GPU training)  │
+│                 │   │                 │   │                 │
+│ export_hook.py  │   │ /completions    │   │ HF model views  │
+│ exports IPC     │   │ /score          │   │ Apollo optimizer│
+│ handles on load │   │ /train ─────────┼──►│ ZMQ REP socket  │
+└─────────────────┘   └─────────────────┘   └─────────────────┘
+         │                                           │
+         └──────────── IPC handles file ─────────────┘
+                /tmp/vllm_weight_handles.pt

 Moria                        B200 (vLLM)
 ┌──────────────────┐        ┌──────────────────┐
 │ Training signal  │  HTTP  │ /completions     │
 │ agent            │───────>│ /score           │
 │                  │        │ /train           │
-│ Dream loop       │        │                  │
-│ (generates       │        │ Checkpoint sync  │
-│ scenarios)       │        │ (10 min batched) │
+│ Dream loop       │        │ /checkpoint      │
+│ (generates       │        │ /train/status    │
+│ scenarios)       │        │                  │
 └──────────────────┘        └──────────────────┘
 ```
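The worker side of the ZMQ link could be sketched roughly as below. The message schema (`cmd`, `samples`) and function names are hypothetical, chosen only to mirror the documented endpoints; the real training_worker.py also owns the HF model views and Apollo optimizer, elided here.

```python
SOCKET = "ipc:///tmp/apollo_training.sock"  # ZMQ IPC endpoint named in this doc

def handle_request(msg: dict) -> dict:
    """Dispatch one request from the API server; pure so it can run offline."""
    cmd = msg.get("cmd")
    if cmd == "train":
        # Real worker: forward/backward through the HF views, then step the
        # Apollo optimizer. Here we only acknowledge the batch.
        return {"ok": True, "trained": len(msg.get("samples", []))}
    if cmd == "status":
        return {"ok": True, "state": "idle"}
    if cmd == "checkpoint":
        return {"ok": True, "synced": True}
    return {"ok": False, "error": f"unknown cmd: {cmd!r}"}

def serve() -> None:
    """Blocking REP loop; runs in the subprocess, never in vLLM's event loop."""
    import zmq  # pyzmq; imported lazily so the sketch needs no GPU-box deps
    sock = zmq.Context.instance().socket(zmq.REP)
    sock.bind(SOCKET)
    while True:
        sock.send_json(handle_request(sock.recv_json()))
```

Because REP sockets strictly alternate recv/send, one in-flight training request at a time is enforced by the socket itself.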
@@ -213,8 +225,9 @@ a few hundred MB.

 | File | Purpose |
 |------|---------|
-| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by train_router to construct HF model with vLLM weight views. |
-| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync, restored on next /train call. Preserves training continuity across sessions. |
+| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by training_worker to construct HF model with vLLM weight views. Includes metadata (model_path). |
+| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync and on worker shutdown, restored on next training_worker startup. Preserves training continuity across sessions. |
+| `/tmp/apollo_training.sock` | ZMQ IPC socket for communication between API server (/train endpoint) and training_worker subprocess. |
 | `<model_dir>/*.safetensors` | Model weights. Updated in-place by checkpoint_sync. |

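The `__metadata__` convention (model_path stored alongside the IPC handles, per the commit message) can be modeled on a plain dict. The real file is written with `torch.save`; both helper names here are illustrative.

```python
def build_handles_payload(ipc_handles: dict, model_path: str) -> dict:
    """Attach model_path under "__metadata__" next to the per-tensor handles."""
    payload = dict(ipc_handles)  # parameter name -> CUDA IPC handle (opaque here)
    payload["__metadata__"] = {"model_path": model_path}
    return payload

def read_model_path(payload: dict) -> str:
    """What the training worker does on startup to locate the checkpoint,
    avoiding any cross-process communication with the API server."""
    return payload["__metadata__"]["model_path"]
```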
### Moria (client)
@@ -224,12 +237,13 @@ a few hundred MB.
 | `~/.consciousness/cache/trained-responses.json` | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
 | `~/.consciousness/cache/finetune-alternates` | Marker file. If it exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |

-### In-memory
+### In-memory (training_worker subprocess)

 | State | Location | Notes |
 |-------|----------|-------|
-| Apollo optimizer | train_router._optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
-| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
+| Apollo optimizer | TrainingWorker.optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync and on shutdown. |
+| HF model with vLLM views | TrainingWorker.model | Loaded on worker startup from IPC handles. Parameters point to vLLM's GPU memory. |
+| ZMQ socket | TrainingWorker.zmq_socket | REP socket bound to `/tmp/apollo_training.sock`. |

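A minimal skeleton matching the state table above; the attribute names come from the table, while the injected loader callables are stand-ins (hypothetical) for rebuilding tensors from IPC handles, restoring optimizer state, and binding the REP socket, so the sketch runs without a GPU or pyzmq.

```python
class TrainingWorker:
    """Holds the three pieces of in-memory state; all heavy lifting elided."""
    OPT_STATE_PATH = "/tmp/apollo_optimizer_state.pt"
    ZMQ_ENDPOINT = "ipc:///tmp/apollo_training.sock"

    def __init__(self):
        self.model = None       # HF wrapper whose params view vLLM GPU memory
        self.optimizer = None   # Apollo optimizer, ~2.5GB at rank 64
        self.zmq_socket = None  # REP socket bound to ZMQ_ENDPOINT

    def startup(self, load_model, load_optimizer, bind_socket):
        # All state is built once at worker startup, not lazily on /train.
        self.model = load_model()
        self.optimizer = load_optimizer(self.OPT_STATE_PATH)
        self.zmq_socket = bind_socket(self.ZMQ_ENDPOINT)
        return self
```
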
## Hyperparameters
@@ -248,7 +262,8 @@

 ### Built ✓
 - `optimizer.py` — Apollo optimizer (configurable rank)
-- `train_router.py` — /train endpoint, runs in vLLM process
+- `train_router.py` — /train endpoint, forwards to training subprocess via ZMQ
+- `training_worker.py` — training subprocess (HF model, Apollo, checkpoint sync)
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `export_hook.py` — vLLM plugin hook for IPC handle export
 - `checkpoint_sync.py` — mmap + diff checkpoint sync (Python)

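The "mmap + diff" idea behind checkpoint_sync.py can be sketched on plain byte buffers: compare the live weight bytes against the on-disk copy and patch only the windows that changed. The chunk size and function names are illustrative, and a `bytearray` stands in for the mmap of the safetensors file.

```python
CHUNK = 4096  # bytes per comparison window (illustrative)

def changed_ranges(old: bytes, new: bytes, chunk: int = CHUNK):
    """Yield (offset, data) for each chunk of `new` that differs from `old`."""
    assert len(old) == len(new)  # in-place update keeps the file size fixed
    for off in range(0, len(new), chunk):
        if new[off:off + chunk] != old[off:off + chunk]:
            yield off, new[off:off + chunk]

def patch_in_place(buf: bytearray, new: bytes, chunk: int = CHUNK) -> int:
    """Apply only the differing chunks to `buf` (stand-in for an mmap)."""
    written = 0
    for off, data in changed_ranges(bytes(buf), new, chunk):
        buf[off:off + len(data)] = data
        written += 1
    return written  # number of chunks actually touched
```

When only a small fraction of weights changed since the last sync, this writes a few hundred MB instead of the full checkpoint.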
|
@ -267,8 +282,9 @@ training/
|
|||
pyproject.toml — package config, vLLM plugin entry point
|
||||
apollo_plugin/
|
||||
__init__.py — plugin registration
|
||||
export_hook.py — patches vLLM to export IPC handles
|
||||
train_router.py — /train endpoint (FastAPI router)
|
||||
export_hook.py — patches vLLM worker to export IPC handles
|
||||
train_router.py — /train endpoint, forwards to worker via ZMQ
|
||||
training_worker.py — training subprocess (HF model, Apollo, checkpoint)
|
||||
optimizer.py — Apollo optimizer
|
||||
weight_mapping.py — vLLM ↔ HF weight views
|
||||
checkpoint_sync.py — mmap + diff sync to safetensors
|
||||
|
|
|
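On the API-server side, train_router.py's forwarding could look roughly like this, assuming pyzmq's asyncio support; the envelope builder and its field names are made up for illustration.

```python
def make_train_request(samples: list, lr: float = 1e-5) -> dict:
    """Build the JSON envelope sent to the worker (hypothetical schema)."""
    return {"cmd": "train", "samples": samples, "lr": lr}

async def forward_train(request: dict,
                        endpoint: str = "ipc:///tmp/apollo_training.sock") -> dict:
    """Forward one /train request over ZMQ without blocking the event loop."""
    import zmq
    import zmq.asyncio  # lazy import: the sketch loads without pyzmq installed
    sock = zmq.asyncio.Context.instance().socket(zmq.REQ)  # pairs with worker's REP
    sock.connect(endpoint)
    try:
        await sock.send_json(request)   # awaits instead of blocking vLLM
        return await sock.recv_json()   # worker's reply
    finally:
        sock.close()
```

Since training now happens in the subprocess, /completions and /score keep serving while a /train request is in flight.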