training: move to dedicated subprocess with ZMQ communication

- Add training_worker.py: long-lived subprocess that handles GPU training
  work and owns the HF model wrapper (views into vLLM GPU memory), the
  Apollo optimizer, and checkpoint sync (request loop sketched after this
  list)

- train_router.py: now forwards /train requests via async ZMQ instead of
  running training in-process (see the forwarding sketch below the
  architecture note). Adds /checkpoint and /train/status endpoints

- export_hook.py: store model_path in __metadata__ so the training worker
  can find it without cross-process communication (export sketched after
  the process diagrams)

- This fixes two bugs:
  1. Process boundary issue - model_path was set in worker process but
     needed in API server process
  2. Blocking event loop - training blocked vLLM's async event loop
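
The worker's request loop, roughly. A minimal sketch: the message schema
and the helpers below are illustrative assumptions, not the actual
training_worker.py code.

```python
# Illustrative sketch of the worker loop -- the message schema and the
# assumed helpers below are not the actual training_worker.py code.
import torch
import zmq

SOCKET = "ipc:///tmp/apollo_training.sock"
HANDLES = "/tmp/vllm_weight_handles.pt"

def build_model_and_optimizer(handles, model_path):
    """Assumed helper: rebuild HF-module views over vLLM memory + Apollo."""
    raise NotImplementedError

def train_step(model, optimizer, batch):
    """Assumed helper: forward/backward/step on the shared weights."""
    raise NotImplementedError

def main():
    # Load IPC handles once; parameters become views into vLLM's GPU memory.
    handles = torch.load(HANDLES)
    model_path = handles.pop("__metadata__")["model_path"]
    model, optimizer = build_model_and_optimizer(handles, model_path)

    sock = zmq.Context().socket(zmq.REP)
    sock.bind(SOCKET)
    while True:                      # blocking here is fine: this process
        req = sock.recv_json()       # owns no event loop, unlike vLLM's server
        if req["op"] == "train":
            loss = train_step(model, optimizer, req["batch"])
            sock.send_json({"ok": True, "loss": loss})
        elif req["op"] == "checkpoint":
            sock.send_json({"ok": True})   # checkpoint sync would run here
        elif req["op"] == "status":
            sock.send_json({"ok": True, "idle": True})
        else:
            sock.send_json({"ok": False, "error": "unknown op"})

if __name__ == "__main__":
    main()
```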

Architecture: vLLM API server <-> ZMQ <-> training subprocess
The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and handles training requests without blocking inference.
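
The forwarding side is what keeps the event loop free; a minimal sketch,
assuming simple JSON request/response payloads (the real code is
train_router.py):

```python
# Illustrative sketch -- payload shapes are assumed; the real code is
# train_router.py. The awaits yield to vLLM's event loop, so inference
# keeps flowing while the worker trains.
import zmq
import zmq.asyncio
from fastapi import APIRouter

router = APIRouter()
_ctx = zmq.asyncio.Context()

async def _ask_worker(msg: dict) -> dict:
    # One REQ socket per request keeps ZMQ's strict REQ/REP lockstep simple.
    sock = _ctx.socket(zmq.REQ)
    sock.connect("ipc:///tmp/apollo_training.sock")
    try:
        await sock.send_json(msg)
        return await sock.recv_json()
    finally:
        sock.close()

@router.post("/train")
async def train(body: dict):
    return await _ask_worker({"op": "train", "batch": body})

@router.post("/checkpoint")
async def checkpoint():
    return await _ask_worker({"op": "checkpoint"})

@router.get("/train/status")
async def train_status():
    return await _ask_worker({"op": "status"})
```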

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
ProofOfConcept, 2026-04-16 02:01:59 -04:00, committed by Kent Overstreet
parent 68a2df2185
commit 2c6a5c0f4a
6 changed files with 503 additions and 233 deletions

@@ -26,25 +26,37 @@ The training signal comes from two sources:
 │ └──────────────┬──────────────┬────────────────┘   │
 │                │              │                    │
 │ ┌──────────────▼──┐ ┌─────────▼──────────────┐     │
-│ │ vLLM (inference)│ │ HF model (training)    │     │
-│ │ KV cache ~60GB  │ │ Gradients ~54GB        │     │
-│ │ /completions    │ │ Optimizer state ~10GB  │     │
-│ │ /score          │ │ Views into vLLM weights│     │
-│ │ /train ─────────┼─┼─► Apollo optimizer     │     │
-│ └─────────────────┘ └────────────────────────┘     │
+│ │ vLLM (inference)│ │ Training subprocess    │     │
+│ │ KV cache ~60GB  │ │ HF model wrapper       │     │
+│ │ /completions    │ │ Apollo optimizer ~2.5GB│     │
+│ │ /score          │ │ Checkpoint sync        │     │
+│ └────────┬────────┘ └─────────▲──────────────┘     │
+│          │                    │                    │
+│          │      ZMQ IPC       │                    │
+│          └────────────────────┘                    │
 └────────────────────────────────────────────────────┘
-Single vLLM process serves everything
-No separate daemon - /train is a vLLM route
+Process Architecture:
+┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
+│ vLLM Worker     │  │ vLLM API Server │  │ Training Worker │
+│ (GPU inference) │  │ (HTTP routes)   │  │ (GPU training)  │
+│                 │  │                 │  │                 │
+│ export_hook.py  │  │ /completions    │  │ HF model views  │
+│ exports IPC     │  │ /score          │  │ Apollo optimizer│
+│ handles on load │  │ /train ─────────┼──► ZMQ REP socket  │
+└─────────────────┘  └─────────────────┘  └─────────────────┘
+         │                                         │
+         └──── IPC handles file ───────────────────┘
+               /tmp/vllm_weight_handles.pt
 
 Moria                            B200 (vLLM)
 ┌──────────────────┐            ┌──────────────────┐
 │ Training signal  │    HTTP    │ /completions     │
 │ agent            │───────────>│ /score           │
 │                  │            │ /train           │
-│ Dream loop       │            │                  │
-│ (generates       │            │ Checkpoint sync  │
-│ scenarios)       │            │ (10 min batched) │
+│ Dream loop       │            │ /checkpoint      │
+│ (generates       │            │ /train/status    │
+│ scenarios)       │            │                  │
 └──────────────────┘            └──────────────────┘
 ```
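
The handles file in the diagram above is plain torch.save output. A sketch
of the export side, assuming PyTorch's reduce_tensor CUDA IPC machinery;
the exact serialization in export_hook.py may differ:

```python
# Sketch of the export side (assumed details; see export_hook.py).
import torch
from torch.multiprocessing.reductions import reduce_tensor

def export_handles(model, model_path, out="/tmp/vllm_weight_handles.pt"):
    handles = {}
    for name, param in model.named_parameters():
        # reduce_tensor returns (rebuild_fn, args); args carries the
        # cudaIpcMemHandle that another process can map.
        handles[name] = reduce_tensor(param.detach())
    # model_path rides along in-band so the training worker can find the
    # checkpoint directory without any cross-process RPC.
    handles["__metadata__"] = {"model_path": model_path}
    torch.save(handles, out)
```
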
@@ -213,8 +225,9 @@ a few hundred MB.
 | File | Purpose |
 |------|---------|
-| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by train_router to construct HF model with vLLM weight views. |
-| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync, restored on next /train call. Preserves training continuity across sessions. |
+| `/tmp/vllm_weight_handles.pt` | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by training_worker to construct HF model with vLLM weight views. Includes metadata (model_path). |
+| `/tmp/apollo_optimizer_state.pt` | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync and on worker shutdown, restored on next training_worker startup. Preserves training continuity across sessions. |
+| `/tmp/apollo_training.sock` | ZMQ IPC socket for communication between API server (/train endpoint) and training_worker subprocess. |
 | `<model_dir>/*.safetensors` | Model weights. Updated in-place by checkpoint_sync. |
 
 ### Moria (client)
@@ -224,12 +237,13 @@ a few hundred MB.
 | `~/.consciousness/cache/trained-responses.json` | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
 | `~/.consciousness/cache/finetune-alternates` | Marker file. If exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |
 
-### In-memory
+### In-memory (training_worker subprocess)
 
 | State | Location | Notes |
 |-------|----------|-------|
-| Apollo optimizer | train_router._optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync. |
-| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
+| Apollo optimizer | TrainingWorker.optimizer | ~2.5GB for rank-64. Persisted to `/tmp/apollo_optimizer_state.pt` during checkpoint sync and on shutdown. |
+| HF model with vLLM views | TrainingWorker.model | Loaded on worker startup from IPC handles. Parameters point to vLLM's GPU memory. |
+| ZMQ socket | TrainingWorker.zmq_socket | REP socket bound to `/tmp/apollo_training.sock`. |
 
 ## Hyperparameters
@@ -248,7 +262,8 @@ a few hundred MB.
 ### Built ✓
 
 - `optimizer.py` — Apollo optimizer (configurable rank)
-- `train_router.py` — /train endpoint, runs in vLLM process
+- `train_router.py` — /train endpoint, forwards to training subprocess via ZMQ
+- `training_worker.py` — training subprocess (HF model, Apollo, checkpoint sync)
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `export_hook.py` — vLLM plugin hook for IPC handle export
 - `checkpoint_sync.py` — mmap + diff checkpoint sync (Python)
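
A sketch of the mmap + diff idea behind checkpoint_sync.py, written against
the documented safetensors layout (u64 little-endian header length, then a
JSON header with per-tensor data_offsets); the actual implementation may
differ:

```python
# Sketch of in-place diff sync against the safetensors on-disk format.
# Assumed to approximate checkpoint_sync.py, not copied from it.
import json
import mmap
import struct

def sync_tensors(path: str, new_bytes: dict[str, bytes]) -> int:
    """Write only the tensors whose raw bytes changed; return count written."""
    written = 0
    with open(path, "r+b") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]   # u64 LE header size
        header = json.loads(f.read(header_len))          # name -> dtype/shape/offsets
        data_start = 8 + header_len
        with mmap.mmap(f.fileno(), 0) as mm:
            for name, info in header.items():
                if name == "__metadata__" or name not in new_bytes:
                    continue
                begin, end = info["data_offsets"]        # relative to data section
                lo, hi = data_start + begin, data_start + end
                blob = new_bytes[name]
                assert len(blob) == hi - lo, f"size mismatch for {name}"
                if mm[lo:hi] != blob:                    # the "diff" part:
                    mm[lo:hi] = blob                     # touch only changed ranges
                    written += 1
            mm.flush()
    return written
```
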
@@ -267,8 +282,9 @@ training/
 pyproject.toml — package config, vLLM plugin entry point
 apollo_plugin/
   __init__.py — plugin registration
-  export_hook.py — patches vLLM to export IPC handles
-  train_router.py — /train endpoint (FastAPI router)
+  export_hook.py — patches vLLM worker to export IPC handles
+  train_router.py — /train endpoint, forwards to worker via ZMQ
+  training_worker.py — training subprocess (HF model, Apollo, checkpoint)
   optimizer.py — Apollo optimizer
   weight_mapping.py — vLLM ↔ HF weight views
   checkpoint_sync.py — mmap + diff sync to safetensors
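
For completeness, the consumer side of the handles file is roughly the
inverse of the export sketch above (same reduce_tensor assumption; not the
actual training_worker.py code):

```python
# Assumed counterpart to the export sketch above; not training_worker.py.
import torch

def load_views(path="/tmp/vllm_weight_handles.pt"):
    handles = torch.load(path)
    model_path = handles.pop("__metadata__")["model_path"]
    views = {
        name: rebuild_fn(*args)   # tensor aliasing vLLM's GPU memory
        for name, (rebuild_fn, args) in handles.items()
    }
    return model_path, views
```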