training: move to dedicated subprocess with ZMQ communication

- Add training_worker.py: long-lived subprocess that handles GPU training
  work, owns HF model wrapper (views into vLLM GPU memory), Apollo
  optimizer, and checkpoint sync
- train_router.py: now forwards /train requests via async ZMQ instead of
  running training in-process. Adds /checkpoint and /train/status endpoints
- export_hook.py: store model_path in __metadata__ so training worker can
  find it without cross-process communication

This fixes two bugs:
1. Process boundary issue - model_path was set in worker process but
   needed in API server process
2. Blocking event loop - training blocked vLLM's async event loop

Architecture: vLLM API server <-> ZMQ <-> training subprocess

The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and handles training requests without blocking inference.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
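
The request flow described above (API server forwards a request to a long-lived subprocess and awaits the reply without blocking its event loop) can be sketched roughly as follows. This is not the code from this commit. To stay runnable with the standard library only, the ZMQ socket is swapped for JSON lines over the subprocess's stdin/stdout pipes. `WorkerClient`, the `op` field, and the counting "training step" are hypothetical names made up for illustration.

```python
import asyncio
import json
import sys

# Hypothetical worker body: one JSON request per input line, one JSON reply per line.
# In the real design this would be the training worker holding the model and optimizer.
WORKER_SRC = r"""
import json, sys
steps = 0
for line in sys.stdin:
    req = json.loads(line)
    if req["op"] == "train":
        steps += len(req["samples"])  # placeholder for the real GPU training step
        reply = {"status": "ok", "steps": steps}
    elif req["op"] == "status":
        reply = {"status": "idle", "steps": steps}
    else:
        reply = {"status": "error", "reason": "unknown op"}
    sys.stdout.write(json.dumps(reply) + "\n")
    sys.stdout.flush()
"""


class WorkerClient:
    """Async request/reply client for a long-lived worker subprocess (illustrative)."""

    def __init__(self):
        self.proc = None
        self.lock = asyncio.Lock()  # keep request/reply pairs strictly ordered

    async def start(self):
        # The worker is started once and reused for every request.
        self.proc = await asyncio.create_subprocess_exec(
            sys.executable, "-u", "-c", WORKER_SRC,
            stdin=asyncio.subprocess.PIPE,
            stdout=asyncio.subprocess.PIPE,
        )

    async def request(self, msg):
        # Awaiting the reply yields to the event loop, so other coroutines
        # (inference handlers in the real server) keep running meanwhile.
        async with self.lock:
            self.proc.stdin.write((json.dumps(msg) + "\n").encode())
            await self.proc.stdin.drain()
            line = await self.proc.stdout.readline()
        return json.loads(line)

    async def stop(self):
        self.proc.stdin.close()  # EOF ends the worker's read loop
        await self.proc.wait()


async def demo():
    client = WorkerClient()
    await client.start()
    try:
        first = await client.request({"op": "train", "samples": ["a", "b"]})
        status = await client.request({"op": "status"})
    finally:
        await client.stop()
    return first, status


first, status = asyncio.run(demo())
print(first, status)
```

With ZMQ in place of the pipes, the same shape would presumably use an async socket on the server side and a blocking receive loop in the worker; the point of the sketch is only that the server awaits a reply instead of running the training step itself.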
2026-04-16 02:01:59 -04:00 · 2c6a5c0f4a
commit 2c6a5c0f4a
parent 68a2df2185
6 changed files with 503 additions and 233 deletions
--- a/training/pyproject.toml
+++ b/training/pyproject.toml
@@ -11,6 +11,7 @@ dependencies = [
    "torch",
    "aiohttp",
    "safetensors",
+    "pyzmq",
 ]

 [project.optional-dependencies]
@@ -21,6 +22,7 @@ apollo = "apollo_plugin:register"

 [project.scripts]
 apollo-checkpoint = "apollo_plugin.checkpoint_sync:main"
+apollo-worker = "apollo_plugin.training_worker:main"

 [tool.setuptools.packages.find]
 where = ["."]