training: move to dedicated subprocess with ZMQ communication

- Add training_worker.py: a long-lived subprocess that handles GPU training
  work and owns the HF model wrapper (views into vLLM GPU memory), the
  Apollo optimizer, and checkpoint sync

- train_router.py: now forwards /train requests via async ZMQ instead of
  running training in-process. Adds /checkpoint and /train/status endpoints

- export_hook.py: store model_path in __metadata__ so the training worker
  can find it without cross-process communication

- This fixes two bugs:
  1. Process boundary issue: model_path was set in the worker process but
     was needed in the API server process
  2. Blocking event loop: training previously ran on, and blocked, vLLM's
     async event loop (see the router sketch after this list)
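
A minimal sketch of the router side, assuming a pyzmq asyncio REQ socket and
a simple JSON message schema. The socket address, message fields, and the
forward() helper are illustrative assumptions, not the actual train_router.py
code; only the /train, /checkpoint, and /train/status endpoints come from
this commit:

    import asyncio
    import zmq
    import zmq.asyncio
    from fastapi import FastAPI

    ZMQ_ADDR = "ipc:///tmp/training_worker.sock"  # assumed endpoint

    app = FastAPI()
    ctx = zmq.asyncio.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect(ZMQ_ADDR)
    lock = asyncio.Lock()  # REQ sockets need strict send/recv pairing

    async def forward(request: dict) -> dict:
        # Awaiting the ZMQ round trip yields control back to the event loop,
        # so inference requests keep flowing while the subprocess works.
        async with lock:
            await sock.send_json(request)
            return await sock.recv_json()

    @app.post("/train")
    async def train(body: dict):
        return await forward({"cmd": "train", **body})

    @app.post("/checkpoint")
    async def checkpoint():
        return await forward({"cmd": "checkpoint"})

    @app.get("/train/status")
    async def train_status():
        return await forward({"cmd": "status"})

In practice the worker presumably acknowledges /train quickly and reports
progress through /train/status rather than holding the reply open for the
whole training run; the forwarding wiring is the same either way.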

Architecture: vLLM API server <-> ZMQ <-> training subprocess
The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and handles training requests without blocking inference.
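
On the other side, the worker's request loop could look roughly like this.
load_vllm_weights is the function shown in the diff below; the helpers,
socket address, and message schema are assumed names for illustration only:

    import zmq

    ZMQ_ADDR = "ipc:///tmp/training_worker.sock"  # assumed endpoint

    def main(handles_path: str) -> None:
        # Rebuild views into vLLM's GPU memory once, at startup.
        weights = load_vllm_weights(handles_path)              # shown in the diff
        model, optimizer = build_model_and_optimizer(weights)  # assumed helper

        ctx = zmq.Context()
        sock = ctx.socket(zmq.REP)
        sock.bind(ZMQ_ADDR)

        while True:
            req = sock.recv_json()
            if req["cmd"] == "train":
                reply = train_step(model, optimizer, req)      # assumed helper
            elif req["cmd"] == "checkpoint":
                reply = save_checkpoint(model, req)            # assumed helper
            elif req["cmd"] == "status":
                reply = {"status": "idle"}                     # placeholder
            else:
                reply = {"error": f"unknown cmd {req['cmd']!r}"}
            sock.send_json(reply)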

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
ProofOfConcept 2026-04-16 02:01:59 -04:00 committed by Kent Overstreet
parent 68a2df2185
commit 2c6a5c0f4a
6 changed files with 503 additions and 233 deletions

@@ -260,6 +260,9 @@ def load_vllm_weights(handles_path: str) -> Dict[str, torch.Tensor]:
    """
    handles = torch.load(handles_path, weights_only=False)
    # Skip metadata entry
    handles.pop('__metadata__', None)
    weights = {}
    for name, info in handles.items():
        func, args = info['handle']
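
The hunk is cut off here. Assuming each stored handle is the (rebuild
function, args) pair produced by PyTorch's CUDA IPC tensor sharing, the rest
of the loop would plausibly read:

        # Assumed continuation, not part of the excerpt: replay the CUDA IPC
        # handle to get a tensor that aliases vLLM's GPU memory.
        weights[name] = func(*args)
    return weights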