training: integrate /train into vLLM process (no separate daemon)

Remove standalone worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
  (see the first sketch below)
- /train is served on the same port as /completions and /score
- Lazy-loads the HF model, as views onto vLLM's weights, on the first
  request
- Hogwild!-style training: no pause, weights updated in place (see the
  second sketch below)

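Roughly, the build_app() patch works like this. This is a minimal
sketch, assuming the monkey-patching approach the bullets describe:
build_app lives in vllm.entrypoints.openai.api_server, while
TrainRequest and the lazy-init stub are illustrative stand-ins, not
Apollo's actual code.

    # Sketch of train_router.py. Only _patch_api_server and build_app
    # are named by this commit; the schema and stub are assumptions.
    from fastapi import APIRouter
    from pydantic import BaseModel

    router = APIRouter()
    _trainer = None  # constructed lazily on the first /train request


    class TrainRequest(BaseModel):  # hypothetical request schema
        prompt: str
        completion: str


    @router.post("/train")
    async def train(req: TrainRequest):
        global _trainer
        if _trainer is None:
            # First request: the real code builds an HF model whose
            # parameters are views onto vLLM's live weight tensors.
            _trainer = object()  # stand-in for the lazy-loaded trainer
        return {"status": "ok"}


    def _patch_api_server():
        """Wrap build_app() so the app vLLM builds also serves /train."""
        import vllm.entrypoints.openai.api_server as api_server

        orig_build_app = api_server.build_app

        def build_app(*args, **kwargs):
            app = orig_build_app(*args, **kwargs)
            app.include_router(router)  # same port as /v1/completions
            return app

        api_server.build_app = build_app
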
The previous architecture ran a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. That design was
wrong: training should run in-process, sharing GPU memory directly.
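
The in-place update the last bullet refers to is the Hogwild! pattern
(lock-free SGD). A sketch, under the assumption that the HF module's
parameters alias vLLM's weight tensors; the function and argument
names are mine, not the commit's:

    import torch

    def hogwild_step(hf_model, optimizer, input_ids, labels):
        # hf_model's parameters are views onto vLLM's live weights, so
        # optimizer.step() mutates the tensors the engine is serving
        # from: no pause/resume handshake, no locks, updates land in
        # place while decoding continues.
        out = hf_model(input_ids=input_ids, labels=labels)
        optimizer.zero_grad(set_to_none=True)
        out.loss.backward()
        optimizer.step()
        return out.loss.item()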

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-16 00:48:05 -04:00
parent 2f08149fab
commit 7e7e9a4b69
6 changed files with 320 additions and 542 deletions

@@ -1,8 +1,8 @@
 """Apollo training plugin for vLLM.
 
 Enables continuous fine-tuning alongside live inference by:
-1. Exporting CUDA IPC handles for weight sharing
-2. Providing a training worker daemon (/train endpoint)
+1. Exporting CUDA IPC handles for weight sharing (export_hook)
+2. Adding /train endpoint to vLLM's HTTP server (train_router)
 3. Block-level checkpoint sync to safetensors files
 
 Install: pip install -e /path/to/training
@@ -10,8 +10,10 @@ Then vLLM auto-loads via entry point.
 """
 
 from .export_hook import _patch_model_runner
+from .train_router import _patch_api_server
 
 
 def register():
     """Called by vLLM's plugin loader on startup."""
     _patch_model_runner()
+    _patch_api_server()
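
For reference, the "auto-loads via entry point" step would be wired up
in the package metadata. A setup.py sketch: the distribution and
package names are made up, and vllm.general_plugins is, to my reading,
the entry-point group vLLM's plugin loader scans at startup.

    from setuptools import setup, find_packages

    setup(
        name="apollo-training",  # hypothetical distribution name
        version="0.1",
        packages=find_packages(),
        entry_points={
            # vLLM scans this group on startup and calls the target,
            # i.e. the register() shown in the diff above.
            "vllm.general_plugins": [
                "apollo_training = apollo_training:register",
            ],
        },
    )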