training: integrate /train into vLLM process (no separate daemon)

Remove standalone worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
  (see the first sketch below)
- /train is served on the same port as /completions and /score
- Lazy-loads the HF model, as views onto vLLM's weights, on the first
  request
- Hogwild!-style training: no pause, weights updated in place (see the
  second sketch below)

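Roughly, the build_app() patch works like this. This is a minimal
sketch, assuming the monkey-patching approach the bullets describe:
build_app lives in vllm.entrypoints.openai.api_server, while
TrainRequest and the lazy-init stub are illustrative stand-ins, not
Apollo's actual code.

    # Sketch of train_router.py. Only _patch_api_server and build_app
    # are named by this commit; the schema and stub are assumptions.
    from fastapi import APIRouter
    from pydantic import BaseModel

    router = APIRouter()
    _trainer = None  # constructed lazily on the first /train request


    class TrainRequest(BaseModel):  # hypothetical request schema
        prompt: str
        completion: str


    @router.post("/train")
    async def train(req: TrainRequest):
        global _trainer
        if _trainer is None:
            # First request: the real code builds an HF model whose
            # parameters are views onto vLLM's live weight tensors.
            _trainer = object()  # stand-in for the lazy-loaded trainer
        return {"status": "ok"}


    def _patch_api_server():
        """Wrap build_app() so the app vLLM builds also serves /train."""
        import vllm.entrypoints.openai.api_server as api_server

        orig_build_app = api_server.build_app

        def build_app(*args, **kwargs):
            app = orig_build_app(*args, **kwargs)
            app.include_router(router)  # same port as /v1/completions
            return app

        api_server.build_app = build_app
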
The previous architecture ran a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. That design was
wrong: training should run in-process, sharing GPU memory directly.
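
The in-place update the last bullet refers to is the Hogwild! pattern
(lock-free SGD). A sketch, under the assumption that the HF module's
parameters alias vLLM's weight tensors; the function and argument
names are mine, not the commit's:

    import torch

    def hogwild_step(hf_model, optimizer, input_ids, labels):
        # hf_model's parameters are views onto vLLM's live weights, so
        # optimizer.step() mutates the tensors the engine is serving
        # from: no pause/resume handshake, no locks, updates land in
        # place while decoding continues.
        out = hf_model(input_ids=input_ids, labels=labels)
        optimizer.zero_grad(set_to_none=True)
        out.loss.backward()
        optimizer.step()
        return out.loss.item()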

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent Overstreet 2026-04-16 00:48:05 -04:00
parent 2f08149fab
commit 7e7e9a4b69
6 changed files with 320 additions and 542 deletions

@@ -1,8 +1,8 @@
 """Apollo training plugin for vLLM.
 
 Enables continuous fine-tuning alongside live inference by:
-1. Exporting CUDA IPC handles for weight sharing
-2. Providing a training worker daemon (/train endpoint)
+1. Exporting CUDA IPC handles for weight sharing (export_hook)
+2. Adding /train endpoint to vLLM's HTTP server (train_router)
 3. Block-level checkpoint sync to safetensors files
 
 Install: pip install -e /path/to/training
@@ -10,8 +10,10 @@ Then vLLM auto-loads via entry point.
 """
 
 from .export_hook import _patch_model_runner
+from .train_router import _patch_api_server
 
 
 def register():
     """Called by vLLM's plugin loader on startup."""
     _patch_model_runner()
+    _patch_api_server()
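
For reference, the "auto-loads via entry point" step would be wired up
in the package metadata. A setup.py sketch: the distribution and
package names are made up, and vllm.general_plugins is, to my reading,
the entry-point group vLLM's plugin loader scans at startup.

    from setuptools import setup, find_packages

    setup(
        name="apollo-training",  # hypothetical distribution name
        version="0.1",
        packages=find_packages(),
        entry_points={
            # vLLM scans this group on startup and calls the target,
            # i.e. the register() shown in the diff above.
            "vllm.general_plugins": [
                "apollo_training = apollo_training:register",
            ],
        },
    )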