Remove standalone worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
- /train served on the same port as /completions and /score
- Lazy-loads the HF model with vLLM weight views on the first request
- HOGWILD training: no pause; weights are updated in place

The previous architecture had a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. This was wrong:
training should run in-process, sharing GPU memory directly.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
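The "HOGWILD training: no pause, weights updated in-place" point can be sketched in plain Python. The list below stands in for a shared GPU tensor, and `train_step`/`infer` are illustrative names, not part of the actual plugin:

```python
# HOGWILD idea: training mutates the same weight buffer the inference path
# reads, with no pause/resume handshake. A plain list stands in for a
# shared GPU tensor; names here are hypothetical.
weights = [0.0] * 4  # shared "model weights"

def train_step(grads, lr=0.1):
    # In-place SGD step: no copy, no swap, so readers see updates immediately.
    for i, g in enumerate(grads):
        weights[i] -= lr * g

def infer(x):
    # Inference reads whatever weights are current (lock-free, HOGWILD-style).
    return sum(w * xi for w, xi in zip(weights, x))
```

The old daemon had to pause inference, push weights, and resume; here an optimizer step and a forward pass simply interleave on the same buffer.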
19 lines · 571 B · Python
"""Apollo training plugin for vLLM.

Enables continuous fine-tuning alongside live inference by:

1. Exporting CUDA IPC handles for weight sharing (export_hook)
2. Adding /train endpoint to vLLM's HTTP server (train_router)
3. Block-level checkpoint sync to safetensors files

Install: pip install -e /path/to/training
Then vLLM auto-loads via entry point.
"""

from .export_hook import _patch_model_runner
from .train_router import _patch_api_server


def register():
    """Called by vLLM's plugin loader on startup."""
    _patch_model_runner()
    _patch_api_server()
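The "auto-loads via entry point" line relies on vLLM's plugin discovery, which scans a setuptools entry-point group at startup and calls each plugin's registered function. A plausible pyproject.toml fragment is sketched below; the group name `vllm.general_plugins` is taken from vLLM's plugin-system documentation, while the package and entry names are placeholders:

```toml
[project]
name = "apollo-training-plugin"
version = "0.1.0"

# vLLM scans this entry-point group on startup and calls register().
[project.entry-points."vllm.general_plugins"]
apollo = "apollo_plugin:register"
```

With this in place, `pip install -e /path/to/training` is enough; no launch flag or code change in vLLM itself is needed.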
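The docstring's first point, exporting CUDA IPC handles so another component can map the same GPU allocation, can be mirrored with a stdlib sketch: POSIX shared memory plays the role of the GPU allocation, and the segment name plays the role of the IPC handle. All function names below are illustrative, not the plugin's API:

```python
from multiprocessing import shared_memory
import struct

# Sketch of the weight-sharing idea: the "server" exports a handle, the
# "trainer" attaches by handle, and writes through one mapping are visible
# through the other in place. Real code would use CUDA IPC, not POSIX shm.

def export_weights(values):
    """'Server' side: place weights in shared memory, return (buffer, handle)."""
    buf = shared_memory.SharedMemory(create=True, size=8 * len(values))
    for i, v in enumerate(values):
        struct.pack_into("d", buf.buf, 8 * i, v)
    return buf, buf.name  # the segment name acts as the IPC handle

def attach_and_update(handle, index, new_value):
    """'Trainer' side: attach by handle and update one weight in place."""
    buf = shared_memory.SharedMemory(name=handle)
    struct.pack_into("d", buf.buf, 8 * index, new_value)
    buf.close()

def read_weight(buf, index):
    return struct.unpack_from("d", buf.buf, 8 * index)[0]
```

The point of the handle is that no weight data is copied between the two sides; both map the same allocation, which is what lets training update weights the server is actively reading.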