training: integrate /train into vLLM process (no separate daemon)
Remove standalone worker.py daemon. Training now runs inside vLLM:

- train_router.py: FastAPI router patched into vLLM's build_app()
- /train served on same port as /completions, /score
- Lazy-loads HF model with vLLM weight views on first request
- HOGWILD training: no pause, weights updated in-place

The previous architecture had a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. This was wrong -
training should run in-process, sharing GPU memory directly.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
parent 2f08149fab
commit 7e7e9a4b69

6 changed files with 320 additions and 542 deletions
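For context on the first bullet in the message, a minimal sketch of the build_app() patch pattern, assuming vLLM's build_app() lives in vllm.entrypoints.openai.api_server as in recent releases; the /train request shape and run_train_step() are illustrative, not this repo's code:

# Sketch only: the request schema and run_train_step() are assumptions;
# the point is the build_app() monkey-patch that mounts the router.
from fastapi import APIRouter, Request

router = APIRouter()


@router.post("/train")
async def train(request: Request):
    batch = await request.json()
    # HOGWILD: update shared weights in place; inference is never paused.
    loss = run_train_step(batch)  # hypothetical helper, see sketch below
    return {"loss": loss}


def _patch_api_server():
    """Wrap vLLM's build_app() so every app it builds also serves /train."""
    from vllm.entrypoints.openai import api_server

    original = api_server.build_app

    def build_app(*args, **kwargs):
        app = original(*args, **kwargs)
        app.include_router(router)
        return app

    api_server.build_app = build_app

Because the router is mounted on the same FastAPI app, /train shares the port (and GPU process) with /completions and /score, which is exactly what makes the separate daemon unnecessary.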
@@ -1,8 +1,8 @@
 """Apollo training plugin for vLLM.
 
 Enables continuous fine-tuning alongside live inference by:
-1. Exporting CUDA IPC handles for weight sharing
-2. Providing a training worker daemon (/train endpoint)
+1. Exporting CUDA IPC handles for weight sharing (export_hook)
+2. Adding /train endpoint to vLLM's HTTP server (train_router)
 3. Block-level checkpoint sync to safetensors files
 
 Install: pip install -e /path/to/training
@@ -10,8 +10,10 @@ Then vLLM auto-loads via entry point.
 """
 
 from .export_hook import _patch_model_runner
+from .train_router import _patch_api_server
 
 
 def register():
     """Called by vLLM's plugin loader on startup."""
     _patch_model_runner()
+    _patch_api_server()
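And for the "lazy-loads HF model with vLLM weight views on first request" bullet, a hedged sketch: build_hf_model_from_views() and the module-level cache are assumptions, not this repo's API; the point is that the trainer's parameters alias vLLM's own GPU tensors, so each optimizer step lands in the serving weights with no pause/resume handshake:

# Minimal sketch of lazy init + in-place (HOGWILD-style) updates.
# build_hf_model_from_views() is hypothetical: it would wrap the tensors
# exported via export_hook so parameters alias vLLM's GPU memory.
import threading

import torch

_lock = threading.Lock()
_model = None
_optim = None


def _ensure_trainer(weight_views: dict[str, torch.Tensor]):
    """Build the trainer once, on the first /train request."""
    global _model, _optim
    with _lock:
        if _model is None:
            _model = build_hf_model_from_views(weight_views)  # hypothetical
            _optim = torch.optim.AdamW(_model.parameters(), lr=1e-5)


def run_train_step(batch) -> float:
    """One step: gradients and updates land directly in serving weights."""
    out = _model(**batch)
    out.loss.backward()
    _optim.step()
    _optim.zero_grad(set_to_none=True)
    return float(out.loss.detach())

Per the docstring in the diff above, vLLM discovers the plugin through a Python entry point at startup and calls register(), which installs both the model-runner export hook and the API-server patch.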