consciousness/training/apollo_plugin/export_hook.py

"""Monkey-patch vLLM to export weight IPC handles on startup.

Usage — install the apollo_plugin package:

    pip install -e /path/to/training

vLLM then discovers and loads the plugin through its entry point.
To load only this plugin, filter by name:

    VLLM_PLUGINS=apollo vllm serve Qwen/Qwen3.5-27B ...

The hook patches the load_model method of vLLM's GPU worker so that
CUDA IPC handles are exported once model loading completes. The handles
are saved to HANDLE_PATH, a file that the Apollo training process reads.
"""

import atexit
import torch
from pathlib import Path

HANDLE_PATH = "/tmp/vllm_weight_handles.pt"
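

# A small sketch for checking the entry-point discovery described in the
# module docstring. Assumption: the plugin is registered under vLLM's
# general-plugin entry-point group, "vllm.general_plugins". The helper name
# `list_vllm_plugins` is hypothetical and not part of vLLM. The `group=`
# keyword requires Python 3.10 or newer.
from importlib.metadata import entry_points


def list_vllm_plugins(group="vllm.general_plugins"):
    """Return the sorted names of installed entry points in `group`."""
    return sorted(ep.name for ep in entry_points(group=group))

# After `pip install -e /path/to/training`, the name used with VLLM_PLUGINS
# ("apollo" in the docstring example) would be expected in this list,
# provided the package registers itself under that name.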


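def export_model_weights(model):
    """Export CUDA IPC handles for all CUDA-resident model parameters.

    The handles are written to HANDLE_PATH, keyed by parameter name.
    """
    from torch.multiprocessing.reductions import reduce_tensor

    handles = {}
    total_bytes = 0

    for name, param in model.named_parameters():
        # Only CUDA tensors can be shared over CUDA IPC; skip the rest.
        if param.device.type != 'cuda':
            continue
        # reduce_tensor returns a picklable (rebuild_fn, args) pair that
        # another process can call to map the same GPU memory.
        handle = reduce_tensor(param.data)
        handles[name] = {
            'handle': handle,
            'shape': list(param.shape),
            'dtype': str(param.dtype),
        }
        total_bytes += param.nelement() * param.element_size()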

    torch.save(handles, HANDLE_PATH)
    print(f"[apollo] Exported {len(handles)} weight handles "
          f"({total_bytes / 1e9:.1f} GB) to {HANDLE_PATH}")
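

# A minimal consumer-side sketch, assuming the reading process has already
# loaded the exported dict (for example with torch.load on HANDLE_PATH,
# which on recent PyTorch versions likely needs weights_only=False because
# the entries hold pickled callables, not plain tensors).
# `rebuild_exported_weights` is a hypothetical helper, not part of vLLM or
# PyTorch. It relies only on the layout written above: each 'handle' is the
# (rebuild_fn, rebuild_args) pair returned by reduce_tensor.
def rebuild_exported_weights(handles):
    """Rebuild each exported entry by calling its (rebuild_fn, args) pair."""
    weights = {}
    for name, entry in handles.items():
        rebuild_fn, rebuild_args = entry['handle']
        # For real handles this maps the exporter's GPU memory; it does
        # not copy the weights.
        weights[name] = rebuild_fn(*rebuild_args)
    return weights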


def _patch_model_runner():
    """Patch gpu_worker so handles are exported after model loading.

    vLLM loads the model in a subprocess (EngineCore_DP0), so patching
    a worker instance from the parent process has no effect. Instead,
    patch Worker.load_model at the module level, since the subprocess
    imports the same modules.
    """
    from vllm.v1.worker import gpu_worker

    original_load = gpu_worker.Worker.load_model

    def patched_load(self, *args, **kwargs):
        result = original_load(self, *args, **kwargs)
        try:
            export_model_weights(self.model_runner.model)
            # Set model path for training router
            model_path = self.vllm_config.model_config.model
            from .train_router import set_model_path
            set_model_path(model_path)
        except Exception as e:
            # Never let a hook failure break model loading itself.
            print(f"[apollo] Failed to export weights or set model path: {e}")
        return result

    gpu_worker.Worker.load_model = patched_load
    print("[apollo] Weight export hook installed")
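

# The wrap-and-replace pattern used above, sketched in isolation so it can
# be tried without vLLM. `_wrap_after` is a hypothetical helper introduced
# only for illustration; the hook itself patches Worker.load_model directly.
import functools


def _wrap_after(cls, method_name, callback):
    """Replace cls.method_name with a wrapper that runs callback(self) after it.

    Returns the original method so the patch can be undone by the caller.
    """
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        try:
            callback(self)
        except Exception as e:
            # A failing callback must not change the wrapped method's result.
            print(f"[apollo] post-{method_name} callback failed: {e}")
        return result

    setattr(cls, method_name, wrapper)
    return original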


# Commit history for this file (from the repository blame view):
#
# 2026-03-30 22:20:04 -04:00
#   vllm weight export hook: monkey-patches model runner to save IPC
#   handles on load
#
# 2026-03-30 22:53:17 -04:00
#   apollo-checkpoint: efficient diff-based GPU weight checkpointing
#   Rust tool that mmaps previous checkpoint, diffs against live GPU
#   weights (via CUDA IPC handles), and only writes changed blocks. For
#   small behavioral training steps, turns 54GB write into ~500MB.
#   Also includes vllm_export_hook.py with direct source patch approach:
#   exports IPC handles from vLLM's worker subprocess after model load.
#   Run every 10 minutes via cron to protect against vLLM crashes.
#   Daily rsync to moria for long-term storage.
#
# 2026-04-15 23:16:53 -04:00
#   training: restructure as vLLM plugin package
#   - Convert to installable package with entry points for vLLM
#     auto-discovery
#   - Add checkpoint_sync.py: Python replacement for Rust checkpoint binary
#   - Block-level diffing of safetensors files (4KB blocks)
#   - vLLM→HF weight name conversion built-in
#   - Scheduled 10min after training jobs (batched)
#   - API change: /train now takes raw token IDs
#     (context_ids + continuation_ids)
#   - No tokenizer on training side, client owns tokenization
#   - Remove superseded code: standalone scripts, Rust binary,
#     tokenizer helpers
#   Install: pip install -e ./training
#   Then vLLM auto-loads via entry point.
#   Co-Authored-By: Proof of Concept <poc@bcachefs.org>
#
# 2026-04-16 00:48:05 -04:00
#   training: integrate /train into vLLM process (no separate daemon)
#   Remove standalone worker.py daemon. Training now runs inside vLLM:
#   - train_router.py: FastAPI router patched into vLLM's build_app()
#   - /train served on same port as /completions, /score
#   - Lazy-loads HF model with vLLM weight views on first request
#   - HOGWILD training: no pause, weights updated in-place
#   The previous architecture had a separate daemon on port 8080 that
#   communicated with vLLM via pause/resume endpoints. This was wrong:
#   training should run in-process, sharing GPU memory directly.
#   Co-Authored-By: Proof of Concept <poc@bcachefs.org>