- Add training_worker.py: long-lived subprocess that handles GPU
  training work and owns the HF model wrapper (views into vLLM GPU
  memory), the Apollo optimizer, and checkpoint sync
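
A minimal sketch of the worker's request loop, assuming a blocking
ZMQ REP socket on an IPC endpoint; the endpoint path, message schema,
and run_train_step() are hypothetical stand-ins for the real code:

    import zmq

    def run_train_step(req: dict) -> dict:
        # Placeholder: the real worker steps the Apollo optimizer on
        # the HF wrapper whose parameters alias vLLM's GPU memory.
        return {"status": "ok", "tokens": len(req.get("context_ids", []))}

    def serve(endpoint: str = "ipc:///tmp/train_worker.sock") -> None:
        ctx = zmq.Context()
        sock = ctx.socket(zmq.REP)
        sock.bind(endpoint)
        while True:
            req = sock.recv_json()        # blocking recv is fine here:
            if req.get("cmd") == "train": # this process only trains
                reply = run_train_step(req)
            else:
                reply = {"status": "error", "detail": "unknown cmd"}
            sock.send_json(reply)

    if __name__ == "__main__":
        serve()
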
- train_router.py: forward /train requests over async ZMQ instead of
  running training in-process; add /checkpoint and /train/status
  endpoints
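
The forwarding pattern, sketched with zmq.asyncio and FastAPI; the
route name comes from this commit, everything else is illustrative.
A lock serializes the REQ/REP round trips:

    import asyncio
    import zmq
    import zmq.asyncio
    from fastapi import APIRouter

    router = APIRouter()
    ctx = zmq.asyncio.Context()
    sock = ctx.socket(zmq.REQ)
    sock.connect("ipc:///tmp/train_worker.sock")
    lock = asyncio.Lock()  # REQ/REP needs strict send/recv alternation

    @router.post("/train")
    async def train(payload: dict) -> dict:
        # Awaiting the round trip keeps vLLM's event loop free to
        # serve inference while the subprocess does the training.
        async with lock:
            await sock.send_json({"cmd": "train", **payload})
            return await sock.recv_json()
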
- export_hook.py: store model_path in __metadata__ so the training
  worker can find it without cross-process communication
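
The handoff relies on safetensors' string-valued metadata dict; the
model_path key is this repo's convention, not a safetensors standard.
Roughly:

    import torch
    from safetensors import safe_open
    from safetensors.torch import save_file

    # Export hook (API server process): stash the path in the file.
    save_file({"w": torch.zeros(2, 2)}, "export.safetensors",
              metadata={"model_path": "/models/base"})

    # Training worker (separate process): read it back, no IPC needed.
    with safe_open("export.safetensors", framework="pt") as f:
        model_path = f.metadata()["model_path"]
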
- This fixes two bugs:
  1. Process boundary: model_path was set in the worker process but
     needed in the API server process
  2. Blocking event loop: training blocked vLLM's async event loop
Architecture: vLLM API server <-> ZMQ <-> training subprocess

The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and serves training requests without blocking inference.
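
Conceptually, the one-time handle import looks like the sketch below.
It leans on PyTorch's private CUDA-IPC helpers (the machinery
torch.multiprocessing uses internally), so treat it as illustrative,
not the exact implementation; the handle metadata is assumed to
travel over ZMQ:

    import torch
    from torch.multiprocessing.reductions import rebuild_cuda_tensor

    def export_handle(param: torch.Tensor) -> dict:
        # vLLM side: share the parameter's CUDA storage once.
        (device, handle, size_bytes, offset_bytes, ref_handle,
         ref_offset, event_handle, event_sync) = \
            param.untyped_storage()._share_cuda_()
        return dict(size=tuple(param.size()), stride=param.stride(),
                    offset=param.storage_offset(), dtype=param.dtype,
                    device=device, handle=handle, size_bytes=size_bytes,
                    offset_bytes=offset_bytes, ref_handle=ref_handle,
                    ref_offset=ref_offset, event_handle=event_handle,
                    event_sync=event_sync)

    def import_handle(m: dict) -> torch.Tensor:
        # Worker side: rebuild a tensor that aliases vLLM's memory, so
        # optimizer updates are immediately visible to inference.
        return rebuild_cuda_tensor(
            torch.Tensor, m["size"], m["stride"], m["offset"],
            torch.UntypedStorage, m["dtype"], m["device"], m["handle"],
            m["size_bytes"], m["offset_bytes"], False,
            m["ref_handle"], m["ref_offset"],
            m["event_handle"], m["event_sync"])
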
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
- Convert to installable package with entry points for vLLM auto-discovery
- Add checkpoint_sync.py: Python replacement for the Rust checkpoint
  binary
  - Block-level diffing of safetensors files (4 KB blocks; sketch
    below)
  - Built-in vLLM→HF weight name conversion
  - Scheduled 10 min after training jobs (batched)
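
The diffing step in sketch form (4 KB granularity as above; the file
handling and patch format are simplified assumptions):

    BLOCK = 4096  # diff granularity

    def dirty_blocks(old_path: str, new_path: str):
        """Yield (offset, data) for each 4 KB block that changed."""
        with open(old_path, "rb") as old, open(new_path, "rb") as new:
            offset = 0
            while True:
                a, b = old.read(BLOCK), new.read(BLOCK)
                if not b:
                    break
                if a != b:
                    yield offset, b
                offset += BLOCK

    def apply_blocks(target_path: str, blocks) -> None:
        # Patch the stale copy in place with only the changed blocks.
        with open(target_path, "r+b") as f:
            for offset, data in blocks:
                f.seek(offset)
                f.write(data)
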
- API change: /train now takes raw token IDs (context_ids +
  continuation_ids)
- No tokenizer on the training side; the client owns tokenization
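
What a client call looks like after the change; field names match
this commit, while host, port, and the example IDs are made up:

    import requests

    resp = requests.post("http://localhost:8000/train", json={
        "context_ids": [101, 2009, 2001],       # prompt, pre-tokenized
        "continuation_ids": [1996, 4633, 102],  # target continuation
    })
    print(resp.json())
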
- Remove superseded code: standalone scripts, Rust binary, tokenizer helpers
Install: pip install -e ./training
Then vLLM auto-loads the plugin via its entry point.
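
For reference, the registration, assuming vLLM's general-plugins
entry-point group; shown setup.py-style (a pyproject entry-points
table is equivalent), with hypothetical package and function names:

    from setuptools import setup

    setup(
        name="training",
        entry_points={
            # Group name follows vLLM's plugin discovery; the target
            # function registers the /train router and export hook.
            "vllm.general_plugins": [
                "train_plugin = training.plugin:register",
            ],
        },
    )
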
Co-Authored-By: Proof of Concept <poc@bcachefs.org>