Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC shared weight memory between vLLM and the training process:

- `apollo_mini.py`: rank-1 optimizer (SGD memory, AdamW quality)
- `apollo_worker.py`: HTTP daemon coordinating training with vLLM
- `weight_mapping.py`: vLLM merged → HF separate layout (zero-copy views)
- `training_example.py`: tokenization with chat template
- `export_weights.py`: CUDA IPC handle export from vLLM
- `train.py`: standalone training script (alternative to daemon)
- `DESIGN.md`: architecture and protocol documentation

Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200). Apollo-Mini rank-1 projection + scaling + in-place update confirmed.

Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>
# Apollo Mini Training System Design

## Overview
This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.
## Architecture

### Components
- **Apollo Worker Daemon** (`apollo_worker.py`)
  - Listens over HTTP/HTTPS for training requests
  - Manages the vLLM pause/resume cycle
  - Executes APOLLO-Mini training with `torch.enable_grad()`
  - Saves checkpoints and training metadata
  - Runs on the B200 server alongside vLLM
- **Training Signal Agent** (to be built)
  - Runs online like surface-observe
  - Analyzes recent conversation windows
  - Identifies improvement opportunities
  - Requests training from the Apollo Worker
  - Runs on Moria (separate from the B200)
- **vLLM Inference Engine**
  - Continues serving during non-training periods
  - Pauses during training steps
  - Shares GPU memory with the training process
### Communication Protocol
#### POST /train

```json
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}
```
Response:

```json
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}
```
#### GET /status/{job_id}

Response:

```json
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}
```
#### GET /checkpoints

Response:

```json
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
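A minimal client for this protocol might look like the following sketch; the daemon address and the `build_train_request`/`submit` helpers are illustrative names, not part of the shipped code:

```python
import json
import urllib.request

APOLLO_URL = "http://localhost:8000"  # assumed daemon address

def build_train_request(samples, learning_rate=1e-5, max_steps=100):
    """Assemble a POST /train payload matching the schema above."""
    return {
        "training_data": {
            "samples": samples,
            "config": {"learning_rate": learning_rate, "max_steps": max_steps},
        }
    }

def submit(payload):
    """POST the payload; returns the parsed {"job_id", "status", ...} reply."""
    req = urllib.request.Request(
        f"{APOLLO_URL}/train",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_train_request([{
    "input": "conversation context",
    "expected_output": "better response",
    "rationale": "why this is better",
}])
# submit(payload) returns a job_id, which is then polled via GET /status/{job_id}
```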
## Training Pipeline
### 1. Signal Detection

- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds a training dataset with input/expected_output/rationale
### 2. Training Request

- Agent sends `POST /train` with training samples
- Apollo Worker accepts the job and begins execution
### 3. vLLM Pause

- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available
### 4. Model Loading & Training

- Load model weights (shared with vLLM via memory mapping)
- Enable gradients with `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply scaled updates to the full gradient
- Track loss history
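The loop above can be sketched for a single 2-D weight. This is a hedged reconstruction from the description here (fixed random rank-1 projection, AdamW-style moments in the projected space, a tensor-wise norm-ratio scaling applied to the raw gradient), not the code shipped in `apollo_mini.py`:

```python
import torch

def apollo_mini_step(param, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style update for a single 2-D weight (sketch).

    For brevity the per-parameter state lives on the param object; a real
    optimizer keeps it in its state dict.
    """
    g = param.grad  # (out, in)
    if not hasattr(param, "_proj"):
        # fixed random rank-1 projection -- never recomputed via SVD
        param._proj = torch.randn(g.shape[0], 1, device=g.device) / g.shape[0] ** 0.5
        param._m = torch.zeros(1, g.shape[1], device=g.device)
        param._v = torch.zeros(1, g.shape[1], device=g.device)

    r = param._proj.T @ g                             # project to rank-1 subspace
    param._m.mul_(beta1).add_(r, alpha=1 - beta1)     # moments in projected space
    param._v.mul_(beta2).addcmul_(r, r, value=1 - beta2)
    update_r = param._m / (param._v.sqrt() + eps)

    # tensor-wise scaling: how strongly Adam would rescale the projected grad
    scale = update_r.norm() / (r.norm() + eps)
    param.data.add_(g, alpha=-lr * scale.item())      # scaled raw grad, in place
```

The optimizer state is two `(1, in)` moment vectors plus one `(out, 1)` projection per weight, which is where the near-zero memory footprint comes from.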
### 5. Checkpoint Saving

- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory
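A checkpoint writer along these lines would produce the files shown in the protocol section; the field names and directory are placeholders, not the shipped implementation:

```python
import os
import time

import torch

def save_checkpoint(model_state, job_id, loss_history, num_samples,
                    ckpt_dir="checkpoints"):
    """Bundle weights and training metadata into one .pt file (sketch)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"checkpoint_{job_id}.pt")
    torch.save({
        "model_state_dict": model_state,     # tensors
        "job_id": job_id,                    # metadata for GET /status
        "training_samples": num_samples,
        "loss_history": loss_history,
        "saved_at": time.time(),
    }, path)
    return path
```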
### 6. vLLM Resume

- Signal vLLM to resume inference
- KV cache is rebuilt as new requests arrive
- Updated weights are now active in inference
## Memory Management

### APOLLO-Mini Advantages

- Optimizer state: ~kilobytes (vs. gigabytes for AdamW)
- Gradient memory: only for the current batch (not the full model)
- Activation memory: only for the current training step
- Total overhead: ~55 GB for full fine-tuning, much less for LoRA
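The ~55 GB figure can be sanity-checked with back-of-envelope arithmetic, assuming bf16 gradients for a 27B-parameter model (the exact parameter count and the 4 KiB rank-1 state are illustrative guesses):

```python
def fine_tune_overhead_gb(n_params, bytes_per_param=2, rank1_state_bytes=4096):
    """Rough extra memory for a full fine-tuning step: bf16 gradients for
    every parameter plus APOLLO-Mini's near-constant rank-1 optimizer state."""
    return (n_params * bytes_per_param + rank1_state_bytes) / 1e9

apollo_grad_gb = fine_tune_overhead_gb(27e9)  # gradients dominate: ~54 GB
adamw_extra_gb = 27e9 * 8 / 1e9               # fp32 moments AdamW would add
```

Gradients land at roughly 54 GB, matching the estimate above, while AdamW's two fp32 moments would add another ~216 GB on top.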
### vLLM Memory Reclamation

- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights
### Strategy

- **LoRA + APOLLO-Mini**: train only adapter parameters (~100 MB at rank 16)
- **Time-multiplexed**: pause inference, train, resume
- **Nightly checkpoints**: save full model state overnight when inference load is low
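The ~100 MB rank-16 estimate is roughly reproducible. The layer count, hidden width, and number of adapted matrices below are illustrative guesses, not the real Qwen3.5-27B config:

```python
def lora_params(n_layers, hidden, rank, adapted_per_layer=4):
    """Count LoRA parameters: each adapted (hidden x hidden) projection
    gains an A (hidden x rank) and a B (rank x hidden) matrix."""
    return n_layers * adapted_per_layer * 2 * hidden * rank

n = lora_params(n_layers=64, hidden=5120, rank=16)
adapter_mb = n * 2 / 1e6  # bf16 bytes -> MB; lands near the ~100 MB estimate
```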
## Implementation Phases

### Phase 1: Prototype (Current)

- Apollo Worker daemon skeleton
- vLLM pause/resume integration
- Basic training loop with a placeholder model
- Checkpoint saving/loading
- Test with a small dataset
### Phase 2: Integration

- Connect to the actual Qwen3.5-27B model
- Implement the vLLM pause/resume API
- Memory mapping for weight sharing
- Training signal agent MVP
- End-to-end test with real conversations
### Phase 3: Production

- APOLLO-Mini implementation (rank-1 projection)
- LoRA adapter integration
- Nightly checkpoint scheduling
- Training metrics and monitoring
- Rollback mechanism for bad checkpoints
## Technical Challenges

### 1. vLLM Pause/Resume

- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy
### 2. Gradient Computation

- `torch.inference_mode()` blocks gradients, and tensors created under it can never join an autograd graph
- Training must re-enable autograd (e.g. `torch.enable_grad()`) and build its inputs outside inference mode
- CUDA graphs are incompatible with training (use eager mode)
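The interplay of these context managers is easy to verify on CPU: `torch.enable_grad()` reliably re-enters autograd inside `torch.no_grad()`, whereas `torch.inference_mode()` is stricter, since tensors created under it are permanently excluded from autograd:

```python
import torch

x = torch.ones(2, requires_grad=True)

with torch.no_grad():                 # inference-style region: autograd off
    assert not torch.is_grad_enabled()
    with torch.enable_grad():         # re-enabled for the training step
        y = (x * 3).sum()             # graph is recorded here

y.backward()
assert torch.equal(x.grad, torch.full((2,), 3.0))
```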
### 3. Memory Sharing

- Model weights must be shared between vLLM and the training process
- Memory mapping or zero-copy IPC
- Tensor parallelism consistency (if using TP)
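The zero-copy idea behind `weight_mapping.py`'s merged-to-separate mapping can be demonstrated in-process with tensor views (the sizes below are toy values, not Qwen3.5's):

```python
import torch

# vLLM stores Q, K and V merged into one tensor; the HF layout wants them
# separate. Slicing returns views over the same storage, so an in-place
# update through a view changes the merged weight with no copy.
hidden, q_size, kv_size = 8, 8, 4
merged_qkv = torch.randn(q_size + 2 * kv_size, hidden)

q = merged_qkv[:q_size]                          # view, not a copy
k = merged_qkv[q_size:q_size + kv_size]
v = merged_qkv[q_size + kv_size:]

assert q.data_ptr() == merged_qkv.data_ptr()     # q shares the same storage

orig_q = q.clone()
q.add_(1.0)                                      # train through the view...
assert torch.equal(merged_qkv[:q_size], orig_q + 1.0)  # ...the merged weight sees it
```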
### 4. APOLLO-Mini Implementation

- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with existing optimizer infrastructure
## Next Steps

- **Test vLLM pause/resume**: verify the API works and measure overhead
- **Implement weight sharing**: memory-map model weights between processes
- **Build the training signal agent**: an MVP that identifies improvement opportunities
- **Test end-to-end**: run a training job with real conversation data
- **Optimize**: measure memory usage, training time, and inference impact
## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`