apollo-mini training system: initial implementation
Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC shared weight memory between vLLM and the training process:

- apollo_mini.py: rank-1 optimizer (SGD memory, AdamW quality)
- apollo_worker.py: HTTP daemon coordinating training with vLLM
- weight_mapping.py: vLLM merged → HF separate layout (zero-copy views)
- training_example.py: tokenization with chat template
- export_weights.py: CUDA IPC handle export from vLLM
- train.py: standalone training script (alternative to daemon)
- DESIGN.md: architecture and protocol documentation

Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200). Apollo-Mini rank-1 projection + scaling + in-place update confirmed.

Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>
Parent: 13453606ae
Commit: c5d7d8cb5d
7 changed files with 1484 additions and 0 deletions
training/DESIGN.md: 197 lines (new file)
# Apollo Mini Training System Design

## Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.

## Architecture

### Components
1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM

2. **Training Signal Agent** (to be built)
   - Runs online, like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)

3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process
### Communication Protocol

```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}
Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints
Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
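A minimal Python client for this protocol might look like the following sketch. `build_train_request` and `submit` are illustrative names (not part of the shipped code); the payload shape is taken directly from the examples above.

```python
import json
import urllib.request

def build_train_request(samples, learning_rate=1e-5, max_steps=100):
    """Assemble a POST /train payload matching the protocol above.

    `samples` is a list of {"input", "expected_output", "rationale"} dicts.
    """
    return {
        "training_data": {
            "samples": samples,
            "config": {
                "learning_rate": learning_rate,
                "max_steps": max_steps,
            },
        }
    }

def submit(base_url, payload):
    """POST the job and return the parsed {job_id, status, message} reply."""
    req = urllib.request.Request(
        base_url + "/train",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

After submitting, the caller polls `GET /status/{job_id}` until the status reaches `completed`.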
## Training Pipeline

### 1. Signal Detection
- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds a training dataset with input/expected_output/rationale

### 2. Training Request
- Agent sends POST /train with training samples
- Apollo Worker accepts the job and begins execution

### 3. vLLM Pause
- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available

### 4. Model Loading & Training
- Load model weights (shared with vLLM via memory mapping)
- Enable gradients: `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply the scaled updates to the full gradient
- Track loss history
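The inner-loop steps above can be sketched for a single 2-D weight as follows. This is an illustrative simplification of the APOLLO-Mini update (bias correction and weight decay omitted; `apollo_mini_step` and the state layout are names invented for this sketch, not the `apollo_mini.py` API):

```python
import torch

def apollo_mini_step(param, grad, state, lr=1e-5, betas=(0.9, 0.999), eps=1e-8):
    """One APOLLO-Mini-style update for a 2-D weight (illustrative sketch).

    The gradient is projected to rank 1 with a fixed random vector, AdamW-style
    moments live only in that tiny subspace, and a tensor-wise scale derived
    from the projected update is applied to the raw gradient.
    """
    m, n = grad.shape
    if "proj" not in state:
        # fixed random rank-1 projection (not SVD), kept for the whole run
        g = torch.Generator().manual_seed(0)
        state["proj"] = torch.randn(m, 1, generator=g) / m ** 0.5
        state["step"] = 0
        state["exp_avg"] = torch.zeros(1, n)
        state["exp_avg_sq"] = torch.zeros(1, n)
    state["step"] += 1
    r = state["proj"].T @ grad                       # projected gradient, shape (1, n)
    state["exp_avg"].mul_(betas[0]).add_(r, alpha=1 - betas[0])
    state["exp_avg_sq"].mul_(betas[1]).addcmul_(r, r, value=1 - betas[1])
    upd = state["exp_avg"] / (state["exp_avg_sq"].sqrt() + eps)
    # tensor-wise scaling: how strongly the moments rescale the projected gradient
    scale = upd.norm() / (r.norm() + eps)
    param.add_(grad, alpha=-lr * scale.item())       # in-place update on the full weight
    return scale.item()
```

The optimizer state per tensor is only the projection vector plus two rank-1 moment rows, which is where the near-zero memory footprint comes from.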
### 5. Checkpoint Saving
- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory

### 6. vLLM Resume
- Signal vLLM to resume inference
- KV cache is rebuilt as new requests arrive
- Updated weights are now active in inference
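The full pause → train → checkpoint → resume cycle can be sketched as one guarded function. `pause_generation`/`resume_generation` are placeholder names (the real vLLM hooks still need testing, per Technical Challenges), so treat this as structure rather than a working integration:

```python
def run_training_cycle(engine, train_fn, checkpoint_fn):
    """Pause inference, run one training job, checkpoint, then resume.

    `engine` is assumed to expose pause_generation()/resume_generation();
    those names are placeholders until the vLLM API is verified.
    """
    engine.pause_generation()               # freeze in-flight requests, free KV cache
    try:
        loss_history = train_fn()           # APOLLO-Mini loop over the samples
        return checkpoint_fn(loss_history)  # persist weights + metadata
    finally:
        engine.resume_generation()          # always restart serving, even on failure
```

The `try/finally` matters: a failed training step must never leave inference paused.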
## Memory Management

### APOLLO-Mini Advantages
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: only for the current batch (not the full model)
- **Activation memory**: only for the current training step
- **Total overhead**: ~55 GB for full fine-tuning, much less for LoRA

### vLLM Memory Reclamation
- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights

### Strategy
1. **LoRA + APOLLO-Mini**: train only adapter parameters (~100 MB for rank 16)
2. **Time-multiplexed**: pause inference, train, resume
3. **Nightly checkpoints**: save full model state overnight when inference load is low
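As a sanity check on the ~55 GB figure above (and on the checkpoint `size` in the protocol examples), full weights in 16-bit precision for a model of this class land right around that number, assuming roughly 27.5B parameters:

```python
params = 27.5e9          # approximate parameter count (assumption)
bytes_per_param = 2      # bf16 / fp16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")  # 55 GB
```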
## Implementation Phases

### Phase 1: Prototype (Current)
- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with a small dataset

### Phase 2: Integration
- [ ] Connect to the actual Qwen3.5-27B model
- [ ] Implement the vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations

### Phase 3: Production
- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints
## Technical Challenges

### 1. vLLM Pause/Resume
- vLLM's `pause_generation()` API needs testing
- In-flight request handling during the pause
- KV cache invalidation strategy

### 2. Gradient Computation
- `torch.inference_mode()` blocks gradient tracking
- Training must run under `torch.enable_grad()`
- CUDA graphs are incompatible with training (use eager mode)
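A minimal shape of that override, with a generic `model`/`loss_fn` standing in for the real training step (a sketch only; how this interacts with tensors created under `inference_mode` still needs validation on the actual setup):

```python
import torch

def training_step(model, loss_fn, inputs, targets):
    """Run one supervised step with autograd explicitly enabled."""
    with torch.enable_grad():   # override any surrounding no-grad context
        model.train()
        loss = loss_fn(model(inputs), targets)
        loss.backward()         # gradients now flow into model parameters
    return loss.detach()
```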
### 3. Memory Sharing
- Model weights must be shared between vLLM and the training process
- Memory mapping or zero-copy IPC
- Tensor-parallelism consistency (if using TP)

### 4. APOLLO-Mini Implementation
- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with existing optimizer infrastructure
## Next Steps

1. **Test vLLM pause/resume**: verify the API works and measure overhead
2. **Implement weight sharing**: memory-map model weights between processes
3. **Build the training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: run a training job with real conversation data
5. **Optimize**: measure memory usage, training time, and inference impact

## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`