# Apollo Mini Training System Design

## Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while preserving inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (reported as kilobytes for a 7B model), combined with LoRA adapters, makes the memory overhead small enough to fit within the KV-cache space that vLLM reclaims while paused.

## Architecture

### Components

1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM

2. **Training Signal Agent** (to be built)
   - Runs online, like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)

3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process

### Communication Protocol

```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}
Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints
Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
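
For concreteness, the `POST /train` call above can be issued from the signal agent with a short stdlib client. This is a sketch: the worker URL/port and the function names are placeholders, not part of the existing code.

```python
import json
import urllib.request

# Hypothetical worker endpoint; the real host/port depend on deployment.
APOLLO_WORKER_URL = "http://localhost:8500"

def build_train_request(samples, learning_rate=1e-5, max_steps=100):
    """Assemble a /train payload matching the protocol above."""
    return {
        "training_data": {
            "samples": samples,
            "config": {
                "learning_rate": learning_rate,
                "max_steps": max_steps,
            },
        }
    }

def submit_training_job(samples, **config):
    """POST the payload to the Apollo Worker and return the parsed response."""
    payload = json.dumps(build_train_request(samples, **config)).encode()
    req = urllib.request.Request(
        f"{APOLLO_WORKER_URL}/train",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The returned `job_id` can then be fed to `GET /status/{job_id}` to track progress.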

## Training Pipeline

### 1. Signal Detection
- The training signal agent monitors conversation logs
- It identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- It builds a training dataset of input/expected_output/rationale triples

### 2. Training Request
- The agent sends POST /train with training samples
- The Apollo Worker accepts the job and begins execution
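
After submitting, the agent needs to wait for the job to finish before acting on the checkpoint. A minimal polling loop over `GET /status/{job_id}`, with the fetch function injected so the loop itself has no hard-coded worker address (the callable shape is this sketch's assumption):

```python
import time

def wait_for_job(job_id, fetch, poll_s=5.0, max_polls=720):
    """Poll the worker's job status until it finishes.

    `fetch` is any callable mapping a job_id to the parsed /status response
    (e.g. an HTTP GET against the Apollo Worker); it is injected here so the
    loop can be tested without a live server.
    """
    for _ in range(max_polls):
        status = fetch(job_id)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```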
### 3. vLLM Pause
- The Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available
### 4. Model Loading & Training
- Load model weights (shared with vLLM via memory mapping)
- Enable gradients with `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply the scaled update to the full gradient
- Track loss history
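
The projected-space update in the loop above can be sketched as follows. This is an illustrative NumPy rendition of an APOLLO-Mini-style step (real training would operate on PyTorch tensors), not the paper's reference code; names and state layout are this sketch's own.

```python
import numpy as np

rng = np.random.default_rng(0)

def apollo_mini_step(grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style update for a single 2-D gradient.

    Adam-style moments live only in the rank-1 projected space, so the
    optimizer state per (m, n) tensor is two (m, 1) vectors instead of two
    full (m, n) matrices.
    """
    m_rows, n_cols = grad.shape
    if "proj" not in state:
        # Fixed random rank-1 projection (no SVD), kept for the whole run.
        state["proj"] = rng.standard_normal((n_cols, 1)) / np.sqrt(n_cols)
        state["m"] = np.zeros((m_rows, 1))
        state["v"] = np.zeros((m_rows, 1))
        state["t"] = 0
    state["t"] += 1

    r = grad @ state["proj"]                              # project to rank 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * r     # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * r**2  # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    r_adapted = m_hat / (np.sqrt(v_hat) + eps)

    # Tensor-wise scaling factor: how strongly Adam would rescale the
    # projected gradient, applied here to the full-rank gradient.
    scale = np.linalg.norm(r_adapted) / (np.linalg.norm(r) + eps)
    return -lr * scale * grad
```

The optimizer state per tensor is two (m, 1) vectors plus the fixed projection, which is where the "near-zero optimizer state" claim comes from.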
### 5. Checkpoint Saving
- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory
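
The metadata half of this step could look like the sketch below. The fields mirror the `/status` and `/checkpoints` responses above; the .json-beside-.pt convention is an assumption of this sketch, not an existing layout.

```python
import json
import time
from pathlib import Path

def save_checkpoint_metadata(ckpt_dir, job_id, loss_history, n_samples):
    """Write training metadata beside the checkpoint .pt file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    meta = {
        "job_id": job_id,
        "training_samples": n_samples,
        "loss_history": loss_history,
        "checkpoint_path": str(ckpt_dir / f"checkpoint_{job_id}.pt"),
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    meta_path = ckpt_dir / f"checkpoint_{job_id}.json"
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path
```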
### 6. vLLM Resume
- Signal vLLM to resume inference
- The KV cache is rebuilt as new requests arrive
- The updated weights are now active for inference
## Memory Management

### APOLLO-Mini Advantages
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: only for the current batch, not the full model
- **Activation memory**: only for the current training step
- **Total overhead**: ~55GB for full fine-tuning, much less for LoRA

### vLLM Memory Reclamation
- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights
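
A back-of-the-envelope for why the reclaimed KV cache is so large: the cache holds one K and one V tensor per layer per token. The dimensions below are placeholders for illustration, not the real Qwen3.5-27B config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """Rough KV-cache footprint: one K and one V tensor per layer, in fp16/bf16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per

# Placeholder dimensions (NOT the real Qwen3.5-27B config): 60 layers,
# 8 KV heads of dim 128, 8192-token context, 16 concurrent sequences.
print(kv_cache_bytes(60, 8, 128, 8192, 16) / 1e9, "GB")  # ~32 GB
```

Tens of gigabytes at moderate batch sizes, which is the headroom the training step inherits when inference pauses.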
### Strategy

1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100MB for rank-16)
2. **Time-multiplexed**: Pause inference, train, resume
3. **Nightly checkpoints**: Save full model state overnight when inference load is low
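
The ~100MB rank-16 figure follows from LoRA's parameter count: each adapted (m, n) weight gains an A (m, r) and a B (r, n) matrix. The layer geometry below is a placeholder, not the actual Qwen3.5-27B config.

```python
def lora_adapter_bytes(layer_shapes, rank=16, bytes_per_param=2):
    """LoRA adapter memory: each adapted (m, n) weight gains A (m, r) and
    B (r, n), i.e. rank * (m + n) extra parameters."""
    return sum(rank * (m + n) * bytes_per_param for m, n in layer_shapes)

# Placeholder geometry (NOT the real Qwen3.5-27B config): 60 blocks,
# adapting four 5120 x 5120 attention projections per block.
shapes = [(5120, 5120)] * 4 * 60
print(lora_adapter_bytes(shapes) / 1e6, "MB")  # ~79 MB, same order as ~100MB
```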
## Implementation Phases

### Phase 1: Prototype (Current)
- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with small dataset

### Phase 2: Integration
- [ ] Connect to the actual Qwen3.5-27B model
- [ ] Implement the vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations

### Phase 3: Production
- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints
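
The Phase 3 rollback item could be as simple as re-pointing a `current` symlink at the previous checkpoint. A sketch; the symlink-beside-.pt layout is a hypothetical convention, not something the worker implements today:

```python
from pathlib import Path

def rollback(ckpt_dir, current_link="current"):
    """Re-point the `current` symlink at the newest checkpoint other than
    the one it targets now (the presumed-bad one)."""
    ckpt_dir = Path(ckpt_dir)
    link = ckpt_dir / current_link
    bad = link.resolve() if link.exists() else None
    candidates = sorted(
        (p for p in ckpt_dir.glob("checkpoint_*.pt") if p.resolve() != bad),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError("no earlier checkpoint to roll back to")
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(candidates[-1].name)
    return link
```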
## Technical Challenges

### 1. vLLM Pause/Resume
- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy

### 2. Gradient Computation
- `torch.inference_mode()` blocks gradient tracking, and unlike `torch.no_grad()` it cannot simply be overridden with `torch.enable_grad()`: tensors created under inference mode cannot later participate in autograd
- Training must therefore run outside inference mode (or on cloned tensors), with gradients enabled via `torch.enable_grad()`
- CUDA graphs are incompatible with training, so use eager mode

### 3. Memory Sharing
- Model weights must be shared between the vLLM and training processes
- Memory mapping or zero-copy IPC
- Tensor-parallelism consistency (if using TP)

### 4. APOLLO-Mini Implementation
- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with the existing optimizer infrastructure
## Next Steps

1. **Test vLLM pause/resume**: Verify the API works and measure overhead
2. **Implement weight sharing**: Memory-map model weights between processes
3. **Build training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: Run a training job with real conversation data
5. **Optimize**: Measure memory usage, training time, and inference impact
## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`