
# Apollo Mini Training System Design
## Overview
This system enables continuous fine-tuning of the Qwen3.5-27B model while preserving inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (reported as kilobytes for a 7B model) combined with LoRA adapters keeps the memory overhead small enough to fit within the KV cache space reclaimed from vLLM during a pause.
## Architecture
### Components
1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM
2. **Training Signal Agent** (to be built)
   - Runs online, like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)
3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process
### Communication Protocol
```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints

Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
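A client for this protocol can be sketched with the standard library. The request body follows the schema above; the worker URL is an assumption, not a documented address:

```python
import json
import urllib.request

APOLLO_WORKER_URL = "http://localhost:8001"  # assumed address for the worker daemon


def build_train_payload(samples, learning_rate=1e-5, max_steps=100):
    """Assemble the /train request body from sample dicts."""
    return {
        "training_data": {
            "samples": samples,
            "config": {"learning_rate": learning_rate, "max_steps": max_steps},
        }
    }


def request_training(samples):
    """POST a training job and return the worker's response (job_id, status)."""
    req = urllib.request.Request(
        f"{APOLLO_WORKER_URL}/train",
        data=json.dumps(build_train_payload(samples)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```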
## Training Pipeline
### 1. Signal Detection
- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds training dataset with input/expected_output/rationale
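The "knowledge gap" heuristic could look like the filter below; the `memory_hits` field and the threshold are illustrative, not an existing schema:

```python
def select_training_windows(windows, min_memory_hits=3):
    """Pick conversation windows whose memory-access frequency suggests a
    knowledge gap worth training on.

    Each window is a dict with illustrative keys: "text" (the conversation
    excerpt) and "memory_hits" (how often memories were fetched for it).
    """
    return [w for w in windows if w["memory_hits"] >= min_memory_hits]
```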
### 2. Training Request
- Agent sends POST /train with training samples
- Apollo Worker accepts job and begins execution
### 3. vLLM Pause
- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from KV cache becomes available
### 4. Model Loading & Training
- Load model weights (shared with vLLM via memory mapping)
- Enable gradients: `torch.enable_grad()`
- Run APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply the scaled update to the full gradient
- Track loss history
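The loop above can be sketched for a single 2-D parameter. This is a simplified reading of APOLLO-Mini (fixed random rank-1 projection, Adam-style moments in the projected space, tensor-wise norm-ratio scaling), not the paper's exact algorithm:

```python
import torch


def apollo_mini_update(grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style step for a single 2-D gradient (simplified sketch)."""
    if "proj" not in state:
        # Fixed random rank-1 projection (no SVD), reused across steps.
        state["proj"] = torch.randn(grad.shape[1], 1) / grad.shape[1] ** 0.5
        state["m"] = torch.zeros(grad.shape[0], 1)
        state["v"] = torch.zeros(grad.shape[0], 1)
        state["t"] = 0
    state["t"] += 1
    g_proj = grad @ state["proj"]                   # gradient in the rank-1 subspace
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_proj
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_proj ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected moments
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    u = m_hat / (v_hat.sqrt() + eps)                # Adam-style update, projected space
    scale = u.norm() / (g_proj.norm() + eps)        # tensor-wise scaling factor
    return -lr * scale * grad                       # scaled update on the full gradient
```

Only `proj`, `m`, `v`, and `t` persist between steps, which is why the optimizer state stays tiny compared to AdamW's full-shape moments.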
### 5. Checkpoint Saving
- Save model state dict
- Record training metadata (samples, loss history, job ID)
- Store in checkpoint directory
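A minimal checkpoint writer matching the metadata the `/status` endpoint reports; the directory argument and field names mirror the protocol above, the function name is illustrative:

```python
import time
from pathlib import Path

import torch


def save_checkpoint(model_state, job_id, loss_history, num_samples,
                    ckpt_dir=Path("checkpoints")):
    """Save weights plus the metadata the /status endpoint reports."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"checkpoint_{job_id}.pt"
    torch.save(
        {
            "model_state": model_state,
            "metadata": {
                "job_id": job_id,
                "training_samples": num_samples,
                "loss_history": loss_history,
                "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
            },
        },
        path,
    )
    return path
```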
### 6. vLLM Resume
- Signal vLLM to resume inference
- KV cache rebuilt as new requests arrive
- Updated weights now active in inference
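Steps 3-6 suggest wrapping the pause in a context manager so inference always resumes, even when training fails. `pause_generation()`/`resume_generation()` stand in for whichever vLLM control calls testing settles on:

```python
from contextlib import contextmanager


@contextmanager
def paused_inference(engine):
    """Pause vLLM for the duration of a training job, then always resume.

    `engine` is assumed to expose pause_generation()/resume_generation();
    the real vLLM API still needs to be verified.
    """
    engine.pause_generation()       # step 3: freeze requests, free KV cache
    try:
        yield                       # steps 4-5: train and checkpoint here
    finally:
        engine.resume_generation()  # step 6: resume even if training raised
```

The `finally` block is the important design choice: a crashed training job must never leave inference paused.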
## Memory Management
### APOLLO-Mini Advantages
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: Only for current batch (not full model)
- **Activation memory**: Only for current training step
- **Total overhead**: ~55GB for full fine-tuning, much less for LoRA
### vLLM Memory Reclamation
- KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use reclaimed space without evicting model weights
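Back-of-envelope arithmetic for the reclaimed space, assuming ~180 GB of usable HBM on the B200 and bf16 weights (2 bytes/param, so 27B params is ~54 GB, consistent with the ~55 GB figure above):

```python
def reclaimable_kv_gb(total_gb, weight_params_billion,
                      bytes_per_param=2, gpu_memory_utilization=0.9):
    """GPU memory (GB) freed by pausing inference: vLLM's allocation minus
    the model weights, which stay resident. Activation scratch is ignored."""
    weights_gb = weight_params_billion * bytes_per_param
    return total_gb * gpu_memory_utilization - weights_gb


# Assumed: ~180 GB usable HBM, 27B params in bf16 (54 GB resident).
print(reclaimable_kv_gb(180, 27))  # roughly 108 GB available for training
```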
### Strategy
1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100MB for rank-16)
2. **Time-multiplexed**: Pause inference, train, resume
3. **Nightly checkpoints**: Save full model state overnight when inference load is low
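The ~100 MB rank-16 estimate can be sanity-checked: LoRA adds `rank * (d_in + d_out)` parameters per adapted matrix (an `A` of `d_in x rank` and a `B` of `rank x d_out`). The layer shapes below are hypothetical, not Qwen3.5-27B's actual dimensions:

```python
def lora_bytes(layer_shapes, rank=16, bytes_per_param=2):
    """LoRA adapter size: each adapted (d_in, d_out) matrix gains
    A (d_in x rank) and B (rank x d_out), i.e. rank * (d_in + d_out) params."""
    params = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return params * bytes_per_param


# Hypothetical: 48 blocks, q/k/v/o projections of 5120 x 5120 each.
shapes = [(5120, 5120)] * (48 * 4)
print(lora_bytes(shapes) / 1e6, "MB")
```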
## Implementation Phases
### Phase 1: Prototype (Current)
- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with small dataset
### Phase 2: Integration
- [ ] Connect to actual Qwen3.5-27B model
- [ ] Implement vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations
### Phase 3: Production
- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints
## Technical Challenges
### 1. vLLM Pause/Resume
- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy
### 2. Gradient Computation
- `torch.inference_mode()` disables autograd, and tensors created under it remain inference tensors
- `torch.enable_grad()` alone is not enough: inference tensors must be cloned (or the weights re-registered) before they can require gradients
- CUDA graphs are incompatible with training (use eager mode)
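A small demonstration of the gradient issue, assuming a recent PyTorch: re-enabling grad mode does not convert tensors that were created under `inference_mode`; they must be cloned first:

```python
import torch

with torch.inference_mode():
    w = torch.ones(3)  # created as an inference tensor

# Grad mode can be re-enabled, but existing inference tensors are not
# converted; clone them to get tensors that can require gradients.
with torch.enable_grad():
    trainable = w.clone().requires_grad_(True)

print(w.is_inference(), trainable.is_inference())  # True False
```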
### 3. Memory Sharing
- Model weights must be shared between vLLM and training process
- Memory mapping or zero-copy IPC
- Tensor parallelism consistency (if using TP)
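The memory-mapping idea can be illustrated with `numpy.memmap` on a toy weights file; the real system would map the actual model shards, and sharing GPU-resident weights needs more machinery than this CPU-side sketch:

```python
import os
import tempfile

import numpy as np

# Toy stand-in for a weights file; the real system would map actual shards.
path = os.path.join(tempfile.mkdtemp(), "weights.bin")

writer = np.memmap(path, dtype=np.float16, mode="w+", shape=(4, 4))
writer[:] = 1.0
writer.flush()

# A second process opening the same file shares the OS page cache, so only
# one physical copy of the weights is resident.
reader = np.memmap(path, dtype=np.float16, mode="r", shape=(4, 4))
print(float(reader[0, 0]))  # 1.0
```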
### 4. APOLLO-Mini Implementation
- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with existing optimizer infrastructure
## Next Steps
1. **Test vLLM pause/resume**: Verify API works and measure overhead
2. **Implement weight sharing**: Memory map model weights between processes
3. **Build training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: Run training job with real conversation data
5. **Optimize**: Measure memory usage, training time, inference impact
## References
- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`