# Apollo Mini Training System Design

## Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model), combined with LoRA adapters, keeps the memory overhead small enough to fit within the KV cache space reclaimed from vLLM.

## Architecture

### Components

1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM

2. **Training Signal Agent** (to be built)
   - Runs online like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)

3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process

### Communication Protocol

```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints

Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```

## Training Pipeline
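End to end, the numbered stages that follow can be sketched as a single orchestration routine. This is an illustrative sketch only: `pause_inference`, `resume_inference`, `train_step`, and `save_checkpoint` are hypothetical hooks, not existing vLLM or Apollo Worker APIs — the real worker would wire them to the vLLM pause/resume integration and the APOLLO-Mini loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrainJob:
    """Mirrors the job metadata returned by GET /status/{job_id}."""
    job_id: str
    samples: List[dict]
    loss_history: List[float] = field(default_factory=list)
    status: str = "accepted"


def run_training_job(
    job: TrainJob,
    pause_inference: Callable[[], None],
    resume_inference: Callable[[], None],
    train_step: Callable[[dict], float],
    save_checkpoint: Callable[[TrainJob], str],
) -> TrainJob:
    """Run one job through the pipeline stages described below."""
    pause_inference()                  # stage 3: free KV-cache memory for training
    try:
        job.status = "running"
        for sample in job.samples:     # stage 4: one training step per sample
            job.loss_history.append(train_step(sample))
        save_checkpoint(job)           # stage 5: persist weights + metadata
        job.status = "completed"
    except Exception:
        job.status = "failed"
        raise
    finally:
        resume_inference()             # stage 6: inference resumes even on failure
    return job
```

Resuming in a `finally` block keeps inference available even when a training job fails mid-run.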
### 1. Signal Detection

- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could have been done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds a training dataset with input/expected_output/rationale

### 2. Training Request

- Agent sends POST /train with training samples
- Apollo Worker accepts the job and begins execution

### 3. vLLM Pause

- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available

### 4. Model Loading & Training

- Load model weights (shared with vLLM via memory mapping)
- Enable gradients: `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute the tensor-wise scaling factor
  - Apply updates to the full gradient
- Track loss history

### 5. Checkpoint Saving

- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory

### 6. vLLM Resume

- Signal vLLM to resume inference
- KV cache is rebuilt as new requests arrive
- Updated weights are now active in inference

## Memory Management

### APOLLO-Mini Advantages

- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: only for the current batch (not the full model)
- **Activation memory**: only for the current training step
- **Total overhead**: ~55 GB for full fine-tuning, much less for LoRA

### vLLM Memory Reclamation

- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights

### Strategy

1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100 MB for rank 16)
2. **Time-multiplexed**: Pause inference, train, resume
3. **Nightly checkpoints**: Save full model state overnight when inference load is low

## Implementation Phases

### Phase 1: Prototype (Current)

- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with a small dataset

### Phase 2: Integration

- [ ] Connect to the actual Qwen3.5-27B model
- [ ] Implement the vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations

### Phase 3: Production

- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints

## Technical Challenges

### 1. vLLM Pause/Resume

- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy

### 2. Gradient Computation

- `torch.inference_mode()` blocks gradients
- Must override with `torch.enable_grad()` during training
- CUDA graphs are incompatible with training (use eager mode)

### 3. Memory Sharing

- Model weights must be shared between vLLM and the training process
- Memory mapping or zero-copy IPC
- Tensor parallelism consistency (if using TP)

### 4. APOLLO-Mini Implementation

- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with the existing optimizer infrastructure

## Next Steps

1. **Test vLLM pause/resume**: Verify the API works and measure overhead
2. **Implement weight sharing**: Memory-map model weights between processes
3. **Build the training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: Run a training job with real conversation data
5. **Optimize**: Measure memory usage, training time, and inference impact

## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`
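For concreteness, the four APOLLO-Mini steps named in the training pipeline (project into a rank-1 subspace, update moments in the projected space, compute a tensor-wise scaling factor, apply it to the full gradient) can be sketched in plain Python. This is a hedged illustration of the rank-1 idea only — names and numeric details are assumptions, not the paper's reference implementation, and a real version would operate on `torch` tensors per weight matrix.

```python
import math
import random


def _norm(xs):
    return math.sqrt(sum(x * x for x in xs))


def apollo_mini_step(param, grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style update on an n x m weight (lists of rows).

    Adam-style moments are tracked only in the rank-1 projected space, so
    the optimizer state per tensor is one vector, not two full matrices.
    Returns the tensor-wise scaling factor; mutates param in place.
    """
    n, m = len(grad), len(grad[0])
    if "proj" not in state:
        rng = random.Random(0)
        # Fixed random rank-1 projection (no SVD), reused for the whole run.
        state["proj"] = [rng.gauss(0.0, 1.0) / math.sqrt(n) for _ in range(n)]
        state["m"] = [0.0] * m
        state["v"] = [0.0] * m
        state["t"] = 0

    p = state["proj"]
    # Step 1: project the full gradient down to m numbers.
    r = [sum(p[i] * grad[i][j] for i in range(n)) for j in range(m)]
    # Step 2: update first/second moments in the projected space.
    state["t"] += 1
    t = state["t"]
    for j in range(m):
        state["m"][j] = beta1 * state["m"][j] + (1 - beta1) * r[j]
        state["v"][j] = beta2 * state["v"][j] + (1 - beta2) * r[j] ** 2
    r_adapt = [
        (state["m"][j] / (1 - beta1**t))
        / (math.sqrt(state["v"][j] / (1 - beta2**t)) + eps)
        for j in range(m)
    ]
    # Step 3: tensor-wise scaling factor — Adam's rescaling, estimated from rank 1.
    scale = _norm(r_adapt) / (_norm(r) + eps)
    # Step 4: apply the scaled update to the full raw gradient.
    for i in range(n):
        for j in range(m):
            param[i][j] -= lr * scale * grad[i][j]
    return scale
```

Because the moments live only in the projected space, the per-tensor optimizer state is O(m) instead of O(n·m) — the source of the "kilobytes, not gigabytes" claim above.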