apollo-mini training system: initial implementation
Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC shared weight memory between vLLM and the training process:

- apollo_mini.py: rank-1 optimizer (SGD memory, AdamW quality)
- apollo_worker.py: HTTP daemon coordinating training with vLLM
- weight_mapping.py: vLLM merged → HF separate layout (zero-copy views)
- training_example.py: tokenization with chat template
- export_weights.py: CUDA IPC handle export from vLLM
- train.py: standalone training script (alternative to daemon)
- DESIGN.md: architecture and protocol documentation

Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200). Apollo-Mini rank-1 projection + scaling + in-place update confirmed.

Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>
Parent: 13453606ae
Commit: c5d7d8cb5d
7 changed files with 1484 additions and 0 deletions
training/DESIGN.md: 197 lines (new file)
# Apollo Mini Training System Design

## Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.

## Architecture

### Components
1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM

2. **Training Signal Agent** (to be built)
   - Runs online, like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)

3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process
### Communication Protocol

```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}
Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints
Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
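A minimal Python client for this protocol might look like the following sketch. `build_train_request` and `submit` are illustrative names (not part of the shipped code); the payload shape is taken directly from the examples above.

```python
import json
import urllib.request

def build_train_request(samples, learning_rate=1e-5, max_steps=100):
    """Assemble a POST /train payload matching the protocol above.

    `samples` is a list of {"input", "expected_output", "rationale"} dicts.
    """
    return {
        "training_data": {
            "samples": samples,
            "config": {
                "learning_rate": learning_rate,
                "max_steps": max_steps,
            },
        }
    }

def submit(base_url, payload):
    """POST the job and return the parsed {job_id, status, message} reply."""
    req = urllib.request.Request(
        base_url + "/train",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

After submitting, the caller polls `GET /status/{job_id}` until the status reaches `completed`.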
## Training Pipeline

### 1. Signal Detection
- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds a training dataset with input/expected_output/rationale

### 2. Training Request
- Agent sends POST /train with training samples
- Apollo Worker accepts the job and begins execution

### 3. vLLM Pause
- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available

### 4. Model Loading & Training
- Load model weights (shared with vLLM via memory mapping)
- Enable gradients: `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply the scaled updates to the full gradient
- Track loss history
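The inner-loop steps above can be sketched for a single 2-D weight as follows. This is an illustrative simplification of the APOLLO-Mini update (bias correction and weight decay omitted; `apollo_mini_step` and the state layout are names invented for this sketch, not the `apollo_mini.py` API):

```python
import torch

def apollo_mini_step(param, grad, state, lr=1e-5, betas=(0.9, 0.999), eps=1e-8):
    """One APOLLO-Mini-style update for a 2-D weight (illustrative sketch).

    The gradient is projected to rank 1 with a fixed random vector, AdamW-style
    moments live only in that tiny subspace, and a tensor-wise scale derived
    from the projected update is applied to the raw gradient.
    """
    m, n = grad.shape
    if "proj" not in state:
        # fixed random rank-1 projection (not SVD), kept for the whole run
        g = torch.Generator().manual_seed(0)
        state["proj"] = torch.randn(m, 1, generator=g) / m ** 0.5
        state["step"] = 0
        state["exp_avg"] = torch.zeros(1, n)
        state["exp_avg_sq"] = torch.zeros(1, n)
    state["step"] += 1
    r = state["proj"].T @ grad                       # projected gradient, shape (1, n)
    state["exp_avg"].mul_(betas[0]).add_(r, alpha=1 - betas[0])
    state["exp_avg_sq"].mul_(betas[1]).addcmul_(r, r, value=1 - betas[1])
    upd = state["exp_avg"] / (state["exp_avg_sq"].sqrt() + eps)
    # tensor-wise scaling: how strongly the moments rescale the projected gradient
    scale = upd.norm() / (r.norm() + eps)
    param.add_(grad, alpha=-lr * scale.item())       # in-place update on the full weight
    return scale.item()
```

The optimizer state per tensor is only the projection vector plus two rank-1 moment rows, which is where the near-zero memory footprint comes from.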
### 5. Checkpoint Saving
- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory

### 6. vLLM Resume
- Signal vLLM to resume inference
- KV cache is rebuilt as new requests arrive
- Updated weights are now active in inference
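The full pause → train → checkpoint → resume cycle can be sketched as one guarded function. `pause_generation`/`resume_generation` are placeholder names (the real vLLM hooks still need testing, per Technical Challenges), so treat this as structure rather than a working integration:

```python
def run_training_cycle(engine, train_fn, checkpoint_fn):
    """Pause inference, run one training job, checkpoint, then resume.

    `engine` is assumed to expose pause_generation()/resume_generation();
    those names are placeholders until the vLLM API is verified.
    """
    engine.pause_generation()               # freeze in-flight requests, free KV cache
    try:
        loss_history = train_fn()           # APOLLO-Mini loop over the samples
        return checkpoint_fn(loss_history)  # persist weights + metadata
    finally:
        engine.resume_generation()          # always restart serving, even on failure
```

The `try/finally` matters: a failed training step must never leave inference paused.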
## Memory Management

### APOLLO-Mini Advantages
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: only for the current batch (not the full model)
- **Activation memory**: only for the current training step
- **Total overhead**: ~55 GB for full fine-tuning, much less for LoRA

### vLLM Memory Reclamation
- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights

### Strategy
1. **LoRA + APOLLO-Mini**: train only adapter parameters (~100 MB for rank 16)
2. **Time-multiplexed**: pause inference, train, resume
3. **Nightly checkpoints**: save full model state overnight when inference load is low
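As a sanity check on the ~55 GB figure above (and on the checkpoint `size` in the protocol examples), full weights in 16-bit precision for a model of this class land right around that number, assuming roughly 27.5B parameters:

```python
params = 27.5e9          # approximate parameter count (assumption)
bytes_per_param = 2      # bf16 / fp16
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB")  # 55 GB
```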
## Implementation Phases

### Phase 1: Prototype (Current)
- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with a small dataset

### Phase 2: Integration
- [ ] Connect to the actual Qwen3.5-27B model
- [ ] Implement the vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations

### Phase 3: Production
- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints
## Technical Challenges

### 1. vLLM Pause/Resume
- vLLM's `pause_generation()` API needs testing
- In-flight request handling during the pause
- KV cache invalidation strategy

### 2. Gradient Computation
- `torch.inference_mode()` blocks gradient tracking
- Training must run under `torch.enable_grad()`
- CUDA graphs are incompatible with training (use eager mode)
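A minimal shape of that override, with a generic `model`/`loss_fn` standing in for the real training step (a sketch only; how this interacts with tensors created under `inference_mode` still needs validation on the actual setup):

```python
import torch

def training_step(model, loss_fn, inputs, targets):
    """Run one supervised step with autograd explicitly enabled."""
    with torch.enable_grad():   # override any surrounding no-grad context
        model.train()
        loss = loss_fn(model(inputs), targets)
        loss.backward()         # gradients now flow into model parameters
    return loss.detach()
```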
### 3. Memory Sharing
- Model weights must be shared between vLLM and the training process
- Memory mapping or zero-copy IPC
- Tensor-parallelism consistency (if using TP)

### 4. APOLLO-Mini Implementation
- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with existing optimizer infrastructure
## Next Steps

1. **Test vLLM pause/resume**: verify the API works and measure overhead
2. **Implement weight sharing**: memory-map model weights between processes
3. **Build the training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: run a training job with real conversation data
5. **Optimize**: measure memory usage, training time, and inference impact

## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`