# Apollo Mini Training System Design

## Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model), combined with LoRA adapters, keeps the memory overhead small enough to fit within the KV cache space reclaimed from vLLM.

## Architecture

### Components

1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM

2. **Training Signal Agent** (to be built)
   - Runs online like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)

3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process

### Communication Protocol

```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints

Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```

## Training Pipeline
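End to end, the numbered stages that follow can be sketched as a single orchestration routine. This is an illustrative sketch only: `pause_inference`, `resume_inference`, `train_step`, and `save_checkpoint` are hypothetical hooks, not existing vLLM or Apollo Worker APIs — the real worker would wire them to the vLLM pause/resume integration and the APOLLO-Mini loop.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrainJob:
    """Mirrors the job metadata returned by GET /status/{job_id}."""
    job_id: str
    samples: List[dict]
    loss_history: List[float] = field(default_factory=list)
    status: str = "accepted"


def run_training_job(
    job: TrainJob,
    pause_inference: Callable[[], None],
    resume_inference: Callable[[], None],
    train_step: Callable[[dict], float],
    save_checkpoint: Callable[[TrainJob], str],
) -> TrainJob:
    """Run one job through the pipeline stages described below."""
    pause_inference()                  # stage 3: free KV-cache memory for training
    try:
        job.status = "running"
        for sample in job.samples:     # stage 4: one training step per sample
            job.loss_history.append(train_step(sample))
        save_checkpoint(job)           # stage 5: persist weights + metadata
        job.status = "completed"
    except Exception:
        job.status = "failed"
        raise
    finally:
        resume_inference()             # stage 6: inference resumes even on failure
    return job
```

Resuming in a `finally` block keeps inference available even when a training job fails mid-run.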
### 1. Signal Detection

- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could have been done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds a training dataset with input/expected_output/rationale

### 2. Training Request

- Agent sends POST /train with training samples
- Apollo Worker accepts the job and begins execution

### 3. vLLM Pause

- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available

### 4. Model Loading & Training

- Load model weights (shared with vLLM via memory mapping)
- Enable gradients: `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute the tensor-wise scaling factor
  - Apply updates to the full gradient
- Track loss history

### 5. Checkpoint Saving

- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory

### 6. vLLM Resume

- Signal vLLM to resume inference
- KV cache is rebuilt as new requests arrive
- Updated weights are now active in inference

## Memory Management

### APOLLO-Mini Advantages

- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: only for the current batch (not the full model)
- **Activation memory**: only for the current training step
- **Total overhead**: ~55 GB for full fine-tuning, much less for LoRA

### vLLM Memory Reclamation

- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights

### Strategy

1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100 MB for rank 16)
2. **Time-multiplexed**: Pause inference, train, resume
3. **Nightly checkpoints**: Save full model state overnight when inference load is low

## Implementation Phases

### Phase 1: Prototype (Current)

- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with a small dataset

### Phase 2: Integration

- [ ] Connect to the actual Qwen3.5-27B model
- [ ] Implement the vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations

### Phase 3: Production

- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints

## Technical Challenges

### 1. vLLM Pause/Resume

- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy

### 2. Gradient Computation

- `torch.inference_mode()` blocks gradients
- Must override with `torch.enable_grad()` during training
- CUDA graphs are incompatible with training (use eager mode)

### 3. Memory Sharing

- Model weights must be shared between vLLM and the training process
- Memory mapping or zero-copy IPC
- Tensor parallelism consistency (if using TP)

### 4. APOLLO-Mini Implementation

- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with the existing optimizer infrastructure

## Next Steps

1. **Test vLLM pause/resume**: Verify the API works and measure overhead
2. **Implement weight sharing**: Memory-map model weights between processes
3. **Build the training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: Run a training job with real conversation data
5. **Optimize**: Measure memory usage, training time, and inference impact

## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`
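For concreteness, the four APOLLO-Mini steps named in the training pipeline (project into a rank-1 subspace, update moments in the projected space, compute a tensor-wise scaling factor, apply it to the full gradient) can be sketched in plain Python. This is a hedged illustration of the rank-1 idea only — names and numeric details are assumptions, not the paper's reference implementation, and a real version would operate on `torch` tensors per weight matrix.

```python
import math
import random


def _norm(xs):
    return math.sqrt(sum(x * x for x in xs))


def apollo_mini_step(param, grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style update on an n x m weight (lists of rows).

    Adam-style moments are tracked only in the rank-1 projected space, so
    the optimizer state per tensor is one vector, not two full matrices.
    Returns the tensor-wise scaling factor; mutates param in place.
    """
    n, m = len(grad), len(grad[0])
    if "proj" not in state:
        rng = random.Random(0)
        # Fixed random rank-1 projection (no SVD), reused for the whole run.
        state["proj"] = [rng.gauss(0.0, 1.0) / math.sqrt(n) for _ in range(n)]
        state["m"] = [0.0] * m
        state["v"] = [0.0] * m
        state["t"] = 0

    p = state["proj"]
    # Step 1: project the full gradient down to m numbers.
    r = [sum(p[i] * grad[i][j] for i in range(n)) for j in range(m)]
    # Step 2: update first/second moments in the projected space.
    state["t"] += 1
    t = state["t"]
    for j in range(m):
        state["m"][j] = beta1 * state["m"][j] + (1 - beta1) * r[j]
        state["v"][j] = beta2 * state["v"][j] + (1 - beta2) * r[j] ** 2
    r_adapt = [
        (state["m"][j] / (1 - beta1**t))
        / (math.sqrt(state["v"][j] / (1 - beta2**t)) + eps)
        for j in range(m)
    ]
    # Step 3: tensor-wise scaling factor — Adam's rescaling, estimated from rank 1.
    scale = _norm(r_adapt) / (_norm(r) + eps)
    # Step 4: apply the scaled update to the full raw gradient.
    for i in range(n):
        for j in range(m):
            param[i][j] -= lr * scale * grad[i][j]
    return scale
```

Because the moments live only in the projected space, the per-tensor optimizer state is O(m) instead of O(n·m) — the source of the "kilobytes, not gigabytes" claim above.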