# Apollo Mini Training System Design

## Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while preserving inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (reported as kilobytes for a 7B model), combined with LoRA adapters, makes the memory overhead small enough to fit within the KV-cache space that vLLM reclaims while paused.

## Architecture

### Components

1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages the vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM

2. **Training Signal Agent** (to be built)
   - Runs online, like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from the Apollo Worker
   - Runs on Moria (separate from the B200)

3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with the training process

### Communication Protocol

```
POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}
Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints
Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
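
For concreteness, the `POST /train` call above can be issued from the signal agent with a short stdlib client. This is a sketch: the worker URL/port and the function names are placeholders, not part of the existing code.

```python
import json
import urllib.request

# Hypothetical worker endpoint; the real host/port depend on deployment.
APOLLO_WORKER_URL = "http://localhost:8500"

def build_train_request(samples, learning_rate=1e-5, max_steps=100):
    """Assemble a /train payload matching the protocol above."""
    return {
        "training_data": {
            "samples": samples,
            "config": {
                "learning_rate": learning_rate,
                "max_steps": max_steps,
            },
        }
    }

def submit_training_job(samples, **config):
    """POST the payload to the Apollo Worker and return the parsed response."""
    payload = json.dumps(build_train_request(samples, **config)).encode()
    req = urllib.request.Request(
        f"{APOLLO_WORKER_URL}/train",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The returned `job_id` can then be fed to `GET /status/{job_id}` to track progress.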

## Training Pipeline

### 1. Signal Detection
- The training signal agent monitors conversation logs
- It identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- It builds a training dataset of input/expected_output/rationale triples

### 2. Training Request
- The agent sends POST /train with training samples
- The Apollo Worker accepts the job and begins execution
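
After submitting, the agent needs to wait for the job to finish before acting on the checkpoint. A minimal polling loop over `GET /status/{job_id}`, with the fetch function injected so the loop itself has no hard-coded worker address (the callable shape is this sketch's assumption):

```python
import time

def wait_for_job(job_id, fetch, poll_s=5.0, max_polls=720):
    """Poll the worker's job status until it finishes.

    `fetch` is any callable mapping a job_id to the parsed /status response
    (e.g. an HTTP GET against the Apollo Worker); it is injected here so the
    loop can be tested without a live server.
    """
    for _ in range(max_polls):
        status = fetch(job_id)
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"job {job_id} still running after {max_polls} polls")
```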
### 3. vLLM Pause
- The Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available
### 4. Model Loading & Training
- Load model weights (shared with vLLM via memory mapping)
- Enable gradients with `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply the scaled update to the full gradient
- Track loss history
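
The projected-space update in the loop above can be sketched as follows. This is an illustrative NumPy rendition of an APOLLO-Mini-style step (real training would operate on PyTorch tensors), not the paper's reference code; names and state layout are this sketch's own.

```python
import numpy as np

rng = np.random.default_rng(0)

def apollo_mini_step(grad, state, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style update for a single 2-D gradient.

    Adam-style moments live only in the rank-1 projected space, so the
    optimizer state per (m, n) tensor is two (m, 1) vectors instead of two
    full (m, n) matrices.
    """
    m_rows, n_cols = grad.shape
    if "proj" not in state:
        # Fixed random rank-1 projection (no SVD), kept for the whole run.
        state["proj"] = rng.standard_normal((n_cols, 1)) / np.sqrt(n_cols)
        state["m"] = np.zeros((m_rows, 1))
        state["v"] = np.zeros((m_rows, 1))
        state["t"] = 0
    state["t"] += 1

    r = grad @ state["proj"]                              # project to rank 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * r     # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * r**2  # second moment
    m_hat = state["m"] / (1 - beta1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    r_adapted = m_hat / (np.sqrt(v_hat) + eps)

    # Tensor-wise scaling factor: how strongly Adam would rescale the
    # projected gradient, applied here to the full-rank gradient.
    scale = np.linalg.norm(r_adapted) / (np.linalg.norm(r) + eps)
    return -lr * scale * grad
```

The optimizer state per tensor is two (m, 1) vectors plus the fixed projection, which is where the "near-zero optimizer state" claim comes from.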
### 5. Checkpoint Saving
- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory
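
The metadata half of this step could look like the sketch below. The fields mirror the `/status` and `/checkpoints` responses above; the .json-beside-.pt convention is an assumption of this sketch, not an existing layout.

```python
import json
import time
from pathlib import Path

def save_checkpoint_metadata(ckpt_dir, job_id, loss_history, n_samples):
    """Write training metadata beside the checkpoint .pt file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    meta = {
        "job_id": job_id,
        "training_samples": n_samples,
        "loss_history": loss_history,
        "checkpoint_path": str(ckpt_dir / f"checkpoint_{job_id}.pt"),
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    meta_path = ckpt_dir / f"checkpoint_{job_id}.json"
    meta_path.write_text(json.dumps(meta, indent=2))
    return meta_path
```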
### 6. vLLM Resume
- Signal vLLM to resume inference
- The KV cache is rebuilt as new requests arrive
- The updated weights are now active for inference
## Memory Management

### APOLLO-Mini Advantages
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
- **Gradient memory**: only for the current batch, not the full model
- **Activation memory**: only for the current training step
- **Total overhead**: ~55GB for full fine-tuning, much less for LoRA

### vLLM Memory Reclamation
- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights
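
A back-of-the-envelope for why the reclaimed KV cache is so large: the cache holds one K and one V tensor per layer per token. The dimensions below are placeholders for illustration, not the real Qwen3.5-27B config.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per=2):
    """Rough KV-cache footprint: one K and one V tensor per layer, in fp16/bf16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per

# Placeholder dimensions (NOT the real Qwen3.5-27B config): 60 layers,
# 8 KV heads of dim 128, 8192-token context, 16 concurrent sequences.
print(kv_cache_bytes(60, 8, 128, 8192, 16) / 1e9, "GB")  # ~32 GB
```

Tens of gigabytes at moderate batch sizes, which is the headroom the training step inherits when inference pauses.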
### Strategy

1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100MB for rank-16)
2. **Time-multiplexed**: Pause inference, train, resume
3. **Nightly checkpoints**: Save full model state overnight when inference load is low
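
The ~100MB rank-16 figure follows from LoRA's parameter count: each adapted (m, n) weight gains an A (m, r) and a B (r, n) matrix. The layer geometry below is a placeholder, not the actual Qwen3.5-27B config.

```python
def lora_adapter_bytes(layer_shapes, rank=16, bytes_per_param=2):
    """LoRA adapter memory: each adapted (m, n) weight gains A (m, r) and
    B (r, n), i.e. rank * (m + n) extra parameters."""
    return sum(rank * (m + n) * bytes_per_param for m, n in layer_shapes)

# Placeholder geometry (NOT the real Qwen3.5-27B config): 60 blocks,
# adapting four 5120 x 5120 attention projections per block.
shapes = [(5120, 5120)] * 4 * 60
print(lora_adapter_bytes(shapes) / 1e6, "MB")  # ~79 MB, same order as ~100MB
```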
## Implementation Phases

### Phase 1: Prototype (Current)
- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with small dataset

### Phase 2: Integration
- [ ] Connect to the actual Qwen3.5-27B model
- [ ] Implement the vLLM pause/resume API
- [ ] Memory mapping for weight sharing
- [ ] Training signal agent MVP
- [ ] End-to-end test with real conversations

### Phase 3: Production
- [ ] APOLLO-Mini implementation (rank-1 projection)
- [ ] LoRA adapter integration
- [ ] Nightly checkpoint scheduling
- [ ] Training metrics and monitoring
- [ ] Rollback mechanism for bad checkpoints
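
The Phase 3 rollback item could be as simple as re-pointing a `current` symlink at the previous checkpoint. A sketch; the symlink-beside-.pt layout is a hypothetical convention, not something the worker implements today:

```python
from pathlib import Path

def rollback(ckpt_dir, current_link="current"):
    """Re-point the `current` symlink at the newest checkpoint other than
    the one it targets now (the presumed-bad one)."""
    ckpt_dir = Path(ckpt_dir)
    link = ckpt_dir / current_link
    bad = link.resolve() if link.exists() else None
    candidates = sorted(
        (p for p in ckpt_dir.glob("checkpoint_*.pt") if p.resolve() != bad),
        key=lambda p: p.stat().st_mtime,
    )
    if not candidates:
        raise FileNotFoundError("no earlier checkpoint to roll back to")
    if link.is_symlink() or link.exists():
        link.unlink()
    link.symlink_to(candidates[-1].name)
    return link
```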
## Technical Challenges

### 1. vLLM Pause/Resume
- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy

### 2. Gradient Computation
- `torch.inference_mode()` blocks gradient tracking, and unlike `torch.no_grad()` it cannot simply be overridden with `torch.enable_grad()`: tensors created under inference mode cannot later participate in autograd
- Training must therefore run outside inference mode (or on cloned tensors), with gradients enabled via `torch.enable_grad()`
- CUDA graphs are incompatible with training, so use eager mode

### 3. Memory Sharing
- Model weights must be shared between the vLLM and training processes
- Memory mapping or zero-copy IPC
- Tensor-parallelism consistency (if using TP)

### 4. APOLLO-Mini Implementation
- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with the existing optimizer infrastructure
## Next Steps

1. **Test vLLM pause/resume**: Verify the API works and measure overhead
2. **Implement weight sharing**: Memory-map model weights between processes
3. **Build training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: Run a training job with real conversation data
5. **Optimize**: Measure memory usage, training time, and inference impact
## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`