Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC shared weight memory between vLLM and the training process:

- `apollo_mini.py`: rank-1 optimizer (SGD memory, AdamW quality)
- `apollo_worker.py`: HTTP daemon coordinating training with vLLM
- `weight_mapping.py`: vLLM merged → HF separate layout (zero-copy views)
- `training_example.py`: tokenization with chat template
- `export_weights.py`: CUDA IPC handle export from vLLM
- `train.py`: standalone training script (alternative to daemon)
- `DESIGN.md`: architecture and protocol documentation

Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200). Apollo-Mini rank-1 projection + scaling + in-place update confirmed.

Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>
# Apollo Mini Training System Design

## Overview
This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.
## Architecture

### Components
- **Apollo Worker Daemon** (`apollo_worker.py`)
  - Listens over HTTP/HTTPS for training requests
  - Manages the vLLM pause/resume cycle
  - Executes APOLLO-Mini training with `torch.enable_grad()`
  - Saves checkpoints and training metadata
  - Runs on the B200 server alongside vLLM
- **Training Signal Agent** (to be built)
  - Runs online like surface-observe
  - Analyzes recent conversation windows
  - Identifies improvement opportunities
  - Requests training from the Apollo Worker
  - Runs on Moria (separate from the B200)
- **vLLM Inference Engine**
  - Continues serving during non-training periods
  - Pauses during training steps
  - Shares GPU memory with the training process
### Communication Protocol
#### POST /train

```json
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}
```
Response:

```json
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}
```
#### GET /status/{job_id}

Response:

```json
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}
```
#### GET /checkpoints

Response:

```json
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
```
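A minimal client for this protocol might look like the following sketch; the daemon address and the `build_train_request`/`submit` helpers are illustrative names, not part of the shipped code:

```python
import json
import urllib.request

APOLLO_URL = "http://localhost:8000"  # assumed daemon address

def build_train_request(samples, learning_rate=1e-5, max_steps=100):
    """Assemble a POST /train payload matching the schema above."""
    return {
        "training_data": {
            "samples": samples,
            "config": {"learning_rate": learning_rate, "max_steps": max_steps},
        }
    }

def submit(payload):
    """POST the payload; returns the parsed {"job_id", "status", ...} reply."""
    req = urllib.request.Request(
        f"{APOLLO_URL}/train",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_train_request([{
    "input": "conversation context",
    "expected_output": "better response",
    "rationale": "why this is better",
}])
# submit(payload) returns a job_id, which is then polled via GET /status/{job_id}
```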
## Training Pipeline
### 1. Signal Detection

- Training signal agent monitors conversation logs
- Identifies patterns where PoC could improve:
  - Responses that needed memories to get right
  - Things that could be done better with more time/context
  - High-frequency memory accesses indicating knowledge gaps
- Builds a training dataset with input/expected_output/rationale
### 2. Training Request

- Agent sends `POST /train` with training samples
- Apollo Worker accepts the job and begins execution
### 3. vLLM Pause

- Apollo Worker signals vLLM to pause inference
- vLLM freezes in-flight requests
- GPU memory freed from the KV cache becomes available
### 4. Model Loading & Training

- Load model weights (shared with vLLM via memory mapping)
- Enable gradients with `torch.enable_grad()`
- Run the APOLLO-Mini training loop:
  - Project gradients into a rank-1 subspace
  - Update moments in the projected space
  - Compute a tensor-wise scaling factor
  - Apply scaled updates to the full gradient
- Track loss history
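The loop above can be sketched for a single 2-D weight. This is a hedged reconstruction from the description here (fixed random rank-1 projection, AdamW-style moments in the projected space, a tensor-wise norm-ratio scaling applied to the raw gradient), not the code shipped in `apollo_mini.py`:

```python
import torch

def apollo_mini_step(param, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One APOLLO-Mini-style update for a single 2-D weight (sketch).

    For brevity the per-parameter state lives on the param object; a real
    optimizer keeps it in its state dict.
    """
    g = param.grad  # (out, in)
    if not hasattr(param, "_proj"):
        # fixed random rank-1 projection -- never recomputed via SVD
        param._proj = torch.randn(g.shape[0], 1, device=g.device) / g.shape[0] ** 0.5
        param._m = torch.zeros(1, g.shape[1], device=g.device)
        param._v = torch.zeros(1, g.shape[1], device=g.device)

    r = param._proj.T @ g                             # project to rank-1 subspace
    param._m.mul_(beta1).add_(r, alpha=1 - beta1)     # moments in projected space
    param._v.mul_(beta2).addcmul_(r, r, value=1 - beta2)
    update_r = param._m / (param._v.sqrt() + eps)

    # tensor-wise scaling: how strongly Adam would rescale the projected grad
    scale = update_r.norm() / (r.norm() + eps)
    param.data.add_(g, alpha=-lr * scale.item())      # scaled raw grad, in place
```

The optimizer state is two `(1, in)` moment vectors plus one `(out, 1)` projection per weight, which is where the near-zero memory footprint comes from.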
### 5. Checkpoint Saving

- Save the model state dict
- Record training metadata (samples, loss history, job ID)
- Store in the checkpoint directory
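A checkpoint writer along these lines would produce the files shown in the protocol section; the field names and directory are placeholders, not the shipped implementation:

```python
import os
import time

import torch

def save_checkpoint(model_state, job_id, loss_history, num_samples,
                    ckpt_dir="checkpoints"):
    """Bundle weights and training metadata into one .pt file (sketch)."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"checkpoint_{job_id}.pt")
    torch.save({
        "model_state_dict": model_state,     # tensors
        "job_id": job_id,                    # metadata for GET /status
        "training_samples": num_samples,
        "loss_history": loss_history,
        "saved_at": time.time(),
    }, path)
    return path
```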
### 6. vLLM Resume

- Signal vLLM to resume inference
- KV cache is rebuilt as new requests arrive
- Updated weights are now active in inference
## Memory Management

### APOLLO-Mini Advantages

- Optimizer state: ~kilobytes (vs. gigabytes for AdamW)
- Gradient memory: only for the current batch (not the full model)
- Activation memory: only for the current training step
- Total overhead: ~55 GB for full fine-tuning, much less for LoRA
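The ~55 GB figure can be sanity-checked with back-of-envelope arithmetic, assuming bf16 gradients for a 27B-parameter model (the exact parameter count and the 4 KiB rank-1 state are illustrative guesses):

```python
def fine_tune_overhead_gb(n_params, bytes_per_param=2, rank1_state_bytes=4096):
    """Rough extra memory for a full fine-tuning step: bf16 gradients for
    every parameter plus APOLLO-Mini's near-constant rank-1 optimizer state."""
    return (n_params * bytes_per_param + rank1_state_bytes) / 1e9

apollo_grad_gb = fine_tune_overhead_gb(27e9)  # gradients dominate: ~54 GB
adamw_extra_gb = 27e9 * 8 / 1e9               # fp32 moments AdamW would add
```

Gradients land at roughly 54 GB, matching the estimate above, while AdamW's two fp32 moments would add another ~216 GB on top.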
### vLLM Memory Reclamation

- The KV cache can consume 50-70% of GPU memory during inference
- Pausing inference frees this memory for training
- Training can use the reclaimed space without evicting model weights
### Strategy

- **LoRA + APOLLO-Mini**: train only adapter parameters (~100 MB at rank 16)
- **Time-multiplexed**: pause inference, train, resume
- **Nightly checkpoints**: save full model state overnight when inference load is low
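The ~100 MB rank-16 estimate is roughly reproducible. The layer count, hidden width, and number of adapted matrices below are illustrative guesses, not the real Qwen3.5-27B config:

```python
def lora_params(n_layers, hidden, rank, adapted_per_layer=4):
    """Count LoRA parameters: each adapted (hidden x hidden) projection
    gains an A (hidden x rank) and a B (rank x hidden) matrix."""
    return n_layers * adapted_per_layer * 2 * hidden * rank

n = lora_params(n_layers=64, hidden=5120, rank=16)
adapter_mb = n * 2 / 1e6  # bf16 bytes -> MB; lands near the ~100 MB estimate
```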
## Implementation Phases

### Phase 1: Prototype (Current)

- Apollo Worker daemon skeleton
- vLLM pause/resume integration
- Basic training loop with a placeholder model
- Checkpoint saving/loading
- Test with a small dataset
### Phase 2: Integration

- Connect to the actual Qwen3.5-27B model
- Implement the vLLM pause/resume API
- Memory mapping for weight sharing
- Training signal agent MVP
- End-to-end test with real conversations
### Phase 3: Production

- APOLLO-Mini implementation (rank-1 projection)
- LoRA adapter integration
- Nightly checkpoint scheduling
- Training metrics and monitoring
- Rollback mechanism for bad checkpoints
## Technical Challenges

### 1. vLLM Pause/Resume

- vLLM's `pause_generation()` API needs testing
- In-flight request handling during pause
- KV cache invalidation strategy
### 2. Gradient Computation

- `torch.inference_mode()` blocks gradients, and tensors created under it can never join an autograd graph
- Training must re-enable autograd (e.g. `torch.enable_grad()`) and build its inputs outside inference mode
- CUDA graphs are incompatible with training (use eager mode)
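The interplay of these context managers is easy to verify on CPU: `torch.enable_grad()` reliably re-enters autograd inside `torch.no_grad()`, whereas `torch.inference_mode()` is stricter, since tensors created under it are permanently excluded from autograd:

```python
import torch

x = torch.ones(2, requires_grad=True)

with torch.no_grad():                 # inference-style region: autograd off
    assert not torch.is_grad_enabled()
    with torch.enable_grad():         # re-enabled for the training step
        y = (x * 3).sum()             # graph is recorded here

y.backward()
assert torch.equal(x.grad, torch.full((2,), 3.0))
```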
### 3. Memory Sharing

- Model weights must be shared between vLLM and the training process
- Memory mapping or zero-copy IPC
- Tensor parallelism consistency (if using TP)
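The zero-copy idea behind `weight_mapping.py`'s merged-to-separate mapping can be demonstrated in-process with tensor views (the sizes below are toy values, not Qwen3.5's):

```python
import torch

# vLLM stores Q, K and V merged into one tensor; the HF layout wants them
# separate. Slicing returns views over the same storage, so an in-place
# update through a view changes the merged weight with no copy.
hidden, q_size, kv_size = 8, 8, 4
merged_qkv = torch.randn(q_size + 2 * kv_size, hidden)

q = merged_qkv[:q_size]                          # view, not a copy
k = merged_qkv[q_size:q_size + kv_size]
v = merged_qkv[q_size + kv_size:]

assert q.data_ptr() == merged_qkv.data_ptr()     # q shares the same storage

orig_q = q.clone()
q.add_(1.0)                                      # train through the view...
assert torch.equal(merged_qkv[:q_size], orig_q + 1.0)  # ...the merged weight sees it
```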
### 4. APOLLO-Mini Implementation

- Rank-1 gradient projection
- Fixed random projection matrix (not SVD)
- Tensor-wise scaling factor computation
- Integration with existing optimizer infrastructure
## Next Steps

- **Test vLLM pause/resume**: verify the API works and measure overhead
- **Implement weight sharing**: memory-map model weights between processes
- **Build the training signal agent**: an MVP that identifies improvement opportunities
- **Test end-to-end**: run a training job with real conversation data
- **Optimize**: measure memory usage, training time, and inference impact
## References

- APOLLO-Mini paper: arXiv:2412.05270
- vLLM source: `/tmp/vllm/`
- LeMix (interleaved training/inference): arXiv:2507.21276
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`