consciousness/training/DESIGN.md
ProofOfConcept c5d7d8cb5d apollo-mini training system: initial implementation
Core components for online fine-tuning of Qwen3.5-27B with CUDA IPC
shared weight memory between vLLM and the training process:

- apollo_mini.py: rank-1 optimizer (SGD memory, AdamW quality)
- apollo_worker.py: HTTP daemon coordinating training with vLLM
- weight_mapping.py: vLLM merged → HF separate layout (zero-copy views)
- training_example.py: tokenization with chat template
- export_weights.py: CUDA IPC handle export from vLLM
- train.py: standalone training script (alternative to daemon)
- DESIGN.md: architecture and protocol documentation

Validated: CUDA IPC autograd works on real Qwen3.5 weights (B200).
Apollo-Mini rank-1 projection + scaling + in-place update confirmed.

Co-Authored-By: Kent Overstreet <kent.overstreet@gmail.com>
2026-03-30 22:02:37 -04:00


Apollo Mini Training System Design

Overview

This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.

Architecture

Components

  1. Apollo Worker Daemon (apollo_worker.py)

    • Listens over HTTP/HTTPS for training requests
    • Manages vLLM pause/resume cycle
    • Executes APOLLO-Mini training with torch.enable_grad()
    • Saves checkpoints and training metadata
    • Runs on the B200 server alongside vLLM
  2. Training Signal Agent (to be built)

    • Runs online like surface-observe
    • Analyzes recent conversation windows
    • Identifies improvement opportunities
    • Requests training from Apollo Worker
    • Runs on Moria (separate from B200)
  3. vLLM Inference Engine

    • Continues serving during non-training periods
    • Pauses during training steps
    • Shares GPU memory with training process

Communication Protocol

POST /train
{
  "training_data": {
    "samples": [
      {
        "input": "conversation context",
        "expected_output": "better response",
        "rationale": "why this is better"
      }
    ],
    "config": {
      "learning_rate": 1e-5,
      "max_steps": 100
    }
  }
}

Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "accepted",
  "message": "Training job started"
}

GET /status/{job_id}
Response:
{
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}

GET /checkpoints
Response:
{
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
}
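The request side of this protocol can be sketched with the standard library alone. The worker's address below is hypothetical (the real host/port are deployment-specific), and only the payload construction follows the schema above:

```python
import json
from urllib import request

APOLLO_WORKER = "http://localhost:9090"  # hypothetical address, not specified in this doc


def build_train_payload(samples, learning_rate=1e-5, max_steps=100):
    """Assemble the POST /train body from the schema above."""
    return {
        "training_data": {
            "samples": samples,
            "config": {"learning_rate": learning_rate, "max_steps": max_steps},
        }
    }


def submit_training_job(samples, **config):
    """POST the job; returns the parsed response ({"job_id": ..., "status": ...})."""
    body = json.dumps(build_train_payload(samples, **config)).encode()
    req = request.Request(
        f"{APOLLO_WORKER}/train",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Polling GET /status/{job_id} until `status` leaves `"accepted"` follows the same pattern.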

Training Pipeline

1. Signal Detection

  • Training signal agent monitors conversation logs
  • Identifies patterns where PoC could improve:
    • Responses that needed memories to get right
    • Things that could be done better with more time/context
    • High-frequency memory accesses indicating knowledge gaps
  • Builds training dataset with input/expected_output/rationale
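The dataset rows the agent emits map directly onto the /train schema; a minimal sketch (field names taken from the protocol section, class name hypothetical):

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TrainingSample:
    """One row of the signal agent's dataset (field names from the /train schema)."""
    input: str             # conversation context
    expected_output: str   # the better response
    rationale: str         # why this response is better

    def to_json(self) -> str:
        return json.dumps(asdict(self))
```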

2. Training Request

  • Agent sends POST /train with training samples
  • Apollo Worker accepts job and begins execution

3. vLLM Pause

  • Apollo Worker signals vLLM to pause inference
  • vLLM freezes in-flight requests
  • GPU memory freed from KV cache becomes available

4. Model Loading & Training

  • Load model weights (shared with vLLM via memory mapping)
  • Enable gradients: torch.enable_grad()
  • Run APOLLO-Mini training loop:
    • Project gradients into rank-1 subspace
    • Update moments in projected space
    • Compute tensor-wise scaling factor
    • Apply updates to full gradient
  • Track loss history
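The four bullets above can be sketched dependency-free for a single flattened gradient tensor: a fixed random rank-1 projection, AdamW-style moments on the scalar projection, a tensor-wise scaling factor, and an in-place update. This is an illustrative reconstruction of the technique, not the real apollo_mini.py; all hyperparameter values are defaults for the sketch only:

```python
import math
import random


class ApolloMiniRank1:
    """Rank-1 APOLLO-Mini sketch for one parameter tensor (flattened to a list)."""

    def __init__(self, dim, lr=1e-5, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
        rng = random.Random(seed)
        # Fixed random projection vector (not SVD), normalized to unit length.
        p = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in p))
        self.proj = [x / norm for x in p]
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first moment, kept only in the projected (scalar) space
        self.v = 0.0  # second moment in the projected space
        self.t = 0

    def step(self, params, grads):
        self.t += 1
        # 1. Project the gradient into the rank-1 subspace.
        r = sum(p * g for p, g in zip(self.proj, grads))
        # 2. Update AdamW-style moments in projected space (with bias correction).
        self.m = self.b1 * self.m + (1 - self.b1) * r
        self.v = self.b2 * self.v + (1 - self.b2) * r * r
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        r_tilde = m_hat / (math.sqrt(v_hat) + self.eps)
        # 3. Tensor-wise scaling factor: adapted vs. raw projected gradient.
        scale = abs(r_tilde) / (abs(r) + self.eps)
        # 4. Apply the scaled raw gradient to the full parameter, in place.
        for i, g in enumerate(grads):
            params[i] -= self.lr * scale * g
        return params
```

The optimizer state is two scalars and one projection vector per tensor, which is where the near-zero memory footprint comes from.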

5. Checkpoint Saving

  • Save model state dict
  • Record training metadata (samples, loss history, job ID)
  • Store in checkpoint directory
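A sketch of the metadata side of this step. The state dict itself would be written with torch.save(); this covers only the sidecar record, with field names assumed from the /status response schema:

```python
import json
import time
from pathlib import Path


def save_checkpoint_metadata(ckpt_dir, job_id, num_samples, loss_history):
    """Write the metadata record stored alongside a checkpoint .pt file."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    meta = {
        "job_id": job_id,
        "training_samples": num_samples,
        "loss_history": loss_history,
        "checkpoint_path": str(ckpt_dir / f"checkpoint_{job_id}.pt"),
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    (ckpt_dir / f"checkpoint_{job_id}.json").write_text(json.dumps(meta, indent=2))
    return meta
```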

6. vLLM Resume

  • Signal vLLM to resume inference
  • KV cache rebuilt as new requests arrive
  • Updated weights now active in inference
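The pause/train/resume cycle in steps 3–6 can be sketched with the vLLM controls injected as callables (the real implementations live in apollo_worker.py; names here are placeholders). The one design point worth encoding is that resume must run even when a training step raises, so a failed job can never leave inference paused:

```python
def run_training_job(pause_vllm, resume_vllm, train_step, samples, max_steps=100):
    """Pause/train/resume cycle from the pipeline above (callables are placeholders)."""
    loss_history = []
    pause_vllm()                       # step 3: freeze in-flight requests, free KV cache
    try:
        for _ in range(max_steps):     # step 4: APOLLO-Mini training loop
            loss_history.append(train_step(samples))
    finally:
        resume_vllm()                  # step 6: KV cache rebuilds as requests arrive
    return loss_history                # step 5 (checkpointing) is left to the caller
```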

Memory Management

APOLLO-Mini Advantages

  • Optimizer state: ~kilobytes (vs. gigabytes for AdamW)
  • Gradient memory: Only for current batch (not full model)
  • Activation memory: Only for current training step
  • Total overhead: ~55GB for full fine-tuning, much less for LoRA

vLLM Memory Reclamation

  • KV cache can consume 50-70% of GPU memory during inference
  • Pausing inference frees this memory for training
  • Training can use reclaimed space without evicting model weights

Strategy

  1. LoRA + APOLLO-Mini: Train only adapter parameters (~100MB for rank-16)
  2. Time-multiplexed: Pause inference, train, resume
  3. Nightly checkpoints: Save full model state overnight when inference load is low
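The ~100MB figure for rank-16 adapters can be sanity-checked: each adapted d_out × d_in matrix adds r·(d_out + d_in) parameters (the A and B factors). A back-of-the-envelope calculation, using illustrative layer shapes rather than the exact Qwen3.5-27B geometry:

```python
def lora_param_count(shapes, rank=16):
    """Extra LoRA parameters: A (rank x d_in) + B (d_out x rank) per adapted matrix."""
    return sum(rank * (d_out + d_in) for d_out, d_in in shapes)


# Illustrative geometry (NOT the real model config): 48 layers, hidden size 5120,
# adapting only the attention q/k/v/o projections, all hidden x hidden for simplicity.
layers, hidden = 48, 5120
shapes = [(hidden, hidden)] * 4 * layers
params = lora_param_count(shapes, rank=16)
megabytes = params * 2 / 1e6  # bf16: 2 bytes per parameter
```

Under these assumptions the adapters come out to a few tens of megabytes, consistent in order of magnitude with the ~100MB claim above.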

Implementation Phases

Phase 1: Prototype (Current)

  • Apollo Worker daemon skeleton
  • vLLM pause/resume integration
  • Basic training loop with placeholder model
  • Checkpoint saving/loading
  • Test with small dataset

Phase 2: Integration

  • Connect to actual Qwen3.5-27B model
  • Implement vLLM pause/resume API
  • Memory mapping for weight sharing
  • Training signal agent MVP
  • End-to-end test with real conversations

Phase 3: Production

  • APOLLO-Mini implementation (rank-1 projection)
  • LoRA adapter integration
  • Nightly checkpoint scheduling
  • Training metrics and monitoring
  • Rollback mechanism for bad checkpoints

Technical Challenges

1. vLLM Pause/Resume

  • vLLM's pause_generation() API needs testing
  • In-flight request handling during pause
  • KV cache invalidation strategy

2. Gradient Computation

  • torch.inference_mode() blocks gradients; tensors created under it are inference tensors and can never be used in autograd
  • torch.enable_grad() re-enables tracking during training, but inference tensors must first be cloned (or the weights loaded outside inference mode)
  • CUDA graphs are incompatible with training (use eager mode)
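A minimal torch illustration of this constraint, on toy tensors rather than the real weight path: a tensor created under inference_mode cannot enter autograd even inside enable_grad(), but a clone made outside that scope can.

```python
import torch

with torch.inference_mode():
    w = torch.ones(4)          # inference tensor: autograd is permanently disabled on it

# Cloning outside inference_mode yields a normal tensor that can require grad.
w_train = w.clone().requires_grad_(True)

with torch.enable_grad():      # re-enables tracking if a no_grad scope is active
    loss = (w_train * 2.0).sum()

loss.backward()                # d(sum(2w))/dw = 2 for every element
```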

3. Memory Sharing

  • Model weights must be shared between vLLM and training process
  • Memory mapping or zero-copy IPC
  • Tensor parallelism consistency (if using TP)

4. APOLLO-Mini Implementation

  • Rank-1 gradient projection
  • Fixed random projection matrix (not SVD)
  • Tensor-wise scaling factor computation
  • Integration with existing optimizer infrastructure

Next Steps

  1. Test vLLM pause/resume: Verify API works and measure overhead
  2. Implement weight sharing: Memory map model weights between processes
  3. Build training signal agent: MVP that identifies improvement opportunities
  4. Test end-to-end: Run training job with real conversation data
  5. Optimize: Measure memory usage, training time, inference impact

References

  • APOLLO-Mini paper: arXiv:2412.05270
  • vLLM source: /tmp/vllm/
  • LeMix (interleaved training/inference): arXiv:2507.21276
  • Research document: /home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md