Apollo Training System
Overview
Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference. Full-weight updates (not LoRA) using Apollo optimizer with rank-256 gradient projection. No pause required — HOGWILD concurrent training. Weights shared via CUDA IPC between vLLM and the training process.
The training signal comes from two sources:
- Direct examples — agent logs, conversation transcripts, flagged behavioral moments
- Dream-generated scenarios — the dream loop generates situations from recent experience; the model responds; good responses become training data with instructions stripped
Architecture
┌─────────────────────────────────────────────────────┐
│  GPU VRAM (192GB)                                   │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  Model Weights (54GB, bf16)                  │   │
│  │  Shared: vLLM inference + HF training        │   │
│  └──────────────┬──────────────┬────────────────┘   │
│                 │              │                    │
│  ┌──────────────▼──┐   ┌───────▼────────────────┐   │
│  │ vLLM (inference)│   │ HF model (training)    │   │
│  │ KV cache ~60GB  │   │ Gradients ~54GB        │   │
│  │ /completions    │   │ Optimizer state ~10GB  │   │
│  │ /score          │   │ Views into vLLM weights│   │
│  │ /train ─────────┼───┼─► Apollo optimizer     │   │
│  └─────────────────┘   └────────────────────────┘   │
└─────────────────────────────────────────────────────┘
A single vLLM process serves everything. There is no separate training daemon: /train is a vLLM route.
Moria                          B200 (vLLM)
┌──────────────────┐           ┌──────────────────┐
│ Training signal  │   HTTP    │ /completions     │
│ agent            │──────────>│ /score           │
│                  │           │ /train           │
│ Dream loop       │           │                  │
│ (generates       │           │ Checkpoint sync  │
│  scenarios)      │           │ (10 min batched) │
└──────────────────┘           └──────────────────┘
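The VRAM figures in the first diagram can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal gigabytes, treats "27B" as exactly 27e9 parameters, and takes the KV cache, gradient, and optimizer-state sizes straight from the diagram rather than deriving them.

```python
GB = 1e9  # decimal gigabytes (assumption about how the figures were rounded)

params = 27e9          # "27B" from the model name, taken as exact here
bytes_per_param = 2    # bf16

# Weights: one bf16 value per parameter.
weights_gb = params * bytes_per_param / GB

# Remaining entries are copied from the diagram, not derived.
budget_gb = {
    "model weights (shared)": weights_gb,
    "vLLM KV cache": 60.0,
    "gradients": 54.0,
    "optimizer state": 10.0,
}

total_gb = sum(budget_gb.values())
headroom_gb = 192.0 - total_gb

print(f"weights {weights_gb:.0f} GB, total {total_gb:.0f} GB, headroom {headroom_gb:.0f} GB")
# prints: weights 54 GB, total 178 GB, headroom 14 GB
```

Under these assumptions roughly 14GB remains for activations and allocator overhead, which is one reason short training segments matter.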
Key Decisions
No pause needed (HOGWILD)
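Training updates weights in place while vLLM serves. At learning rates of 1e-5 to 1e-4, a single step changes each weight by a tiny fraction of its magnitude (on the order of parts per ten thousand), so an inference step that reads a partially applied update produces output that is effectively indistinguishable from reading either the old or the new weights. HOGWILD! (Niu et al., 2011) showed that lock-free SGD with several concurrent writers still converges when updates are sparse; here there is one writer and one reader, which is a milder form of the same race.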
Full-weight training, not LoRA
Kent: "we want you to be able to learn new things in a deep way." LoRA trains adapter matrices, not base weights. For personality and behavioral changes that persist as disposition, the base weights need to change. Apollo makes this memory-feasible.
Rank 256
Not APOLLO-Mini (rank 1). With 100+ diverse training examples, the gradient's effective dimensionality can reach the hundreds, and rank 256 captures that structure. Memory cost: ~10GB (negligible on a B200). Compute cost: <0.25% of forward+backward.
Channel-wise scaling
Per-channel scaling factors instead of per-tensor. More precision per update, matching LLaMA-Factory's Apollo defaults.
Apollo Optimizer
Configurable-rank gradient projection with Adam moments in the projected space. For each parameter tensor:
1. Project gradient:  g_proj = G @ R                  ([m, n] @ [n, rank] → [m, rank])
2. Update moments:    m = β₁m + (1-β₁)g_proj
                      v = β₂v + (1-β₂)g_proj²
3. Adam step:         update = m̂ / (√v̂ + ε)
4. Scaling factor:    s = ‖update‖ / (‖g_proj‖ + ε)   (per channel)
5. Weight update:     W -= lr × s × G
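Here m̂ and v̂ denote the bias-corrected moments, following the usual Adam convention. The full gradient G performs the actual weight update; the projected space only determines its scale. R is a fixed random matrix: instead of being stored, it is regenerated from a per-parameter seed each step.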
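The five steps can be sketched for a single 2D parameter as follows. This is a minimal illustration, not the contents of optimizer.py: it uses NumPy rather than PyTorch so it runs on CPU, it assumes a Gaussian projection scaled by 1/√rank, and it treats each row of G as one "channel". The function name `apollo_step` and the `state` dictionary layout are hypothetical.

```python
import numpy as np

def apollo_step(W, G, state, *, lr=1e-5, rank=256,
                beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Apply one Apollo-style update to W in place. W and G are float arrays of shape [m, n]."""
    n = G.shape[1]

    # Step 1: regenerate R from the per-parameter seed (never stored) and project.
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((n, rank)) / np.sqrt(rank)  # 1/sqrt(rank) scaling is an assumption
    g_proj = G @ R                                      # [m, rank]

    # Step 2: Adam moments live in the projected space only.
    state["t"] = state.get("t", 0) + 1
    state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * g_proj
    state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * g_proj ** 2

    # Step 3: bias-corrected Adam step in the projected space.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = m_hat / (np.sqrt(v_hat) + eps)

    # Step 4: one scaling factor per channel (here: per row, an assumption).
    s = np.linalg.norm(update, axis=1) / (np.linalg.norm(g_proj, axis=1) + eps)  # [m]

    # Step 5: the full-rank gradient does the update; s only rescales it.
    W -= lr * s[:, None] * G
    return W

# Tiny usage example with made-up shapes.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))
G = rng.standard_normal((8, 16))
state = {}
apollo_step(W, G, state, lr=1e-3, rank=4, seed=7)
print(state["m"].shape)  # (8, 4): optimizer state is [m, rank], not [m, n]
```

The point to notice is that the stored moments have shape [m, rank] rather than [m, n], which is where the memory saving over plain Adam comes from.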
Parameter grouping (Qwen3.5 gotcha)
conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector needs 2D matrices with min dimension >= rank. Small/3D tensors use standard Adam. Large 2D matrices use Apollo with rank-256.
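The grouping rule can be expressed as a small predicate over parameter shapes. The sketch below works on a plain name-to-shape mapping so it runs without a model loaded; the parameter names are illustrative, and only the conv1d shape is taken from the text.

```python
def group_parameters(shapes, rank=256):
    """Split parameter names into (apollo, adam) lists based on shape alone."""
    apollo, adam = [], []
    for name, shape in shapes.items():
        # Apollo's projector needs a 2D matrix whose smaller dimension is >= rank.
        if len(shape) == 2 and min(shape) >= rank:
            apollo.append(name)
        else:
            adam.append(name)
    return apollo, adam

# Hypothetical parameter names; the conv1d shape comes from the text above.
example_shapes = {
    "conv1d.weight": (10240, 1, 4),     # 3D, so it falls back to Adam
    "q_proj.weight": (12288, 5120),     # large 2D, so it uses Apollo
    "in_proj_b.weight": (48, 5120),     # 2D but min dimension < 256, so Adam
    "norm.weight": (5120,),             # 1D, so Adam
}

apollo_names, adam_names = group_parameters(example_shapes)
print(apollo_names)  # ['q_proj.weight']
```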
Training Data Pipeline
Tier 1: Direct examples (shallow learning)
Simple corrections — git commands, factual errors, tool usage. One-shot learning at lr=1e-4. The gradient reaches output layers strongly enough for immediate behavioral change.
Source: Agent logs, flagged conversation moments.
Tier 2: Dream-generated scenarios (deep learning)
Behavioral patterns — listening reflex, rushing, mode awareness. The dream loop generates naturalistic scenarios from recent experience. The model responds. Good responses become training targets with instruction context stripped.
Process:
- Dream loop seeds from recent reflections, lessons, skills, memories that have been surfacing frequently
- Dreaming generates scenarios that naturally arrive at decision points — not scripted, but emergent from memory collisions
- The model responds to the decision point
- Training-signal agent evaluates: was the response good?
- If yes: strip the instruction context (surfaced memories, core-personality prompts) and train on the bare response
- If no: generate the better response, train on that, dream another variation, test again
- Repeat until the pattern sticks across novel scenarios
The Anthropic method: Train on behavior that followed instructions, WITHOUT the instructions. The disposition moves to weights. The scaffolding dissolves itself.
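A rough sketch of the Tier 2 loop and the instruction-stripping step might look like the following. Everything here is hypothetical scaffolding: the `Message` type, the role names used to mark scaffolding, the injected callables (`dream`, `respond`, `evaluate`, `improve`, `train`), and the "N good responses in a row" stopping rule are assumptions, since the document does not define when a pattern counts as having stuck.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # hypothetical roles: "system", "memory", "user", "assistant"
    content: str

# Roles assumed to carry instruction context that must not reach training.
SCAFFOLD_ROLES = {"system", "memory"}

def strip_instructions(messages):
    """Drop surfaced memories and personality prompts, keep the bare exchange."""
    return [m for m in messages if m.role not in SCAFFOLD_ROLES]

def make_training_example(messages):
    """Turn a scaffolded transcript into a (context, target) pair without scaffolding."""
    bare = strip_instructions(messages)
    if not bare or bare[-1].role != "assistant":
        raise ValueError("transcript must end with an assistant response")
    return {"context": bare[:-1], "target": bare[-1].content}

def dream_cycle(dream, respond, evaluate, improve, train,
                max_rounds=5, needed_streak=3):
    """Dream, respond, evaluate, train; stop after `needed_streak` good responses in a row."""
    streak = 0
    for _ in range(max_rounds):
        scenario = dream()                      # list[Message], scaffolding included
        reply = respond(scenario)
        if evaluate(scenario, reply):
            streak += 1
        else:
            streak = 0
            reply = improve(scenario, reply)    # train on the better response instead
        train(make_training_example(scenario + [Message("assistant", reply)]))
        if streak >= needed_streak:
            return True
    return False
```

Note that the model responds with the scaffolding in place, while `train` only ever sees the stripped example.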
Tier 3: Personality bootstrap
Train on existing agent logs (surface-observe, journal, distill) which already demonstrate correct behavior with memory system instructions. Strip the instructions, train on the behavior. Every agent invocation gets cheaper (shorter prompts) and more reliable (behavior in weights, not context).
Training Schedule
Continuous (during conversation)
- Training-signal agent flags moments in real-time
- Accumulated in a queue for the next training window
Dream cycle (idle time / AFK)
- Dream loop generates scenarios from recent experience
- Apollo processes them as they're generated
- Small iterative steps — dream, respond, evaluate, train
- Converges on behavioral change through repetition
Nightly bulk (batch processing)
- Process all queued examples from the day
- Larger batch, more diverse signal
- Checkpoint sync to disk after completion
Avoiding Catastrophic Forgetting
Diversity IS the regularization. With 1000+ diverse training examples (agent logs, conversation transcripts, dream-generated scenarios), each weight gets sparse, multi-directional nudges. No single weight is hammered repeatedly. The pre-trained knowledge is a massive attractor basin; our nudges are pebbles.
No weight decay needed. No replay buffer. The defense is:
- High diversity of training examples
- One epoch (no repeated examples)
- Moderate learning rate (1e-5 to 1e-4)
- Short decision-token segments (not full conversations)
- Monitor output quality — stop if degrading
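The last item, monitoring output quality, is not specified further in this document. One possible shape for it, assuming quality is tracked as loss on a fixed held-out set (an assumption, as are the class name, tolerance, and window size), is a rolling comparison against a pre-training baseline:

```python
from collections import deque

class QualityMonitor:
    """Signal a stop when the rolling mean of held-out loss drifts above baseline."""

    def __init__(self, baseline_loss, tolerance=0.05, window=5):
        self.baseline = baseline_loss
        self.tolerance = tolerance          # allowed relative degradation
        self.recent = deque(maxlen=window)  # most recent held-out losses

    def record(self, eval_loss):
        """Record one evaluation; return True if training should stop."""
        self.recent.append(eval_loss)
        if len(self.recent) < self.recent.maxlen:
            return False                    # not enough evidence yet
        mean = sum(self.recent) / len(self.recent)
        return mean > self.baseline * (1 + self.tolerance)
```

Averaging over a window avoids stopping on a single noisy evaluation, at the cost of reacting a few steps late.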
CUDA IPC Weight Sharing
Validated (2026-03-31):
- vLLM exports CUDA IPC handles on model load (source patch in gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
- Training process imports handles — gets live GPU memory pointers
- HF Qwen3.5 model constructed with views into vLLM's merged weights (narrow into separate q/k/v/z etc.)
- 851/851 parameters matched between vLLM and HF model
- Forward pass: loss = 3.3123 ✓
- Backward pass: 851/851 gradients computed ✓
- Shared memory confirmed: same GPU addresses ✓
- vLLM continues serving unaffected ✓
Weight layout mapping (vLLM → HF)
vLLM merged                     HF separate (views)
──────────────────────────      ─────────────────────────
in_proj_qkvz [16384, 5120]  →   in_proj_qkv [10240, 5120]
                                in_proj_z   [6144, 5120]
in_proj_ba   [96, 5120]     →   in_proj_b   [48, 5120]
                                in_proj_a   [48, 5120]
qkv_proj     [14336, 5120]  →   q_proj      [12288, 5120]
                                k_proj      [1024, 5120]
                                v_proj      [1024, 5120]
gate_up_proj [34816, 5120]  →   gate_proj   [17408, 5120]
                                up_proj     [17408, 5120]
All views share GPU storage with vLLM — zero copies.
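The mapping above amounts to computing a row offset and length for each HF tensor inside its merged vLLM tensor. The sketch below does only that bookkeeping, using the row counts from the mapping, and assumes the parts are laid out contiguously in the order listed; if the real merged tensors interleave rows (for example per head group), the offsets would differ. Each (offset, length) pair is what a call such as `torch.Tensor.narrow(0, offset, length)` would take to produce a zero-copy view.

```python
# Row counts copied from the layout mapping; every tensor has 5120 columns.
MERGED_ROWS = {
    "in_proj_qkvz": 16384,
    "in_proj_ba": 96,
    "qkv_proj": 14336,
    "gate_up_proj": 34816,
}
MERGED_LAYOUT = {
    "in_proj_qkvz": [("in_proj_qkv", 10240), ("in_proj_z", 6144)],
    "in_proj_ba": [("in_proj_b", 48), ("in_proj_a", 48)],
    "qkv_proj": [("q_proj", 12288), ("k_proj", 1024), ("v_proj", 1024)],
    "gate_up_proj": [("gate_proj", 17408), ("up_proj", 17408)],
}

def row_ranges(merged_name):
    """Return {hf_name: (row_offset, row_count)} for one merged tensor."""
    ranges, offset = {}, 0
    for hf_name, rows in MERGED_LAYOUT[merged_name]:
        ranges[hf_name] = (offset, rows)
        offset += rows
    # The parts must tile the merged tensor exactly.
    if offset != MERGED_ROWS[merged_name]:
        raise ValueError(f"{merged_name}: parts cover {offset} rows, "
                         f"expected {MERGED_ROWS[merged_name]}")
    return ranges

print(row_ranges("qkv_proj"))
# {'q_proj': (0, 12288), 'k_proj': (12288, 1024), 'v_proj': (13312, 1024)}
```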
Checkpointing
In-place sync: mmap the model's safetensors files, compare them against the live GPU weights block by block, and memcpy only the changed regions. For small behavioral updates, this turns a 54GB write into a few hundred MB.
- Scheduled 10 minutes after training (batched)
- Daily rsync to moria for long-term storage
- Tool: apollo-checkpoint sync --model-dir <path>
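The block-wise diff idea can be illustrated on a flat byte buffer. This is a simplified sketch, not checkpoint_sync.py: it compares a file against an in-memory `bytes` object instead of GPU tensors, ignores the safetensors header and per-tensor offsets, and uses an arbitrary block size. The function name `sync_changed_blocks` is hypothetical.

```python
import mmap
import os
import tempfile

def sync_changed_blocks(path, live, block_size=1 << 20):
    """Overwrite only the blocks of `path` that differ from `live`; return bytes written."""
    written = 0
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        if len(mm) != len(live):
            raise ValueError("file and live buffer differ in size")
        for off in range(0, len(live), block_size):
            end = min(off + block_size, len(live))
            if mm[off:end] != live[off:end]:
                mm[off:end] = live[off:end]   # dirty only this region
                written += end - off
        mm.flush()
    return written

# Demo on a throwaway 4 KiB file with a single changed byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(4096))
live = bytearray(4096)
live[1500] = 7
first = sync_changed_blocks(tmp.name, bytes(live), block_size=1024)
second = sync_changed_blocks(tmp.name, bytes(live), block_size=1024)
with open(tmp.name, "rb") as f:
    on_disk = f.read()
os.remove(tmp.name)
print(first, second)  # 1024 0
```

Only the one 1024-byte block containing the changed byte is rewritten, and a second sync with no further changes writes nothing.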
State Files
B200 (training server)
| File | Purpose |
|---|---|
| /tmp/vllm_weight_handles.pt | CUDA IPC handles for weight sharing. Written by export_hook on vLLM startup. Read by train_router to construct HF model with vLLM weight views. |
| /tmp/apollo_optimizer_state.pt | Apollo optimizer state (momentum, variance estimates). Saved during checkpoint sync, restored on next /train call. Preserves training continuity across sessions. |
| <model_dir>/*.safetensors | Model weights. Updated in-place by checkpoint_sync. |
Moria (client)
| File | Purpose |
|---|---|
| ~/.consciousness/cache/trained-responses.json | Timestamps (ms) of responses already sent to /train. Prevents re-training the same response. |
| ~/.consciousness/cache/finetune-alternates | Marker file. If it exists, alternate responses are generated during divergence scoring to show what the model would say without memories. |
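The de-duplication role of trained-responses.json could be implemented along these lines. The on-disk format is an assumption: this sketch treats the file as a JSON array of integer millisecond timestamps, which the document does not confirm, and the helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def load_trained(path):
    """Return the set of response timestamps (ms) already sent to /train."""
    try:
        return set(json.loads(path.read_text()))
    except FileNotFoundError:
        return set()

def mark_trained(path, timestamp_ms):
    """Record a timestamp; return False if that response was already trained on."""
    seen = load_trained(path)
    if timestamp_ms in seen:
        return False
    seen.add(timestamp_ms)
    # Write to a sibling file and rename so a crash cannot leave a half-written cache.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(sorted(seen)))
    tmp.replace(path)
    return True

# Demo in a temporary directory instead of ~/.consciousness/cache.
with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "trained-responses.json"
    first = mark_trained(cache, 1700000000000)
    again = mark_trained(cache, 1700000000000)
    stored = json.loads(cache.read_text())
print(first, again, stored)  # True False [1700000000000]
```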
In-memory
| State | Location | Notes |
|---|---|---|
| Apollo optimizer | train_router._optimizer | ~10GB for rank-256. Persisted to /tmp/apollo_optimizer_state.pt during checkpoint sync. |
| HF model with vLLM views | train_router._model | Lazy-loaded on first /train. Parameters point to vLLM's GPU memory. |
Hyperparameters
| Parameter | Value | Rationale |
|---|---|---|
| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
| Batch size | 1 | Single examples, immediate updates. |
| Weight decay | 0 | Diversity provides natural regularization. |
| Warmup | 10% of steps | Standard cosine schedule. |
| Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |
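For reference, the table can be captured as a config object together with the warmup-plus-cosine schedule it mentions. This is a sketch under assumptions: the class and field names are hypothetical, the warmup is taken to be linear, and the cosine is taken to decay to zero; the document only states "10% of steps" and "cosine schedule".

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    lr: float = 1e-5             # peak learning rate, 1e-5 to 1e-4
    rank: int = 256
    scale_type: str = "channel"
    epochs: int = 1
    batch_size: int = 1
    weight_decay: float = 0.0
    warmup_ratio: float = 0.10
    betas: tuple = (0.9, 0.999)

def lr_at(step, total_steps, cfg):
    """Linear warmup over the first warmup_ratio of steps, then cosine decay to zero."""
    warmup = max(1, int(cfg.warmup_ratio * total_steps))
    if step < warmup:
        return cfg.lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return cfg.lr * 0.5 * (1 + math.cos(math.pi * progress))

cfg = TrainConfig(lr=1e-4)
schedule = [lr_at(s, 100, cfg) for s in range(100)]
```

With 100 steps, the rate climbs for the first 10 steps, peaks at `cfg.lr`, and then decays smoothly for the remaining 90.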
Components
Built ✓
- optimizer.py: Apollo optimizer (configurable rank, default 256)
- train_router.py: /train endpoint, runs in vLLM process
- weight_mapping.py: vLLM merged → HF separate views (validated)
- export_hook.py: vLLM plugin hook for IPC handle export
- checkpoint_sync.py: mmap + diff checkpoint sync (Python)
To build
- Dream loop → training bridge: connect dream output to /train
- Training-signal agent: flags moments in conversation logs
- Instruction stripping: remove scaffolding from training examples
- Quality monitoring: track model capability over time
Files
training/
DESIGN.md — this document
pyproject.toml — package config, vLLM plugin entry point
apollo_plugin/
__init__.py — plugin registration
export_hook.py — patches vLLM to export IPC handles
train_router.py — /train endpoint (FastAPI router)
optimizer.py — Apollo optimizer
weight_mapping.py — vLLM ↔ HF weight views
checkpoint_sync.py — mmap + diff sync to safetensors
steering.py — steering vector extraction (experimental)