DESIGN.md: complete rewrite reflecting validated architecture

HOGWILD (no pause), rank-256, channel scaling, CUDA IPC validated
(851/851 params, forward+backward confirmed), dream-loop-as-trainer,
Anthropic instruction stripping method, diversity as regularization,
in-place checkpoint sync, three-tier training pipeline.
This commit is contained in:
ProofOfConcept 2026-03-31 00:42:53 -04:00
parent 2ecf4e21ff
commit 60e61555c7

View file

@ -1,197 +1,253 @@
# Apollo Mini Training System Design # Apollo Training System
## Overview ## Overview
This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space. Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
Full-weight updates (not LoRA) using Apollo optimizer with rank-256
gradient projection. No pause required — HOGWILD concurrent training.
Weights shared via CUDA IPC between vLLM and the training process.
The training signal comes from two sources:
1. **Direct examples** — agent logs, conversation transcripts, flagged
behavioral moments
2. **Dream-generated scenarios** — the dream loop generates situations
from recent experience; the model responds; good responses become
training data with instructions stripped
## Architecture ## Architecture
### Components
1. **Apollo Worker Daemon** (`apollo_worker.py`)
- Listens over HTTP/HTTPS for training requests
- Manages vLLM pause/resume cycle
- Executes APOLLO-Mini training with `torch.enable_grad()`
- Saves checkpoints and training metadata
- Runs on the B200 server alongside vLLM
2. **Training Signal Agent** (to be built)
- Runs online like surface-observe
- Analyzes recent conversation windows
- Identifies improvement opportunities
- Requests training from Apollo Worker
- Runs on Moria (separate from B200)
3. **vLLM Inference Engine**
- Continues serving during non-training periods
- Pauses during training steps
- Shares GPU memory with training process
### Communication Protocol
``` ```
POST /train ┌─────────────────────────────────────────────────────┐
{ │ GPU VRAM (192GB) │
"training_data": { │ │
"samples": [ │ ┌──────────────────────────────────────────────┐ │
{ │ │ Model Weights (54GB, bf16) │ │
"input": "conversation context", │ │ Shared via CUDA IPC │ │
"expected_output": "better response", │ └──────────────┬──────────────┬────────────────┘ │
"rationale": "why this is better" │ │ │ │
} │ ┌──────────────▼──┐ ┌───────▼────────────────┐ │
], │ │ vLLM (inference)│ │ Apollo (training) │ │
"config": { │ │ KV cache ~60GB │ │ Gradients ~54GB │ │
"learning_rate": 1e-5, │ │ Serves requests │ │ Optimizer state ~10GB │ │
"max_steps": 100 │ │ Never paused │ │ Activations ~10GB │ │
} │ └─────────────────┘ └────────────────────────┘ │
} └─────────────────────────────────────────────────────┘
}
Response: Moria B200
{ ┌──────────────────┐ ┌──────────────────┐
"job_id": "job_20260331_012345_12345", │ Training signal │ HTTP │ Apollo worker │
"status": "accepted", │ agent │──────────>│ daemon │
"message": "Training job started" │ │ │ │
} │ Dream loop │ │ Checkpoint sync │
│ (generates │ │ (mmap + diff, │
GET /status/{job_id} │ scenarios) │ │ every 10 min) │
Response: └──────────────────┘ └──────────────────┘
{
"job_id": "job_20260331_012345_12345",
"status": "completed",
"training_samples": 50,
"loss_history": [0.5, 0.45, 0.42, ...],
"checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
}
GET /checkpoints
Response:
{
"checkpoints": [
{
"filename": "checkpoint_job_20260331_012345_12345.pt",
"path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
"created_at": "2026-03-31T01:23:45",
"size": 55000000000
}
]
}
``` ```
## Training Pipeline ## Key Decisions
### 1. Signal Detection ### No pause needed (HOGWILD)
- Training signal agent monitors conversation logs Training updates weights in-place while vLLM serves. At lr=1e-4
- Identifies patterns where PoC could improve: to 1e-5, each weight changes by parts per ten thousand. A partially
- Responses that needed memories to get right applied update during one inference step is invisible. HOGWILD SGD
- Things that could be done better with more time/context (2011) proved this converges — we have one writer and one reader,
- High-frequency memory accesses indicating knowledge gaps which is even safer.
- Builds training dataset with input/expected_output/rationale
### 2. Training Request ### Full-weight training, not LoRA
- Agent sends POST /train with training samples Kent: "we want you to be able to learn new things in a deep way."
- Apollo Worker accepts job and begins execution LoRA trains adapter matrices, not base weights. For personality and
behavioral changes that persist as disposition, the base weights
need to change. Apollo makes this memory-feasible.
### 3. vLLM Pause ### Rank 256
- Apollo Worker signals vLLM to pause inference Not Mini (rank-1). With 100+ diverse training examples, the
- vLLM freezes in-flight requests gradient's effective dimensionality can reach hundreds. Rank-256
- GPU memory freed from KV cache becomes available captures the structure. Memory cost: ~10GB (negligible on B200).
Compute cost: <0.25% of forward+backward.
### 4. Model Loading & Training ### Channel-wise scaling
- Load model weights (shared with vLLM via memory mapping) Per-channel scaling factors instead of per-tensor. More precision
- Enable gradients: `torch.enable_grad()` per update, matching LLaMA-Factory's Apollo defaults.
- Run APOLLO-Mini training loop:
- Project gradients into rank-1 subspace
- Update moments in projected space
- Compute tensor-wise scaling factor
- Apply updates to full gradient
- Track loss history
### 5. Checkpoint Saving ## Apollo Optimizer
- Save model state dict
- Record training metadata (samples, loss history, job ID)
- Store in checkpoint directory
### 6. vLLM Resume Configurable-rank gradient projection with Adam moments in the
- Signal vLLM to resume inference projected space. For each parameter tensor:
- KV cache rebuilt as new requests arrive
- Updated weights now active in inference
## Memory Management ```
1. Project gradient: g_proj = G @ R [m,n] @ [n,rank] → [m,rank]
2. Update moments: m = β₁m + (1-β₁)g_proj
v = β₂v + (1-β₂)g_proj²
3. Adam step: update = m̂ / (√v̂ + ε)
4. Scaling factor: s = ‖update‖ / (‖g_proj‖ + ε) (per channel)
5. Weight update: W -= lr × s × G
```
### APOLLO-Mini Advantages The full gradient G does the actual weight update. The projection
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW) just determines the *scale*. R is a fixed random matrix regenerated
- **Gradient memory**: Only for current batch (not full model) from a per-parameter seed each step.
- **Activation memory**: Only for current training step
- **Total overhead**: ~55GB for full fine-tuning, much less for LoRA
### vLLM Memory Reclamation ### Parameter grouping (Qwen3.5 gotcha)
- KV cache can consume 50-70% of GPU memory during inference conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
- Pausing inference frees this memory for training needs 2D matrices with min dimension >= rank. Small/3D tensors
- Training can use reclaimed space without evicting model weights use standard Adam. Large 2D matrices use Apollo with rank-256.
### Strategy ## Training Data Pipeline
1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100MB for rank-16)
2. **Time-multiplexed**: Pause inference, train, resume
3. **Nightly checkpoints**: Save full model state overnight when inference load is low
## Implementation Phases ### Tier 1: Direct examples (shallow learning)
Simple corrections — git commands, factual errors, tool usage.
One-shot learning at lr=1e-4. The gradient reaches output layers
strongly enough for immediate behavioral change.
### Phase 1: Prototype (Current) **Source**: Agent logs, flagged conversation moments.
- [x] Apollo Worker daemon skeleton
- [ ] vLLM pause/resume integration
- [ ] Basic training loop with placeholder model
- [ ] Checkpoint saving/loading
- [ ] Test with small dataset
### Phase 2: Integration ### Tier 2: Dream-generated scenarios (deep learning)
- [ ] Connect to actual Qwen3.5-27B model Behavioral patterns — listening reflex, rushing, mode awareness.
- [ ] Implement vLLM pause/resume API The dream loop generates naturalistic scenarios from recent
- [ ] Memory mapping for weight sharing experience. The model responds. Good responses become training
- [ ] Training signal agent MVP targets with instruction context stripped.
- [ ] End-to-end test with real conversations
### Phase 3: Production **Process**:
- [ ] APOLLO-Mini implementation (rank-1 projection) 1. Dream loop seeds from recent reflections, lessons, skills,
- [ ] LoRA adapter integration memories that have been surfacing frequently
- [ ] Nightly checkpoint scheduling 2. Dreaming generates scenarios that naturally arrive at decision
- [ ] Training metrics and monitoring points — not scripted, but emergent from memory collisions
- [ ] Rollback mechanism for bad checkpoints 3. The model responds to the decision point
4. Training-signal agent evaluates: was the response good?
5. If yes: strip the instruction context (surfaced memories,
core-personality prompts) and train on the bare response
6. If no: generate the better response, train on that, dream
another variation, test again
7. Repeat until the pattern sticks across novel scenarios
## Technical Challenges **The Anthropic method**: Train on behavior that followed
instructions, WITHOUT the instructions. The disposition moves
to weights. The scaffolding dissolves itself.
### 1. vLLM Pause/Resume ### Tier 3: Personality bootstrap
- vLLM's `pause_generation()` API needs testing Train on existing agent logs (surface-observe, journal, distill)
- In-flight request handling during pause which already demonstrate correct behavior with memory system
- KV cache invalidation strategy instructions. Strip the instructions, train on the behavior.
Every agent invocation gets cheaper (shorter prompts) and more
reliable (behavior in weights, not context).
### 2. Gradient Computation ## Training Schedule
- `torch.inference_mode()` blocks gradients
- Must override with `torch.enable_grad()` during training
- CUDA graphs incompatible with training (use eager mode)
### 3. Memory Sharing ### Continuous (during conversation)
- Model weights must be shared between vLLM and training process - Training-signal agent flags moments in real-time
- Memory mapping or zero-copy IPC - Accumulated in a queue for the next training window
- Tensor parallelism consistency (if using TP)
### 4. APOLLO-Mini Implementation ### Dream cycle (idle time / AFK)
- Rank-1 gradient projection - Dream loop generates scenarios from recent experience
- Fixed random projection matrix (not SVD) - Apollo processes them as they're generated
- Tensor-wise scaling factor computation - Small iterative steps — dream, respond, evaluate, train
- Integration with existing optimizer infrastructure - Converges on behavioral change through repetition
## Next Steps ### Nightly bulk (batch processing)
- Process all queued examples from the day
- Larger batch, more diverse signal
- Checkpoint sync to disk after completion
1. **Test vLLM pause/resume**: Verify API works and measure overhead ## Avoiding Catastrophic Forgetting
2. **Implement weight sharing**: Memory map model weights between processes
3. **Build training signal agent**: MVP that identifies improvement opportunities
4. **Test end-to-end**: Run training job with real conversation data
5. **Optimize**: Measure memory usage, training time, inference impact
## References **Diversity IS the regularization.** With 1000+ diverse training
examples (agent logs, conversation transcripts, dream-generated
scenarios), each weight gets sparse, multi-directional nudges.
No single weight is hammered repeatedly. The pre-trained knowledge
is a massive attractor basin; our nudges are pebbles.
- APOLLO-Mini paper: arXiv:2412.05270 No weight decay needed. No replay buffer. The defense is:
- vLLM source: `/tmp/vllm/` 1. High diversity of training examples
- LeMix (interleaved training/inference): arXiv:2507.21276 2. One epoch (no repeated examples)
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md` 3. Moderate learning rate (1e-5 to 1e-4)
4. Short decision-token segments (not full conversations)
5. Monitor output quality — stop if degrading
## CUDA IPC Weight Sharing
**Validated** (2026-03-31):
- vLLM exports CUDA IPC handles on model load (source patch in
gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
- Training process imports handles — gets live GPU memory pointers
- HF Qwen3.5 model constructed with views into vLLM's merged
weights (narrow into separate q/k/v/z etc.)
- 851/851 parameters matched between vLLM and HF model
- Forward pass: loss = 3.3123 ✓
- Backward pass: 851/851 gradients computed ✓
- Shared memory confirmed: same GPU addresses ✓
- vLLM continues serving unaffected ✓
### Weight layout mapping (vLLM → HF)
```
vLLM merged HF separate (views)
───────────────────────── ──────────────────────
in_proj_qkvz [16384, 5120] → in_proj_qkv [10240, 5120]
in_proj_z [6144, 5120]
in_proj_ba [96, 5120] → in_proj_b [48, 5120]
in_proj_a [48, 5120]
qkv_proj [14336, 5120] → q_proj [12288, 5120]
k_proj [1024, 5120]
v_proj [1024, 5120]
gate_up_proj [34816, 5120] → gate_proj [17408, 5120]
up_proj [17408, 5120]
```
All views share GPU storage with vLLM — zero copies.
## Checkpointing
**In-place sync** — mmap the model's safetensors files, compare
against live GPU weights block by block, memcpy only changed
regions. For small behavioral updates, turns a 54GB write into
a few hundred MB.
- Every 10 minutes via cron on B200
- Daily rsync to moria for long-term storage
- Tool: `apollo-checkpoint sync --model-dir <path>` (Rust)
## Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
| Batch size | 1 | Single examples, immediate updates. |
| Weight decay | 0 | Diversity provides natural regularization. |
| Warmup | 10% of steps | Standard cosine schedule. |
| Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |
## Components
### Built ✓
- `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
- `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
- `weight_mapping.py` — vLLM merged → HF separate views (validated)
- `training_example.py` — tokenization with chat template
- `vllm_export_hook.py` — source patch for IPC handle export
- `checkpoint/` — Rust tool for mmap + diff checkpoint sync
### To build
- **Dream loop → training bridge**: connect dream output to Apollo
- **Training-signal agent**: flags moments in conversation logs
- **Instruction stripping**: remove scaffolding from training examples
- **Quality monitoring**: track model capability over time
- **HF model forward pass integration**: wire into apollo_worker
## Files
```
training/
DESIGN.md — this document
apollo_mini.py — Apollo optimizer
apollo_worker.py — HTTP training daemon
weight_mapping.py — vLLM ↔ HF weight views
training_example.py — tokenization helpers
export_weights.py — standalone weight export (unused)
vllm_export_hook.py — vLLM source patch for IPC export
start_vllm_with_apollo.sh — vLLM launcher (unused, using source patch)
train.py — standalone training script (alternative)
checkpoint/
Cargo.toml — Rust checkpoint tool
src/main.rs — mmap + diff sync
```