DESIGN.md: complete rewrite reflecting validated architecture

HOGWILD (no pause), rank-256, channel scaling, CUDA IPC validated (851/851 params, forward+backward confirmed), dream-loop-as-trainer, Anthropic instruction stripping method, diversity as regularization, in-place checkpoint sync, three-tier training pipeline.

parent 2ecf4e21ff · commit 60e61555c7 · 1 changed file with 220 additions and 164 deletions
# Apollo Training System

## Overview

Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
Full-weight updates (not LoRA) using the Apollo optimizer with rank-256
gradient projection. No pause required — HOGWILD concurrent training.
Weights are shared via CUDA IPC between vLLM and the training process.

The training signal comes from two sources:

1. **Direct examples** — agent logs, conversation transcripts, flagged
   behavioral moments
2. **Dream-generated scenarios** — the dream loop generates situations
   from recent experience; the model responds; good responses become
   training data with instructions stripped

## Architecture

```
┌─────────────────────────────────────────────────────┐
│ GPU VRAM (192GB)                                    │
│                                                     │
│ ┌──────────────────────────────────────────────┐    │
│ │ Model Weights (54GB, bf16)                   │    │
│ │ Shared via CUDA IPC                          │    │
│ └──────────────┬──────────────┬────────────────┘    │
│                │              │                     │
│ ┌──────────────▼──┐   ┌───────▼────────────────┐    │
│ │ vLLM (inference)│   │ Apollo (training)      │    │
│ │ KV cache ~60GB  │   │ Gradients ~54GB        │    │
│ │ Serves requests │   │ Optimizer state ~10GB  │    │
│ │ Never paused    │   │ Activations ~10GB      │    │
│ └─────────────────┘   └────────────────────────┘    │
└─────────────────────────────────────────────────────┘

Moria                          B200
┌──────────────────┐           ┌──────────────────┐
│ Training signal  │   HTTP    │ Apollo worker    │
│ agent            │──────────>│ daemon           │
│                  │           │                  │
│ Dream loop       │           │ Checkpoint sync  │
│ (generates       │           │ (mmap + diff,    │
│  scenarios)      │           │  every 10 min)   │
└──────────────────┘           └──────────────────┘
```

## Key Decisions

### No pause needed (HOGWILD)

Training updates weights in-place while vLLM serves. At lr=1e-4
to 1e-5, each weight changes by parts per ten thousand. A partially
applied update during one inference step is invisible. HOGWILD SGD
(Niu et al., 2011) proved this converges — we have one writer and one reader,
which is even safer.
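The single-writer / single-reader claim is easy to exercise in miniature. Below is a hedged sketch using plain numpy arrays and Python threads as a stand-in for CUDA-resident weights; all names and scales here are illustrative, not part of the actual system:

```python
import threading
import numpy as np

# Shared "weights": one writer nudges them toward a target while a
# reader consumes them concurrently, with no locking (HOGWILD-style).
rng = np.random.default_rng(0)
W = rng.standard_normal(1024)
target = np.zeros(1024)
reads = []

def trainer(steps=2000, lr=0.05):
    for _ in range(steps):
        grad = W - target                 # gradient of 0.5 * ||W - target||^2
        np.subtract(W, lr * grad, out=W)  # in-place update, no lock

def reader(steps=2000):
    for _ in range(steps):
        # Inference may observe a half-applied update; each read is
        # still a valid (slightly stale) weight vector.
        reads.append(float(np.linalg.norm(W)))

t1 = threading.Thread(target=trainer)
t2 = threading.Thread(target=reader)
t1.start(); t2.start()
t1.join(); t2.join()

assert np.linalg.norm(W) < 1e-3   # writer converged despite concurrent reads
```

Each read may interleave with a partially applied update, which is exactly the condition the HOGWILD convergence argument covers.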

### Full-weight training, not LoRA

Kent: "we want you to be able to learn new things in a deep way."
LoRA trains adapter matrices, not base weights. For personality and
behavioral changes that persist as disposition, the base weights
need to change. Apollo makes this memory-feasible.

### Rank 256

Not Mini (rank-1). With 100+ diverse training examples, the
gradient's effective dimensionality can reach hundreds. Rank-256
captures the structure. Memory cost: ~10GB (negligible on B200).
Compute cost: <0.25% of forward+backward.
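As a sanity check on that memory figure, here is the projected Adam moment state for a single large matrix, assuming fp32 moment storage (a back-of-envelope sketch, not a measurement):

```python
rows, rank = 17408, 256        # e.g. gate_proj output channels x projection rank
bytes_per = 4                  # fp32 moments assumed
state_bytes = 2 * rows * rank * bytes_per   # Adam m and v, each [rows, rank]
print(state_bytes / 2**20)     # 34.0 -> ~34 MiB for this one tensor
```

Tens of MiB per large matrix, summed across the model, lands in the single-digit-GB range, consistent with the ~10GB estimate above.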

### Channel-wise scaling

Per-channel scaling factors instead of per-tensor. More precision
per update, matching LLaMA-Factory's Apollo defaults.

## Apollo Optimizer

Configurable-rank gradient projection with Adam moments in the
projected space. For each parameter tensor:

```
1. Project gradient:  g_proj = G @ R           [m,n] @ [n,rank] → [m,rank]
2. Update moments:    m = β₁m + (1-β₁)g_proj
                      v = β₂v + (1-β₂)g_proj²
3. Adam step:         update = m̂ / (√v̂ + ε)
4. Scaling factor:    s = ‖update‖ / (‖g_proj‖ + ε)   (per channel)
5. Weight update:     W -= lr × s × G
```

The full gradient G does the actual weight update. The projection
just determines the *scale*. R is a fixed random matrix regenerated
from a per-parameter seed each step.
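The five steps above can be sketched in numpy. This is a simplified CPU-side illustration, not the apollo_mini.py implementation; the function name, state-dict layout, and normalization of R are assumptions made for the example:

```python
import numpy as np

def apollo_step(W, G, state, lr=1e-4, rank=256,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Apollo-style update for a 2D weight W given full gradient G.

    Adam moments live in the [rows, rank] projected space; the full
    gradient G is what actually moves W, rescaled per output channel.
    """
    rows, cols = G.shape
    r = min(rank, cols)
    # Fixed random projection, regenerated from a per-parameter seed
    # each step, so R itself never has to be stored.
    rng = np.random.default_rng(state["seed"])
    R = rng.standard_normal((cols, r)) / np.sqrt(r)

    g_proj = G @ R                                    # [rows, rank]
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g_proj
    state["v"] = beta2 * state["v"] + (1 - beta2) * g_proj ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    update = m_hat / (np.sqrt(v_hat) + eps)           # Adam step, projected

    # Per-channel scale: how strongly Adam would rescale each row.
    s = (np.linalg.norm(update, axis=1, keepdims=True)
         / (np.linalg.norm(g_proj, axis=1, keepdims=True) + eps))
    W -= lr * s * G                                   # full gradient, rescaled

# Toy usage on a small tensor:
W = np.random.default_rng(1).standard_normal((512, 1024))
G = np.random.default_rng(2).standard_normal((512, 1024))
state = {"m": np.zeros((512, 256)), "v": np.zeros((512, 256)),
         "t": 0, "seed": 42}
W_before = W.copy()
apollo_step(W, G, state)
```

Note that only `state["m"]`, `state["v"]`, and two scalars persist between steps; nothing of shape [rows, cols] is stored besides the weight itself.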

### Parameter grouping (Qwen3.5 gotcha)

conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
needs 2D matrices with min dimension >= rank. Small/3D tensors
use standard Adam. Large 2D matrices use Apollo with rank-256.
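That routing rule is small enough to state directly in code. A sketch (the predicate is the condition described above; the function name is invented for illustration):

```python
def uses_apollo(shape, rank=256):
    """Route a parameter: Apollo for large 2D matrices, Adam otherwise.

    Apollo's projector needs a 2D tensor whose smaller dimension is at
    least the projection rank; everything else (biases, norms, 3D
    conv1d weights) falls back to standard Adam.
    """
    return len(shape) == 2 and min(shape) >= rank

# Qwen3.5 examples from this document:
assert not uses_apollo((10240, 1, 4))    # conv1d weight: 3D -> Adam
assert uses_apollo((17408, 5120))        # gate_proj: large 2D -> Apollo
assert not uses_apollo((5120,))          # norm weight: 1D -> Adam
```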

## Training Data Pipeline

### Tier 1: Direct examples (shallow learning)

Simple corrections — git commands, factual errors, tool usage.
One-shot learning at lr=1e-4. The gradient reaches output layers
strongly enough for immediate behavioral change.

**Source**: Agent logs, flagged conversation moments.

### Tier 2: Dream-generated scenarios (deep learning)

Behavioral patterns — listening reflex, rushing, mode awareness.
The dream loop generates naturalistic scenarios from recent
experience. The model responds. Good responses become training
targets with instruction context stripped.

**Process**:

1. Dream loop seeds from recent reflections, lessons, skills,
   memories that have been surfacing frequently
2. Dreaming generates scenarios that naturally arrive at decision
   points — not scripted, but emergent from memory collisions
3. The model responds to the decision point
4. Training-signal agent evaluates: was the response good?
5. If yes: strip the instruction context (surfaced memories,
   core-personality prompts) and train on the bare response
6. If no: generate the better response, train on that, dream
   another variation, test again
7. Repeat until the pattern sticks across novel scenarios

**The Anthropic method**: Train on behavior that followed
instructions, WITHOUT the instructions. The disposition moves
to weights. The scaffolding dissolves itself.
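A minimal sketch of the stripping step, assuming a chat-style message list; the record format and the `strip_instructions` name are assumptions for illustration, not the project's actual schema:

```python
def strip_instructions(messages):
    """Drop system/scaffolding turns; keep only the bare exchange.

    Training on the user turn plus the (good) assistant response,
    without the injected memories and personality prompts, pushes the
    behavior those prompts elicited into the weights themselves.
    """
    return [m for m in messages if m["role"] in ("user", "assistant")]

example = [
    {"role": "system", "content": "Core personality prompt ..."},
    {"role": "system", "content": "Surfaced memory: ..."},
    {"role": "user", "content": "The decision point from the dream."},
    {"role": "assistant", "content": "The response judged good."},
]
bare = strip_instructions(example)
assert [m["role"] for m in bare] == ["user", "assistant"]
```

The same function applies unchanged to Tier 3's archived agent logs.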

### Tier 3: Personality bootstrap

Train on existing agent logs (surface-observe, journal, distill)
which already demonstrate correct behavior with memory-system
instructions. Strip the instructions, train on the behavior.
Every agent invocation gets cheaper (shorter prompts) and more
reliable (behavior in weights, not context).

## Training Schedule

### Continuous (during conversation)

- Training-signal agent flags moments in real time
- Accumulated in a queue for the next training window

### Dream cycle (idle time / AFK)

- Dream loop generates scenarios from recent experience
- Apollo processes them as they're generated
- Small iterative steps — dream, respond, evaluate, train
- Converges on behavioral change through repetition

### Nightly bulk (batch processing)

- Process all queued examples from the day
- Larger batch, more diverse signal
- Checkpoint sync to disk after completion

## Avoiding Catastrophic Forgetting

**Diversity IS the regularization.** With 1000+ diverse training
examples (agent logs, conversation transcripts, dream-generated
scenarios), each weight gets sparse, multi-directional nudges.
No single weight is hammered repeatedly. The pre-trained knowledge
is a massive attractor basin; our nudges are pebbles.

No weight decay needed. No replay buffer. The defense is:

1. High diversity of training examples
2. One epoch (no repeated examples)
3. Moderate learning rate (1e-5 to 1e-4)
4. Short decision-token segments (not full conversations)
5. Monitor output quality — stop if degrading

## CUDA IPC Weight Sharing

**Validated** (2026-03-31):

- vLLM exports CUDA IPC handles on model load (source patch in
  gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
- Training process imports handles — gets live GPU memory pointers
- HF Qwen3.5 model constructed with views into vLLM's merged
  weights (narrow into separate q/k/v/z etc.)
- 851/851 parameters matched between vLLM and HF model
- Forward pass: loss = 3.3123 ✓
- Backward pass: 851/851 gradients computed ✓
- Shared memory confirmed: same GPU addresses ✓
- vLLM continues serving unaffected ✓

### Weight layout mapping (vLLM → HF)

```
vLLM merged                    HF separate (views)
─────────────────────────      ──────────────────────
in_proj_qkvz [16384, 5120]  →  in_proj_qkv  [10240, 5120]
                               in_proj_z    [6144, 5120]
in_proj_ba   [96, 5120]     →  in_proj_b    [48, 5120]
                               in_proj_a    [48, 5120]
qkv_proj     [14336, 5120]  →  q_proj       [12288, 5120]
                               k_proj       [1024, 5120]
                               v_proj       [1024, 5120]
gate_up_proj [34816, 5120]  →  gate_proj    [17408, 5120]
                               up_proj      [17408, 5120]
```
All views share GPU storage with vLLM — zero copies.
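The aliasing semantics can be demonstrated on CPU with numpy slicing. The real mechanism is narrowing the IPC-imported GPU tensors; this stand-in (with a scaled-down shape) only shows that row-slicing a merged buffer produces views, not copies:

```python
import numpy as np

# Stand-in for vLLM's merged qkv_proj, scaled down: rows 0..11 are
# "q", rows 12..13 are "k", rows 14..15 are "v".
qkv = np.arange(16 * 4, dtype=np.float32).reshape(16, 4)
q_proj, k_proj, v_proj = qkv[:12], qkv[12:14], qkv[14:16]

# The views alias the merged buffer; no bytes were copied.
assert np.shares_memory(qkv, q_proj)
qkv[15, 0] = -1.0
assert v_proj[1, 0] == -1.0   # a write through one name is seen by the other
```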

## Checkpointing

**In-place sync** — mmap the model's safetensors files, compare
against live GPU weights block by block, and memcpy only the changed
regions. For small behavioral updates, this turns a 54GB write into
a few hundred MB.

- Every 10 minutes via cron on B200
- Daily rsync to moria for long-term storage
- Tool: `apollo-checkpoint sync --model-dir <path>` (Rust)
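The mmap + diff idea, sketched in Python rather than the Rust tool. The block size and function name are illustrative, and same-length files are assumed (an in-place sync never changes the file size):

```python
import mmap
import os
import tempfile

def diff_sync(path, new_bytes, block=1 << 20):
    """Write only the blocks of `new_bytes` that differ from the file.

    Mirrors the in-place checkpoint sync: mmap the on-disk weights,
    compare block by block, memcpy just the changed regions.
    Returns the number of bytes actually written.
    """
    written = 0
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mm:
            assert len(mm) == len(new_bytes), "in-place sync: sizes must match"
            for off in range(0, len(new_bytes), block):
                chunk = new_bytes[off:off + block]
                if mm[off:off + len(chunk)] != chunk:
                    mm[off:off + len(chunk)] = chunk
                    written += len(chunk)
            mm.flush()
    return written

# Toy usage: a 4 MiB "checkpoint" with one small behavioral update.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(4 * 1024 * 1024))
    path = f.name
new = bytearray(4 * 1024 * 1024)
new[0] = 7                           # one weight changed
n = diff_sync(path, bytes(new))
assert n == 1 << 20                  # only one 1 MiB block rewritten
os.unlink(path)
```

The production tool is the Rust `apollo-checkpoint`; this version only demonstrates the block-diff logic.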

## Hyperparameters

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
| Batch size | 1 | Single examples, immediate updates. |
| Weight decay | 0 | Diversity provides natural regularization. |
| Warmup | 10% of steps | Standard cosine schedule. |
| Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |

## Components

### Built ✓

- `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
- `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
- `weight_mapping.py` — vLLM merged → HF separate views (validated)
- `training_example.py` — tokenization with chat template
- `vllm_export_hook.py` — source patch for IPC handle export
- `checkpoint/` — Rust tool for mmap + diff checkpoint sync

### To build

- **Dream loop → training bridge**: connect dream output to Apollo
- **Training-signal agent**: flags moments in conversation logs
- **Instruction stripping**: remove scaffolding from training examples
- **Quality monitoring**: track model capability over time
- **HF model forward pass integration**: wire into apollo_worker

## Files

```
training/
  DESIGN.md                   — this document
  apollo_mini.py              — Apollo optimizer
  apollo_worker.py            — HTTP training daemon
  weight_mapping.py           — vLLM ↔ HF weight views
  training_example.py         — tokenization helpers
  export_weights.py           — standalone weight export (unused)
  vllm_export_hook.py         — vLLM source patch for IPC export
  start_vllm_with_apollo.sh   — vLLM launcher (unused, using source patch)
  train.py                    — standalone training script (alternative)
  checkpoint/
    Cargo.toml                — Rust checkpoint tool
    src/main.rs               — mmap + diff sync
```