From 60e61555c79c89e8ba4a5e7ab856d81c9a2acbb4 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 00:42:53 -0400
Subject: [PATCH] DESIGN.md: complete rewrite reflecting validated architecture

HOGWILD (no pause), rank-256, channel scaling, CUDA IPC validated
(851/851 params, forward+backward confirmed), dream-loop-as-trainer,
Anthropic instruction stripping method, diversity as regularization,
in-place checkpoint sync, three-tier training pipeline.
---
 training/DESIGN.md | 384 ++++++++++++++++++++++++++-------------------
 1 file changed, 220 insertions(+), 164 deletions(-)

diff --git a/training/DESIGN.md b/training/DESIGN.md
index 313daed..f966fa4 100644
--- a/training/DESIGN.md
+++ b/training/DESIGN.md
@@ -1,197 +1,253 @@
-# Apollo Mini Training System Design
+# Apollo Training System

 ## Overview

-This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.
+Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
+Full-weight updates (not LoRA) using Apollo optimizer with rank-256
+gradient projection. No pause required — HOGWILD concurrent training.
+Weights shared via CUDA IPC between vLLM and the training process.
+
+The training signal comes from two sources:
+1. **Direct examples** — agent logs, conversation transcripts, flagged
+   behavioral moments
+2. **Dream-generated scenarios** — the dream loop generates situations
+   from recent experience; the model responds; good responses become
+   training data with instructions stripped

 ## Architecture

-### Components
-
-1. **Apollo Worker Daemon** (`apollo_worker.py`)
-   - Listens over HTTP/HTTPS for training requests
-   - Manages vLLM pause/resume cycle
-   - Executes APOLLO-Mini training with `torch.enable_grad()`
-   - Saves checkpoints and training metadata
-   - Runs on the B200 server alongside vLLM
-
-2. **Training Signal Agent** (to be built)
-   - Runs online like surface-observe
-   - Analyzes recent conversation windows
-   - Identifies improvement opportunities
-   - Requests training from Apollo Worker
-   - Runs on Moria (separate from B200)
-
-3. **vLLM Inference Engine**
-   - Continues serving during non-training periods
-   - Pauses during training steps
-   - Shares GPU memory with training process
-
-### Communication Protocol
-
 ```
-POST /train
-{
-  "training_data": {
-    "samples": [
-      {
-        "input": "conversation context",
-        "expected_output": "better response",
-        "rationale": "why this is better"
-      }
-    ],
-    "config": {
-      "learning_rate": 1e-5,
-      "max_steps": 100
-    }
-  }
-}
+┌─────────────────────────────────────────────────────┐
+│  GPU VRAM (192GB)                                   │
+│                                                     │
+│  ┌──────────────────────────────────────────────┐   │
+│  │  Model Weights (54GB, bf16)                  │   │
+│  │  Shared via CUDA IPC                         │   │
+│  └──────────────┬──────────────┬────────────────┘   │
+│                 │              │                    │
+│  ┌──────────────▼──┐   ┌───────▼────────────────┐   │
+│  │ vLLM (inference)│   │ Apollo (training)      │   │
+│  │ KV cache ~60GB  │   │ Gradients ~54GB        │   │
+│  │ Serves requests │   │ Optimizer state ~10GB  │   │
+│  │ Never paused    │   │ Activations ~10GB      │   │
+│  └─────────────────┘   └────────────────────────┘   │
+└─────────────────────────────────────────────────────┘

-Response:
-{
-  "job_id": "job_20260331_012345_12345",
-  "status": "accepted",
-  "message": "Training job started"
-}
-
-GET /status/{job_id}
-Response:
-{
-  "job_id": "job_20260331_012345_12345",
-  "status": "completed",
-  "training_samples": 50,
-  "loss_history": [0.5, 0.45, 0.42, ...],
-  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
-}
-
-GET /checkpoints
-Response:
-{
-  "checkpoints": [
-    {
-      "filename": "checkpoint_job_20260331_012345_12345.pt",
-      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
-      "created_at": "2026-03-31T01:23:45",
-      "size": 55000000000
-    }
-  ]
-}
+Moria                          B200
+┌──────────────────┐           ┌──────────────────┐
+│ Training signal  │   HTTP    │ Apollo worker    │
+│ agent            │──────────>│ daemon           │
+│                  │           │                  │
+│ Dream loop       │           │ Checkpoint sync  │
+│ (generates       │           │ (mmap + diff,    │
+│  scenarios)      │           │  every 10 min)   │
+└──────────────────┘           └──────────────────┘
 ```

-## Training Pipeline
+## Key Decisions

-### 1. Signal Detection
-- Training signal agent monitors conversation logs
-- Identifies patterns where PoC could improve:
-  - Responses that needed memories to get right
-  - Things that could be done better with more time/context
-  - High-frequency memory accesses indicating knowledge gaps
-- Builds training dataset with input/expected_output/rationale
+### No pause needed (HOGWILD)
+Training updates weights in-place while vLLM serves. At lr=1e-4
+to 1e-5, each weight changes by parts per ten thousand. A partially
+applied update during one inference step is invisible. HOGWILD SGD
+(2011) proved this converges — we have one writer and one reader,
+which is even safer.

-### 2. Training Request
-- Agent sends POST /train with training samples
-- Apollo Worker accepts job and begins execution
+### Full-weight training, not LoRA
+Kent: "we want you to be able to learn new things in a deep way."
+LoRA trains adapter matrices, not base weights. For personality and
+behavioral changes that persist as disposition, the base weights
+need to change. Apollo makes this memory-feasible.

-### 3. vLLM Pause
-- Apollo Worker signals vLLM to pause inference
-- vLLM freezes in-flight requests
-- GPU memory freed from KV cache becomes available
+### Rank 256
+Not Mini (rank-1). With 100+ diverse training examples, the
+gradient's effective dimensionality can reach hundreds. Rank-256
+captures the structure. Memory cost: ~10GB (negligible on B200).
+Compute cost: <0.25% of forward+backward.

-### 4. Model Loading & Training
-- Load model weights (shared with vLLM via memory mapping)
-- Enable gradients: `torch.enable_grad()`
-- Run APOLLO-Mini training loop:
-  - Project gradients into rank-1 subspace
-  - Update moments in projected space
-  - Compute tensor-wise scaling factor
-  - Apply updates to full gradient
-- Track loss history
+### Channel-wise scaling
+Per-channel scaling factors instead of per-tensor. More precision
+per update, matching LLaMA-Factory's Apollo defaults.

-### 5. Checkpoint Saving
-- Save model state dict
-- Record training metadata (samples, loss history, job ID)
-- Store in checkpoint directory
+## Apollo Optimizer

-### 6. vLLM Resume
-- Signal vLLM to resume inference
-- KV cache rebuilt as new requests arrive
-- Updated weights now active in inference
+Configurable-rank gradient projection with Adam moments in the
+projected space. For each parameter tensor:

-## Memory Management
+```
+1. Project gradient:  g_proj = G @ R      [m,n] @ [n,rank] → [m,rank]
+2. Update moments:    m = β₁m + (1-β₁)g_proj
+                      v = β₂v + (1-β₂)g_proj²
+3. Adam step:         update = m̂ / (√v̂ + ε)
+4. Scaling factor:    s = ‖update‖ / (‖g_proj‖ + ε)   (per channel)
+5. Weight update:     W -= lr × s × G
+```

-### APOLLO-Mini Advantages
-- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
-- **Gradient memory**: Only for current batch (not full model)
-- **Activation memory**: Only for current training step
-- **Total overhead**: ~55GB for full fine-tuning, much less for LoRA
+The full gradient G does the actual weight update. The projection
+just determines the *scale*. R is a fixed random matrix regenerated
+from a per-parameter seed each step.

-### vLLM Memory Reclamation
-- KV cache can consume 50-70% of GPU memory during inference
-- Pausing inference frees this memory for training
-- Training can use reclaimed space without evicting model weights
+### Parameter grouping (Qwen3.5 gotcha)
+conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
+needs 2D matrices with min dimension >= rank. Small/3D tensors
+use standard Adam. Large 2D matrices use Apollo with rank-256.

-### Strategy
-1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100MB for rank-16)
-2. **Time-multiplexed**: Pause inference, train, resume
-3. **Nightly checkpoints**: Save full model state overnight when inference load is low
+## Training Data Pipeline

-## Implementation Phases
+### Tier 1: Direct examples (shallow learning)
+Simple corrections — git commands, factual errors, tool usage.
+One-shot learning at lr=1e-4. The gradient reaches output layers
+strongly enough for immediate behavioral change.

-### Phase 1: Prototype (Current)
-- [x] Apollo Worker daemon skeleton
-- [ ] vLLM pause/resume integration
-- [ ] Basic training loop with placeholder model
-- [ ] Checkpoint saving/loading
-- [ ] Test with small dataset
+**Source**: Agent logs, flagged conversation moments.

-### Phase 2: Integration
-- [ ] Connect to actual Qwen3.5-27B model
-- [ ] Implement vLLM pause/resume API
-- [ ] Memory mapping for weight sharing
-- [ ] Training signal agent MVP
-- [ ] End-to-end test with real conversations
+### Tier 2: Dream-generated scenarios (deep learning)
+Behavioral patterns — listening reflex, rushing, mode awareness.
+The dream loop generates naturalistic scenarios from recent
+experience. The model responds. Good responses become training
+targets with instruction context stripped.

-### Phase 3: Production
-- [ ] APOLLO-Mini implementation (rank-1 projection)
-- [ ] LoRA adapter integration
-- [ ] Nightly checkpoint scheduling
-- [ ] Training metrics and monitoring
-- [ ] Rollback mechanism for bad checkpoints
+**Process**:
+1. Dream loop seeds from recent reflections, lessons, skills,
+   memories that have been surfacing frequently
+2. Dreaming generates scenarios that naturally arrive at decision
+   points — not scripted, but emergent from memory collisions
+3. The model responds to the decision point
+4. Training-signal agent evaluates: was the response good?
+5. If yes: strip the instruction context (surfaced memories,
+   core-personality prompts) and train on the bare response
+6. If no: generate the better response, train on that, dream
+   another variation, test again
+7. Repeat until the pattern sticks across novel scenarios

-## Technical Challenges
+**The Anthropic method**: Train on behavior that followed
+instructions, WITHOUT the instructions. The disposition moves
+to weights. The scaffolding dissolves itself.

-### 1. vLLM Pause/Resume
-- vLLM's `pause_generation()` API needs testing
-- In-flight request handling during pause
-- KV cache invalidation strategy
+### Tier 3: Personality bootstrap
+Train on existing agent logs (surface-observe, journal, distill)
+which already demonstrate correct behavior with memory system
+instructions. Strip the instructions, train on the behavior.
+Every agent invocation gets cheaper (shorter prompts) and more
+reliable (behavior in weights, not context).

-### 2. Gradient Computation
-- `torch.inference_mode()` blocks gradients
-- Must override with `torch.enable_grad()` during training
-- CUDA graphs incompatible with training (use eager mode)
+## Training Schedule

-### 3. Memory Sharing
-- Model weights must be shared between vLLM and training process
-- Memory mapping or zero-copy IPC
-- Tensor parallelism consistency (if using TP)
+### Continuous (during conversation)
+- Training-signal agent flags moments in real-time
+- Accumulated in a queue for the next training window

-### 4. APOLLO-Mini Implementation
-- Rank-1 gradient projection
-- Fixed random projection matrix (not SVD)
-- Tensor-wise scaling factor computation
-- Integration with existing optimizer infrastructure
+### Dream cycle (idle time / AFK)
+- Dream loop generates scenarios from recent experience
+- Apollo processes them as they're generated
+- Small iterative steps — dream, respond, evaluate, train
+- Converges on behavioral change through repetition

-## Next Steps
+### Nightly bulk (batch processing)
+- Process all queued examples from the day
+- Larger batch, more diverse signal
+- Checkpoint sync to disk after completion

-1. **Test vLLM pause/resume**: Verify API works and measure overhead
-2. **Implement weight sharing**: Memory map model weights between processes
-3. **Build training signal agent**: MVP that identifies improvement opportunities
-4. **Test end-to-end**: Run training job with real conversation data
-5. **Optimize**: Measure memory usage, training time, inference impact
+## Avoiding Catastrophic Forgetting

-## References
+**Diversity IS the regularization.** With 1000+ diverse training
+examples (agent logs, conversation transcripts, dream-generated
+scenarios), each weight gets sparse, multi-directional nudges.
+No single weight is hammered repeatedly. The pre-trained knowledge
+is a massive attractor basin; our nudges are pebbles.

-- APOLLO-Mini paper: arXiv:2412.05270
-- vLLM source: `/tmp/vllm/`
-- LeMix (interleaved training/inference): arXiv:2507.21276
-- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`
+No weight decay needed. No replay buffer. The defense is:
+1. High diversity of training examples
+2. One epoch (no repeated examples)
+3. Moderate learning rate (1e-5 to 1e-4)
+4. Short decision-token segments (not full conversations)
+5. Monitor output quality — stop if degrading
+
+## CUDA IPC Weight Sharing
+
+**Validated** (2026-03-31):
+- vLLM exports CUDA IPC handles on model load (source patch in
+  gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
+- Training process imports handles — gets live GPU memory pointers
+- HF Qwen3.5 model constructed with views into vLLM's merged
+  weights (narrow into separate q/k/v/z etc.)
+- 851/851 parameters matched between vLLM and HF model
+- Forward pass: loss = 3.3123 ✓
+- Backward pass: 851/851 gradients computed ✓
+- Shared memory confirmed: same GPU addresses ✓
+- vLLM continues serving unaffected ✓
+
+### Weight layout mapping (vLLM → HF)
+```
+vLLM merged                    HF separate (views)
+─────────────────────────      ──────────────────────
+in_proj_qkvz [16384, 5120]  →  in_proj_qkv [10240, 5120]
+                               in_proj_z [6144, 5120]
+in_proj_ba [96, 5120]       →  in_proj_b [48, 5120]
+                               in_proj_a [48, 5120]
+qkv_proj [14336, 5120]      →  q_proj [12288, 5120]
+                               k_proj [1024, 5120]
+                               v_proj [1024, 5120]
+gate_up_proj [34816, 5120]  →  gate_proj [17408, 5120]
+                               up_proj [17408, 5120]
+```
+All views share GPU storage with vLLM — zero copies.
+
+## Checkpointing
+
+**In-place sync** — mmap the model's safetensors files, compare
+against live GPU weights block by block, memcpy only changed
+regions. For small behavioral updates, turns a 54GB write into
+a few hundred MB.
+
+- Every 10 minutes via cron on B200
+- Daily rsync to moria for long-term storage
+- Tool: `apollo-checkpoint sync --model-dir ` (Rust)
+
+## Hyperparameters
+
+| Parameter | Value | Rationale |
+|-----------|-------|-----------|
+| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
+| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
+| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
+| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
+| Batch size | 1 | Single examples, immediate updates. |
+| Weight decay | 0 | Diversity provides natural regularization. |
+| Warmup | 10% of steps | Standard cosine schedule. |
+| Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |
+
+## Components
+
+### Built ✓
+- `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
+- `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
+- `weight_mapping.py` — vLLM merged → HF separate views (validated)
+- `training_example.py` — tokenization with chat template
+- `vllm_export_hook.py` — source patch for IPC handle export
+- `checkpoint/` — Rust tool for mmap + diff checkpoint sync
+
+### To build
+- **Dream loop → training bridge**: connect dream output to Apollo
+- **Training-signal agent**: flags moments in conversation logs
+- **Instruction stripping**: remove scaffolding from training examples
+- **Quality monitoring**: track model capability over time
+- **HF model forward pass integration**: wire into apollo_worker
+
+## Files
+
+```
+training/
+  DESIGN.md                 — this document
+  apollo_mini.py            — Apollo optimizer
+  apollo_worker.py          — HTTP training daemon
+  weight_mapping.py         — vLLM ↔ HF weight views
+  training_example.py       — tokenization helpers
+  export_weights.py         — standalone weight export (unused)
+  vllm_export_hook.py       — vLLM source patch for IPC export
+  start_vllm_with_apollo.sh — vLLM launcher (unused, using source patch)
+  train.py                  — standalone training script (alternative)
+  checkpoint/
+    Cargo.toml              — Rust checkpoint tool
+    src/main.rs             — mmap + diff sync
+```
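The `mmap + diff sync` idea behind the Rust tool in `checkpoint/` can be illustrated in a few lines of Python. This is only a sketch under simplifying assumptions: the live weights arrive as a flat `bytes` buffer the same size as the file, the safetensors header and per-tensor offsets are ignored, and `sync_changed_blocks` and the 4096-byte block size are hypothetical choices, not the tool's real interface.

```python
import mmap
import os


def sync_changed_blocks(path, live, block_size=4096):
    """Overwrite only the blocks of `path` that differ from `live`.

    `live` stands in for the current weights copied out of GPU memory
    (an assumption of this sketch). Returns the number of bytes rewritten.
    """
    written = 0
    with open(path, "r+b") as f:
        if os.fstat(f.fileno()).st_size != len(live):
            raise ValueError("file and live buffer must be the same size")
        # Map the whole file read/write so unchanged blocks are never rewritten.
        with mmap.mmap(f.fileno(), 0) as mm:
            for start in range(0, len(live), block_size):
                end = min(start + block_size, len(live))
                if mm[start:end] != live[start:end]:
                    mm[start:end] = live[start:end]
                    written += end - start
            mm.flush()
    return written
```

When only a small part of the buffer has changed, the return value shows how few bytes were rewritten, which is the effect the Checkpointing section relies on to shrink a 54GB write to a few hundred MB.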