DESIGN.md: complete rewrite reflecting validated architecture

HOGWILD (no pause), rank-256, channel scaling, CUDA IPC validated (851/851 params, forward+backward confirmed), dream-loop-as-trainer, Anthropic instruction stripping method, diversity as regularization, in-place checkpoint sync, three-tier training pipeline.
2026-03-31 00:42:53 -04:00 · 2026-03-31 00:42:53 -04:00 · 60e61555c7
commit 60e61555c7
parent 2ecf4e21ff
1 changed files with 239 additions and 183 deletions
--- a/training/DESIGN.md
+++ b/training/DESIGN.md
@ -1,197 +1,253 @@
-# Apollo Mini Training System Design
+# Apollo Training System
 ## Overview
-This system enables continuous fine-tuning of the Qwen3.5-27B model while maintaining inference capability through vLLM. The key insight is that APOLLO-Mini's near-zero optimizer state (kilobytes for a 7B model) combined with LoRA adapters makes the memory overhead small enough to fit within vLLM's reclaimed KV cache space.
+Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
 Full-weight updates (not LoRA) using Apollo optimizer with rank-256
 gradient projection. No pause required — HOGWILD concurrent training.
 Weights shared via CUDA IPC between vLLM and the training process.
 The training signal comes from two sources:
 1. **Direct examples** — agent logs, conversation transcripts, flagged
   behavioral moments
 2. **Dream-generated scenarios** — the dream loop generates situations
   from recent experience; the model responds; good responses become
   training data with instructions stripped
 ## Architecture
 ### Components
 1. **Apollo Worker Daemon** (`apollo_worker.py`)
   - Listens over HTTP/HTTPS for training requests
   - Manages vLLM pause/resume cycle
   - Executes APOLLO-Mini training with `torch.enable_grad()`
   - Saves checkpoints and training metadata
   - Runs on the B200 server alongside vLLM
 2. **Training Signal Agent** (to be built)
   - Runs online like surface-observe
   - Analyzes recent conversation windows
   - Identifies improvement opportunities
   - Requests training from Apollo Worker
   - Runs on Moria (separate from B200)
 3. **vLLM Inference Engine**
   - Continues serving during non-training periods
   - Pauses during training steps
   - Shares GPU memory with training process
 ### Communication Protocol
 ```
-POST /train
+┌─────────────────────────────────────────────────────┐
-{
+│                    GPU VRAM (192GB)                  │
-  "training_data": {
+│                                                     │
-    "samples": [
+│  ┌──────────────────────────────────────────────┐   │
-      {
+│  │        Model Weights (54GB, bf16)            │   │
-        "input": "conversation context",
+│  │        Shared via CUDA IPC                   │   │
-        "expected_output": "better response",
+│  └──────────────┬──────────────┬────────────────┘   │
-        "rationale": "why this is better"
+│                 │              │                     │
-      }
+│  ┌──────────────▼──┐  ┌───────▼────────────────┐   │
-    ],
+│  │ vLLM (inference)│  │ Apollo (training)       │   │
-    "config": {
+│  │ KV cache ~60GB  │  │ Gradients ~54GB         │   │
-      "learning_rate": 1e-5,
+│  │ Serves requests │  │ Optimizer state ~10GB   │   │
-      "max_steps": 100
+│  │ Never paused    │  │ Activations ~10GB       │   │
-    }
+│  └─────────────────┘  └────────────────────────┘   │
-  }
+└─────────────────────────────────────────────────────┘
 }
-Response:
+Moria                          B200
-{
+┌──────────────────┐           ┌──────────────────┐
-  "job_id": "job_20260331_012345_12345",
+│ Training signal  │  HTTP     │ Apollo worker    │
-  "status": "accepted",
+│ agent            │──────────>│ daemon           │
-  "message": "Training job started"
+│                  │           │                  │
-}
+│ Dream loop       │           │ Checkpoint sync  │
-
+│ (generates       │           │ (mmap + diff,    │
-GET /status/{job_id}
+│  scenarios)      │           │  every 10 min)   │
-Response:
+└──────────────────┘           └──────────────────┘
 {
  "job_id": "job_20260331_012345_12345",
  "status": "completed",
  "training_samples": 50,
  "loss_history": [0.5, 0.45, 0.42, ...],
  "checkpoint_path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt"
 }
 GET /checkpoints
 Response:
 {
  "checkpoints": [
    {
      "filename": "checkpoint_job_20260331_012345_12345.pt",
      "path": "/home/kent/poc/consciousness/training/checkpoints/checkpoint_job_20260331_012345_12345.pt",
      "created_at": "2026-03-31T01:23:45",
      "size": 55000000000
    }
  ]
 }
 ```
-## Training Pipeline
+## Key Decisions
-### 1. Signal Detection
+### No pause needed (HOGWILD)
- Training signal agent monitors conversation logs
+Training updates weights in-place while vLLM serves. At lr=1e-4
- Identifies patterns where PoC could improve:
+to 1e-5, each weight changes by parts per ten thousand. A partially
-  - Responses that needed memories to get right
+applied update during one inference step is invisible. HOGWILD SGD
-  - Things that could be done better with more time/context
+(2011) proved this converges — we have one writer and one reader,
-  - High-frequency memory accesses indicating knowledge gaps
+which is even safer.
 - Builds training dataset with input/expected_output/rationale
-### 2. Training Request
+### Full-weight training, not LoRA
- Agent sends POST /train with training samples
+Kent: "we want you to be able to learn new things in a deep way."
- Apollo Worker accepts job and begins execution
+LoRA trains adapter matrices, not base weights. For personality and
 behavioral changes that persist as disposition, the base weights
 need to change. Apollo makes this memory-feasible.
-### 3. vLLM Pause
+### Rank 256
- Apollo Worker signals vLLM to pause inference
+Not Mini (rank-1). With 100+ diverse training examples, the
- vLLM freezes in-flight requests
+gradient's effective dimensionality can reach hundreds. Rank-256
- GPU memory freed from KV cache becomes available
+captures the structure. Memory cost: ~10GB (negligible on B200).
 Compute cost: <0.25% of forward+backward.
-### 4. Model Loading & Training
+### Channel-wise scaling
- Load model weights (shared with vLLM via memory mapping)
+Per-channel scaling factors instead of per-tensor. More precision
- Enable gradients: `torch.enable_grad()`
+per update, matching LLaMA-Factory's Apollo defaults.
 - Run APOLLO-Mini training loop:
  - Project gradients into rank-1 subspace
  - Update moments in projected space
  - Compute tensor-wise scaling factor
  - Apply updates to full gradient
 - Track loss history
-### 5. Checkpoint Saving
+## Apollo Optimizer
 - Save model state dict
 - Record training metadata (samples, loss history, job ID)
 - Store in checkpoint directory
-### 6. vLLM Resume
+Configurable-rank gradient projection with Adam moments in the
- Signal vLLM to resume inference
+projected space. For each parameter tensor:
 - KV cache rebuilt as new requests arrive
 - Updated weights now active in inference
-## Memory Management
+```
 1. Project gradient:  g_proj = G @ R        [m,n] @ [n,rank] → [m,rank]
 2. Update moments:    m = β₁m + (1-β₁)g_proj
                      v = β₂v + (1-β₂)g_proj²
 3. Adam step:         update = m̂ / (√v̂ + ε)
 4. Scaling factor:    s = ‖update‖ / (‖g_proj‖ + ε)   (per channel)
 5. Weight update:     W -= lr × s × G
 ```
-### APOLLO-Mini Advantages
+The full gradient G does the actual weight update. The projection
- **Optimizer state**: ~kilobytes (vs. gigabytes for AdamW)
+just determines the *scale*. R is a fixed random matrix regenerated
- **Gradient memory**: Only for current batch (not full model)
+from a per-parameter seed each step.
 - **Activation memory**: Only for current training step
 - **Total overhead**: ~55GB for full fine-tuning, much less for LoRA
-### vLLM Memory Reclamation
+### Parameter grouping (Qwen3.5 gotcha)
- KV cache can consume 50-70% of GPU memory during inference
+conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
- Pausing inference frees this memory for training
+needs 2D matrices with min dimension >= rank. Small/3D tensors
- Training can use reclaimed space without evicting model weights
+use standard Adam. Large 2D matrices use Apollo with rank-256.
-### Strategy
+## Training Data Pipeline
 1. **LoRA + APOLLO-Mini**: Train only adapter parameters (~100MB for rank-16)
 2. **Time-multiplexed**: Pause inference, train, resume
 3. **Nightly checkpoints**: Save full model state overnight when inference load is low
-## Implementation Phases
+### Tier 1: Direct examples (shallow learning)
 Simple corrections — git commands, factual errors, tool usage.
 One-shot learning at lr=1e-4. The gradient reaches output layers
 strongly enough for immediate behavioral change.
-### Phase 1: Prototype (Current)
+**Source**: Agent logs, flagged conversation moments.
 - [x] Apollo Worker daemon skeleton
 - [ ] vLLM pause/resume integration
 - [ ] Basic training loop with placeholder model
 - [ ] Checkpoint saving/loading
 - [ ] Test with small dataset
-### Phase 2: Integration
+### Tier 2: Dream-generated scenarios (deep learning)
- [ ] Connect to actual Qwen3.5-27B model
+Behavioral patterns — listening reflex, rushing, mode awareness.
- [ ] Implement vLLM pause/resume API
+The dream loop generates naturalistic scenarios from recent
- [ ] Memory mapping for weight sharing
+experience. The model responds. Good responses become training
- [ ] Training signal agent MVP
+targets with instruction context stripped.
 - [ ] End-to-end test with real conversations
-### Phase 3: Production
+**Process**:
- [ ] APOLLO-Mini implementation (rank-1 projection)
+1. Dream loop seeds from recent reflections, lessons, skills,
- [ ] LoRA adapter integration
+   memories that have been surfacing frequently
- [ ] Nightly checkpoint scheduling
+2. Dreaming generates scenarios that naturally arrive at decision
- [ ] Training metrics and monitoring
+   points — not scripted, but emergent from memory collisions
- [ ] Rollback mechanism for bad checkpoints
+3. The model responds to the decision point
 4. Training-signal agent evaluates: was the response good?
 5. If yes: strip the instruction context (surfaced memories,
   core-personality prompts) and train on the bare response
 6. If no: generate the better response, train on that, dream
   another variation, test again
 7. Repeat until the pattern sticks across novel scenarios
-## Technical Challenges
+**The Anthropic method**: Train on behavior that followed
 instructions, WITHOUT the instructions. The disposition moves
 to weights. The scaffolding dissolves itself.
-### 1. vLLM Pause/Resume
+### Tier 3: Personality bootstrap
- vLLM's `pause_generation()` API needs testing
+Train on existing agent logs (surface-observe, journal, distill)
- In-flight request handling during pause
+which already demonstrate correct behavior with memory system
- KV cache invalidation strategy
+instructions. Strip the instructions, train on the behavior.
 Every agent invocation gets cheaper (shorter prompts) and more
 reliable (behavior in weights, not context).
-### 2. Gradient Computation
+## Training Schedule
 - `torch.inference_mode()` blocks gradients
 - Must override with `torch.enable_grad()` during training
 - CUDA graphs incompatible with training (use eager mode)
-### 3. Memory Sharing
+### Continuous (during conversation)
- Model weights must be shared between vLLM and training process
+- Training-signal agent flags moments in real-time
- Memory mapping or zero-copy IPC
+- Accumulated in a queue for the next training window
 - Tensor parallelism consistency (if using TP)
-### 4. APOLLO-Mini Implementation
+### Dream cycle (idle time / AFK)
- Rank-1 gradient projection
+- Dream loop generates scenarios from recent experience
- Fixed random projection matrix (not SVD)
+- Apollo processes them as they're generated
- Tensor-wise scaling factor computation
+- Small iterative steps — dream, respond, evaluate, train
- Integration with existing optimizer infrastructure
+- Converges on behavioral change through repetition
-## Next Steps
+### Nightly bulk (batch processing)
 - Process all queued examples from the day
 - Larger batch, more diverse signal
 - Checkpoint sync to disk after completion
-1. **Test vLLM pause/resume**: Verify API works and measure overhead
+## Avoiding Catastrophic Forgetting
 2. **Implement weight sharing**: Memory map model weights between processes
 3. **Build training signal agent**: MVP that identifies improvement opportunities
 4. **Test end-to-end**: Run training job with real conversation data
 5. **Optimize**: Measure memory usage, training time, inference impact
-## References
+**Diversity IS the regularization.** With 1000+ diverse training
 examples (agent logs, conversation transcripts, dream-generated
 scenarios), each weight gets sparse, multi-directional nudges.
 No single weight is hammered repeatedly. The pre-trained knowledge
 is a massive attractor basin; our nudges are pebbles.
- APOLLO-Mini paper: arXiv:2412.05270
+No weight decay needed. No replay buffer. The defense is:
- vLLM source: `/tmp/vllm/`
+1. High diversity of training examples
- LeMix (interleaved training/inference): arXiv:2507.21276
+2. One epoch (no repeated examples)
- Research document: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`
+3. Moderate learning rate (1e-5 to 1e-4)
 4. Short decision-token segments (not full conversations)
 5. Monitor output quality — stop if degrading
 ## CUDA IPC Weight Sharing
 **Validated** (2026-03-31):
 - vLLM exports CUDA IPC handles on model load (source patch in
  gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
 - Training process imports handles — gets live GPU memory pointers
 - HF Qwen3.5 model constructed with views into vLLM's merged
  weights (narrow into separate q/k/v/z etc.)
 - 851/851 parameters matched between vLLM and HF model
 - Forward pass: loss = 3.3123 ✓
 - Backward pass: 851/851 gradients computed ✓
 - Shared memory confirmed: same GPU addresses ✓
 - vLLM continues serving unaffected ✓
 ### Weight layout mapping (vLLM → HF)
 ```
 vLLM merged                    HF separate (views)
 ─────────────────────────      ──────────────────────
 in_proj_qkvz [16384, 5120]  →  in_proj_qkv [10240, 5120]
                                in_proj_z    [6144, 5120]
 in_proj_ba   [96, 5120]     →  in_proj_b    [48, 5120]
                                in_proj_a    [48, 5120]
 qkv_proj     [14336, 5120]  →  q_proj       [12288, 5120]
                                k_proj       [1024, 5120]
                                v_proj       [1024, 5120]
 gate_up_proj [34816, 5120]  →  gate_proj    [17408, 5120]
                                up_proj      [17408, 5120]
 ```
 All views share GPU storage with vLLM — zero copies.
 ## Checkpointing
 **In-place sync** — mmap the model's safetensors files, compare
 against live GPU weights block by block, memcpy only changed
 regions. For small behavioral updates, turns a 54GB write into
 a few hundred MB.
 - Every 10 minutes via cron on B200
 - Daily rsync to moria for long-term storage
 - Tool: `apollo-checkpoint sync --model-dir <path>` (Rust)
 ## Hyperparameters
 | Parameter | Value | Rationale |
 |-----------|-------|-----------|
 | Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
 | Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
 | Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
 | Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
 | Batch size | 1 | Single examples, immediate updates. |
 | Weight decay | 0 | Diversity provides natural regularization. |
 | Warmup | 10% of steps | Standard cosine schedule. |
 | Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |
 ## Components
 ### Built ✓
 - `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
 - `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
 - `weight_mapping.py` — vLLM merged → HF separate views (validated)
 - `training_example.py` — tokenization with chat template
 - `vllm_export_hook.py` — source patch for IPC handle export
 - `checkpoint/` — Rust tool for mmap + diff checkpoint sync
 ### To build
 - **Dream loop → training bridge**: connect dream output to Apollo
 - **Training-signal agent**: flags moments in conversation logs
 - **Instruction stripping**: remove scaffolding from training examples
 - **Quality monitoring**: track model capability over time
 - **HF model forward pass integration**: wire into apollo_worker
 ## Files
 ```
 training/
  DESIGN.md                 — this document
  apollo_mini.py            — Apollo optimizer
  apollo_worker.py          — HTTP training daemon
  weight_mapping.py         — vLLM ↔ HF weight views
  training_example.py       — tokenization helpers
  export_weights.py         — standalone weight export (unused)
  vllm_export_hook.py       — vLLM source patch for IPC export
  start_vllm_with_apollo.sh — vLLM launcher (unused, using source patch)
  train.py                  — standalone training script (alternative)
  checkpoint/
    Cargo.toml              — Rust checkpoint tool
    src/main.rs             — mmap + diff sync
 ```