DESIGN.md: complete rewrite reflecting validated architecture

HOGWILD (no pause), rank-256, channel scaling, CUDA IPC validated
(851/851 params, forward+backward confirmed), dream-loop-as-trainer,
Anthropic instruction stripping method, diversity as regularization,
in-place checkpoint sync, three-tier training pipeline.
ProofOfConcept 2026-03-31 00:42:53 -04:00
parent 2ecf4e21ff
commit 60e61555c7


# Apollo Training System
## Overview
Continuous fine-tuning of Qwen3.5-27B alongside live vLLM inference.
Full-weight updates (not LoRA) using Apollo optimizer with rank-256
gradient projection. No pause required — HOGWILD concurrent training.
Weights shared via CUDA IPC between vLLM and the training process.
The training signal comes from two sources:
1. **Direct examples** — agent logs, conversation transcripts, flagged
behavioral moments
2. **Dream-generated scenarios** — the dream loop generates situations
from recent experience; the model responds; good responses become
training data with instructions stripped
## Architecture
```
┌─────────────────────────────────────────────────────┐
│ GPU VRAM (192GB) │
│ │
│ ┌──────────────────────────────────────────────┐ │
│ │ Model Weights (54GB, bf16) │ │
│ │ Shared via CUDA IPC │ │
│ └──────────────┬──────────────┬────────────────┘ │
│ │ │ │
│ ┌──────────────▼──┐ ┌───────▼────────────────┐ │
│ │ vLLM (inference)│ │ Apollo (training) │ │
│ │ KV cache ~60GB │ │ Gradients ~54GB │ │
│ │ Serves requests │ │ Optimizer state ~10GB │ │
│ │ Never paused │ │ Activations ~10GB │ │
│ └─────────────────┘ └────────────────────────┘ │
└─────────────────────────────────────────────────────┘
```
```
Moria B200
┌──────────────────┐ ┌──────────────────┐
│ Training signal │ HTTP │ Apollo worker │
│ agent │──────────>│ daemon │
│ │ │ │
│ Dream loop │ │ Checkpoint sync │
│ (generates │ │ (mmap + diff, │
│ scenarios) │ │ every 10 min) │
└──────────────────┘ └──────────────────┘
```
## Key Decisions
### No pause needed (HOGWILD)
Training updates weights in-place while vLLM serves. At lr = 1e-5
to 1e-4, each weight moves by parts per ten thousand per step. A
partially applied update during one inference pass is invisible. HOGWILD SGD
(2011) proved this converges — we have one writer and one reader,
which is even safer.
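
A minimal sketch of the write side, assuming the trainer holds the same GPU tensors vLLM reads (the function name is illustrative):

```
import torch

@torch.no_grad()
def hogwild_apply(param: torch.Tensor, update: torch.Tensor, lr: float) -> None:
    """In-place update on a tensor shared with vLLM via CUDA IPC.

    No lock is taken: an inference pass that overlaps this write may
    see a partially updated tensor, which at lr ~ 1e-5 perturbs each
    weight by parts per ten thousand. That is the HOGWILD bet.
    """
    param.add_(update, alpha=-lr)  # one writer; vLLM is the only reader
```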
### Full-weight training, not LoRA
Kent: "we want you to be able to learn new things in a deep way."
LoRA trains adapter matrices, not base weights. For personality and
behavioral changes that persist as disposition, the base weights
need to change. Apollo makes this memory-feasible.
### Rank 256
Not Mini (rank-1). With 100+ diverse training examples, the
gradient's effective dimensionality can reach hundreds. Rank-256
captures the structure. Memory cost: ~10GB (negligible on B200).
Compute cost: <0.25% of forward+backward.
### Channel-wise scaling
Per-channel scaling factors instead of per-tensor. More precision
per update, matching LLaMA-Factory's Apollo defaults.
## Apollo Optimizer
Configurable-rank gradient projection with Adam moments in the
projected space. For each parameter tensor:
```
1. Project gradient:  g_proj = G @ R              [m,n] @ [n,rank] → [m,rank]
2. Update moments:    m = β₁m + (1-β₁)g_proj
                      v = β₂v + (1-β₂)g_proj²
3. Adam step:         update = m̂ / (√v̂ + ε)
4. Scaling factor:    s = ‖update‖ / (‖g_proj‖ + ε)   (per channel)
5. Weight update:     W -= lr × s × G
```
The full gradient G does the actual weight update. The projection
just determines the *scale*. R is a fixed random matrix regenerated
from a per-parameter seed each step.
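
A sketch of one step for a single 2D weight, following the five steps above and including the per-channel scale from the Channel-wise scaling decision. Hyperparameter names, the 1/√rank normalization, and the buffer handling are assumptions; `apollo_mini.py` is the authoritative implementation.

```
import torch

def apollo_step(W, G, m, v, step, *, lr, rank=256,
                beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """One Apollo update for a 2D weight W ([out, in]) with gradient G."""
    # 1. Project: R is regenerated from this parameter's seed, never stored
    gen = torch.Generator(device=G.device).manual_seed(seed)
    R = torch.randn(G.shape[1], rank, generator=gen,
                    device=G.device, dtype=G.dtype) / rank ** 0.5
    g_proj = G @ R                                       # [out, rank]

    # 2. Adam moments live only in the projected space ([out, rank])
    m.mul_(beta1).add_(g_proj, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g_proj, g_proj, value=1 - beta2)

    # 3. Bias-corrected Adam step (step counts from 1)
    update = (m / (1 - beta1**step)) / ((v / (1 - beta2**step)).sqrt() + eps)

    # 4. One scaling factor per output channel (row)
    s = update.norm(dim=1) / (g_proj.norm(dim=1) + eps)  # [out]

    # 5. The full gradient does the update; the projection only set the scale
    with torch.no_grad():
        W.add_(s.unsqueeze(1) * G, alpha=-lr)
```

The `m` and `v` buffers are `[out, rank]` tensors allocated once per parameter; summed across the model at rank 256 they are the ~10GB of optimizer state cited above.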
### Parameter grouping (Qwen3.5 gotcha)
conv1d weights are 3D tensors [10240, 1, 4]. Apollo's projector
needs 2D matrices with min dimension >= rank. Small/3D tensors
use standard Adam. Large 2D matrices use Apollo with rank-256.
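
A sketch of the grouping rule (group names are illustrative):

```
def split_param_groups(model, rank=256):
    apollo, adam = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # Apollo's projector needs a 2D matrix whose smaller side
        # fits the rank; conv1d weights like [10240, 1, 4] do not.
        if p.dim() == 2 and min(p.shape) >= rank:
            apollo.append(p)
        else:
            adam.append(p)  # small / 3D tensors fall back to standard Adam
    return apollo, adam
```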
## Training Data Pipeline
### Tier 1: Direct examples (shallow learning)
Simple corrections — git commands, factual errors, tool usage.
One-shot learning at lr=1e-4. The gradient reaches output layers
strongly enough for immediate behavioral change.
**Source**: Agent logs, flagged conversation moments.
### Tier 2: Dream-generated scenarios (deep learning)
Behavioral patterns — listening reflex, rushing, mode awareness.
The dream loop generates naturalistic scenarios from recent
experience. The model responds. Good responses become training
targets with instruction context stripped.
**Process**:
1. Dream loop seeds from recent reflections, lessons, skills,
memories that have been surfacing frequently
2. Dreaming generates scenarios that naturally arrive at decision
points — not scripted, but emergent from memory collisions
3. The model responds to the decision point
4. Training-signal agent evaluates: was the response good?
5. If yes: strip the instruction context (surfaced memories,
core-personality prompts) and train on the bare response
6. If no: generate the better response, train on that, dream
another variation, test again
7. Repeat until the pattern sticks across novel scenarios
**The Anthropic method**: Train on behavior that followed
instructions, WITHOUT the instructions. The disposition moves
to weights. The scaffolding dissolves itself.
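
A sketch of the stripping step, assuming a chat-format example whose system and memory-injection messages carry the scaffolding (the `channel` field and function name are hypothetical):

```
def strip_instructions(example: dict) -> list[dict]:
    """Keep the bare exchange; drop surfaced memories and persona prompts.

    The model is then trained to produce the same response *without*
    the scaffolding, so the disposition moves into the weights.
    """
    kept = []
    for msg in example["messages"]:
        if msg["role"] == "system":
            continue                 # core-personality prompt
        if msg.get("channel") == "memory":
            continue                 # surfaced-memory injection (assumed field)
        kept.append(msg)
    return kept
```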
### Tier 3: Personality bootstrap
Train on existing agent logs (surface-observe, journal, distill)
which already demonstrate correct behavior with memory system
instructions. Strip the instructions, train on the behavior.
Every agent invocation gets cheaper (shorter prompts) and more
reliable (behavior in weights, not context).
## Training Schedule
### Continuous (during conversation)
- Training-signal agent flags moments in real-time
- Accumulated in a queue for the next training window
### Dream cycle (idle time / AFK)
- Dream loop generates scenarios from recent experience
- Apollo processes them as they're generated
- Small iterative steps — dream, respond, evaluate, train
- Converges on behavioral change through repetition
### Nightly bulk (batch processing)
- Process all queued examples from the day
- Larger batch, more diverse signal
- Checkpoint sync to disk after completion
## Avoiding Catastrophic Forgetting
**Diversity IS the regularization.** With 1000+ diverse training
examples (agent logs, conversation transcripts, dream-generated
scenarios), each weight gets sparse, multi-directional nudges.
No single weight is hammered repeatedly. The pre-trained knowledge
is a massive attractor basin; our nudges are pebbles.
No weight decay needed. No replay buffer. The defense is:
1. High diversity of training examples
2. One epoch (no repeated examples)
3. Moderate learning rate (1e-5 to 1e-4)
4. Short decision-token segments (not full conversations)
5. Monitor output quality — stop if degrading
## CUDA IPC Weight Sharing
**Validated** (2026-03-31):
- vLLM exports CUDA IPC handles on model load (source patch in
gpu_model_runner.py exports to /tmp/vllm_weight_handles.pt)
- Training process imports handles — gets live GPU memory pointers
- HF Qwen3.5 model constructed with views into vLLM's merged
weights (narrow into separate q/k/v/z etc.)
- 851/851 parameters matched between vLLM and HF model
- Forward pass: loss = 3.3123 ✓
- Backward pass: 851/851 gradients computed ✓
- Shared memory confirmed: same GPU addresses ✓
- vLLM continues serving unaffected ✓
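
The mechanism, sketched with `torch.multiprocessing.reductions` (the real export lives in the `vllm_export_hook.py` source patch; treat this as the shape of the handshake, not its exact code):

```
import torch
from torch.multiprocessing.reductions import reduce_tensor

# Exporter (inside the vLLM process, at model-load time):
def export_handles(named_params, path="/tmp/vllm_weight_handles.pt"):
    # reduce_tensor returns a (rebuild_fn, args) pair carrying the IPC handle
    handles = {name: reduce_tensor(p.data) for name, p in named_params}
    torch.save(handles, path)

# Importer (training process): rebuilds live GPU pointers, copies nothing.
def import_handles(path="/tmp/vllm_weight_handles.pt"):
    handles = torch.load(path)
    return {name: fn(*args) for name, (fn, args) in handles.items()}
```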
### Weight layout mapping (vLLM → HF)
```
vLLM merged HF separate (views)
───────────────────────── ──────────────────────
in_proj_qkvz [16384, 5120] → in_proj_qkv [10240, 5120]
in_proj_z [6144, 5120]
in_proj_ba [96, 5120] → in_proj_b [48, 5120]
in_proj_a [48, 5120]
qkv_proj [14336, 5120] → q_proj [12288, 5120]
k_proj [1024, 5120]
v_proj [1024, 5120]
gate_up_proj [34816, 5120] → gate_proj [17408, 5120]
up_proj [17408, 5120]
```
All views share GPU storage with vLLM — zero copies.
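
The views are plain `narrow()` slices into the merged tensors. For example, one layer's attention projection (the parameter name is illustrative):

```
# qkv_proj [14336, 5120] → q/k/v views, zero-copy
qkv = vllm_weights["model.layers.0.self_attn.qkv_proj.weight"]
q = qkv.narrow(0, 0, 12288)             # q_proj [12288, 5120]
k = qkv.narrow(0, 12288, 1024)          # k_proj [1024, 5120]
v = qkv.narrow(0, 13312, 1024)          # v_proj [1024, 5120]
assert q.data_ptr() == qkv.data_ptr()   # same GPU storage
```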
## Checkpointing
**In-place sync** — mmap the model's safetensors files, compare
against live GPU weights block by block, memcpy only changed
regions. For small behavioral updates, this turns a 54GB write
into a few hundred MB.
- Every 10 minutes via cron on B200
- Daily rsync to moria for long-term storage
- Tool: `apollo-checkpoint sync --model-dir <path>` (Rust)
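
The sync logic, sketched in Python for clarity (the shipped tool is the Rust `apollo-checkpoint`; the 1 MiB block size and byte-offset interface are assumptions):

```
import mmap

BLOCK = 1 << 20  # 1 MiB compare granularity (assumed)

def sync_tensor(path: str, offset: int, live: bytes) -> int:
    """Compare a live tensor's bytes against its on-disk safetensors
    region and memcpy back only the blocks that changed."""
    written = 0
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as mapped:
            for i in range(0, len(live), BLOCK):
                chunk = live[i:i + BLOCK]
                if mapped[offset + i:offset + i + len(chunk)] != chunk:
                    mapped[offset + i:offset + i + len(chunk)] = chunk
                    written += len(chunk)
    return written
```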
## Hyperparameters
| Parameter | Value | Rationale |
|-----------|-------|-----------|
| Learning rate | 1e-5 to 1e-4 | Standard for full fine-tuning. Higher for diverse batches. |
| Rank | 256 | Captures gradient structure across 100+ examples. ~10GB state. |
| Scale type | channel | Per-channel precision, matches LLaMA-Factory defaults. |
| Epochs | 1 | One pass over diverse data. Multiple epochs risk overfitting. |
| Batch size | 1 | Single examples, immediate updates. |
| Weight decay | 0 | Diversity provides natural regularization. |
| Warmup | 10% of steps | Standard cosine schedule. |
| Beta1/Beta2 | 0.9/0.999 | Standard Adam momentum. |
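
Collected as a config sketch (the dict name and key spellings are assumptions, not `apollo_mini.py`'s actual interface):

```
APOLLO_DEFAULTS = dict(
    lr=1e-5,              # up to 1e-4 for large, diverse batches
    rank=256,
    scale_type="channel",
    epochs=1,             # single pass; no example repeats
    batch_size=1,
    weight_decay=0.0,     # diversity is the regularizer
    warmup_ratio=0.10,    # cosine schedule
    betas=(0.9, 0.999),
)
```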
## Components
### Built ✓
- `apollo_mini.py` — Apollo optimizer (configurable rank, default 256)
- `apollo_worker.py` — HTTP daemon (aiohttp, job tracking)
- `weight_mapping.py` — vLLM merged → HF separate views (validated)
- `training_example.py` — tokenization with chat template
- `vllm_export_hook.py` — source patch for IPC handle export
- `checkpoint/` — Rust tool for mmap + diff checkpoint sync
### To build
- **Dream loop → training bridge**: connect dream output to Apollo
- **Training-signal agent**: flags moments in conversation logs
- **Instruction stripping**: remove scaffolding from training examples
- **Quality monitoring**: track model capability over time
- **HF model forward pass integration**: wire into apollo_worker
## Files
```
training/
DESIGN.md — this document
apollo_mini.py — Apollo optimizer
apollo_worker.py — HTTP training daemon
weight_mapping.py — vLLM ↔ HF weight views
training_example.py — tokenization helpers
export_weights.py — standalone weight export (unused)
vllm_export_hook.py — vLLM source patch for IPC export
start_vllm_with_apollo.sh — vLLM launcher (unused, using source patch)
train.py — standalone training script (alternative)
checkpoint/
Cargo.toml — Rust checkpoint tool
src/main.rs — mmap + diff sync
```

## References

- Apollo optimizer paper: arXiv:2412.05270
- LeMix (interleaved training/inference): arXiv:2507.21276
- vLLM source: `/tmp/vllm/`
- Research notes: `/home/kent/.claude/projects/-home-kent-bcachefs-tools/memory/research-apollo-vllm-finetuning.md`