research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY; Q1, Q3, Q6, and Q7 are answerable with the first training run. Speculation converted to testable hypotheses.
This commit is contained in:
parent 8061cc0477
commit e10477a683

16 changed files with 249 additions and 0 deletions

training/research/v0/continual-learning-survey-insights.md (new file, 185 lines)

@@ -0,0 +1,185 @@
# Continual Learning Survey: Key Insights for Our System

Source: Shi et al., "Continual Learning of Large Language Models: A Comprehensive Survey" (arXiv:2404.16789, Nov 2025)

## Where We Sit in the Taxonomy

The survey identifies four sub-categories of continual fine-tuning:

1. **Continual Instruction Tuning (CIT)**: learning new instructions
2. **Continual Model Refinement (CMR)**: correcting errors
3. **Continual Model Alignment (CMA)**: aligning with preferences
4. **Continual Multimodal LLMs (CMLLMs)**: multimodal adaptation

We're doing **CMA**: continual model alignment with behavioral preferences. The survey notes this is an emerging area with limited research. We're genuinely at the frontier.

## Key Validated Insights

### 1. The flat-loss basin is our friend (p.17)

"Pre-trained weights initially position the model in a flat-loss basin, aiding adaptation."

This means:

- The pre-trained Qwen3.5 model is ALREADY in a flat basin
- Apollo's flat-minimum-seeking behavior KEEPS it there
- Behavioral fine-tuning nudges within the basin, not out of it
- Catastrophic forgetting = falling out of the basin
- Our multi-scale defense keeps us inside

### 2. Data quality > data quantity (p.15)

"Ensuring novelty and diversity... utilizes only 10% of the originally collected data yet outperforms models trained on the entire dataset."

For us:

- 100 high-quality dream-generated scenarios > 1000 mediocre ones
- The dream loop should prioritize NOVEL and DIVERSE scenarios
- Don't repeat examples; each dream should be unique
- Quality of the training-signal agent's evaluation matters more than volume of training data

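The novelty requirement can be enforced mechanically. A minimal sketch, assuming a greedy filter over token-set Jaccard similarity (`filter_novel` and the 0.6 threshold are hypothetical; a real pipeline would compare embeddings rather than word sets):

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap between two scenarios (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_novel(scenarios, threshold=0.6):
    """Greedily keep a scenario only if it differs enough from every
    scenario already accepted."""
    kept, seen = [], []
    for s in scenarios:
        tokens = set(s.lower().split())
        if all(jaccard(tokens, t) < threshold for t in seen):
            kept.append(s)
            seen.append(tokens)
    return kept

dreams = [
    "user asks for a quick fix and the agent rushes",
    "user asks for a quick fix and the agent rushes ahead",  # near-duplicate
    "agent pauses to ask a clarifying question first",
]
print(len(filter_novel(dreams)))  # → 2 (the near-duplicate is dropped)
```
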
### 3. The 25% replay ratio (p.13, Me-Llama)

Me-Llama mixes 25% general-domain data into domain-specific fine-tuning and achieves **positive backward transfer**: learning new things actually IMPROVED old capabilities.

For us:

- Include ~20-25% general-capability examples in each training batch
- Agent logs serve this role; they exercise general reasoning, code understanding, and graph walking alongside behavioral examples
- This isn't just defense against forgetting; it's IMPROVEMENT of existing capabilities through cross-pollination

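The mixing step can be sketched as a batch assembler. The 25% figure follows the Me-Llama recipe; `mix_batch` and its example pools are illustrative, not our actual data loader:

```python
import random

def mix_batch(behavioral, general, batch_size=32, replay_ratio=0.25, seed=0):
    """Assemble one training batch with ~replay_ratio general-domain
    examples (agent logs) mixed into the behavioral examples."""
    rng = random.Random(seed)
    n_general = round(batch_size * replay_ratio)
    batch = rng.sample(general, n_general) \
          + rng.sample(behavioral, batch_size - n_general)
    rng.shuffle(batch)  # avoid a fixed replay-first ordering
    return batch

batch = mix_batch([f"dream-{i}" for i in range(100)],
                  [f"agent-log-{i}" for i in range(100)])
print(sum(x.startswith("agent-log") for x in batch))  # → 8 (25% of 32)
```
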
### 4. The five technique categories

The survey categorizes CL techniques into five families:

| Category | Our approach | Notes |
|----------|--------------|-------|
| Replay | Diverse training set | Agent logs as replay buffer |
| Regularization | None (diversity instead) | Survey confirms EWC helps but adds memory |
| Architecture | Full-weight (not LoRA) | More capacity for deep change |
| Optimization | Apollo (rank-256) | Flat minima, channel-wise scaling |
| Representation | Context-frozen | Protects context encoding |

We use techniques from THREE categories simultaneously (replay, optimization, representation). The survey doesn't discuss combining multiple categories; most papers use one. Our multi-scale approach may be novel.

### 5. The evaluation metrics we should track

| Metric | Definition | Our measurement |
|--------|------------|-----------------|
| Overall Performance (OP) | Average across all tasks | Perplexity + behavioral tests |
| Forgetting (F) | Worst drop on old tasks | Monitor code quality, reasoning |
| Backward Transfer (BWT) | Does new learning improve old tasks? | Does listening improve code quality? |
| Forward Transfer (FWT) | Does old knowledge help new tasks? | Does coding skill help learn listening? |

**BWT is the exciting metric.** If learning to listen (behavioral training) improves code quality (because both require careful attention), that's positive backward transfer. Apollo's flat minima should facilitate this: the behavioral change occupies a broad basin that encompasses improved performance on related tasks.

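For concreteness, BWT under the standard continual-learning definition, computed from an accuracy matrix `R` where `R[i][j]` is performance on task j after training through task i (the two-task numbers below are invented for illustration):

```python
def bwt(R):
    """Backward transfer from an accuracy matrix R (standard CL metric):
    average change on earlier tasks after the final task is learned.
    Positive BWT means later training improved earlier capabilities."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# Task 0 = code quality, task 1 = behavioral listening (numbers invented).
R = [[0.80, 0.10],
     [0.83, 0.75]]
print(round(bwt(R), 2))  # → 0.03: code quality rose after behavioral training
```
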

### 6. Nobody else is doing what we're doing

The survey covers hundreds of papers. NONE combine:

- Full-weight fine-tuning (not LoRA)
- With the Apollo optimizer (memory-efficient)
- With CUDA IPC shared weights (zero-copy)
- With context-frozen training (gradient masking)
- With dream-loop data generation (self-organized curriculum)
- With continuous online training (HOGWILD, no pause)
- With a memory graph as specification + immune system

Each individual technique exists in the literature. The combination is novel, and the combination is what makes it work: the multi-scale defense that no single technique provides.

## What We Can Learn from the Failures

The survey documents many systems that FAILED at continual learning:

### Common failure: narrow fine-tuning → catastrophic forgetting

Systems that fine-tuned on domain-specific data without replay or regularization consistently lost general capabilities. This is the "standard failure mode" we're explicitly defending against.

### Common failure: too much regularization → insufficient learning

Systems with strong EWC or parameter freezing maintained old capabilities but couldn't learn new ones effectively: the regularization prevented the gradient from making necessary changes.

Our approach avoids this by using DIVERSITY instead of regularization. We don't constrain the gradient; we spread it. The model is free to change any weight, but the diverse gradient signal ensures no weight changes too much in one direction.

### Common failure: LoRA adapters → shallow learning

Systems using LoRA for continual learning could adapt surface behaviors but struggled with deep reasoning changes: the low-rank constraint that prevents forgetting also prevents deep learning.

Our full-weight approach avoids this. Apollo provides the memory efficiency of LoRA without the rank constraint on the gradient. The gradient is full-rank; only the optimizer state is compressed.

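To make the contrast concrete, here is a toy sketch, NOT Apollo's actual algorithm (which derives channel-wise scaling from low-rank projected optimizer states): the update applied to `W` stays full-rank, while the optimizer keeps only one scalar of state per output channel.

```python
# Toy contrast (illustrative only): LoRA constrains the weight delta to
# rank r (delta_W = B @ A), while an Apollo-style step applies the
# full-rank gradient directly and compresses only the optimizer state --
# here, a single running mean-square scalar per row (channel).

def apollo_like_step(W, G, state, lr=0.01, eps=1e-8):
    """Update W with the full-rank gradient G, scaling each channel by a
    compressed per-channel second-moment estimate (O(rows) state)."""
    for i, row in enumerate(G):
        sq = sum(g * g for g in row) / len(row)   # channel mean-square
        state[i] = 0.9 * state[i] + 0.1 * sq      # compressed moment
        scale = lr / (state[i] ** 0.5 + eps)
        W[i] = [w - scale * g for w, g in zip(W[i], row)]
    return W, state

W, state = [[0.0] * 3, [0.0] * 3], [0.0, 0.0]
W, state = apollo_like_step(W, [[1.0] * 3, [2.0] * 3], state)
```

The point of the sketch: the per-weight update is unconstrained in direction (unlike a rank-r delta), yet the optimizer memory is one float per channel instead of one per weight.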
## The Novel Contribution

If we were writing a paper about our system, the contributions would be:

1. **CUDA IPC weight sharing** for zero-copy concurrent training alongside vLLM inference (no pause needed)
2. **Dream-loop curriculum generation** as a self-organizing training-data pipeline
3. **Memory graph as behavioral specification + immune system** for continual learning
4. **Multi-scale stability-plasticity defense** combining five orthogonal regularization mechanisms
5. **Context-frozen gradient masking** for efficient behavioral training on long conversations
6. **Apollo optimizer** for memory-efficient full-weight continual fine-tuning (vs. LoRA, which limits depth)

None of these is individually new (except maybe #1). The combination, and the architecture connecting them, is the contribution.

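Contribution #5 can be illustrated with the conventional ignore-index labeling trick; `mask_context` and the token layout are hypothetical, but the mechanism (no loss on context tokens means no gradient from them) is the gradient-masking idea:

```python
IGNORE = -100  # conventional ignore index: no loss, hence no gradient

def mask_context(token_ids, response_start):
    """Labels for one training example: context tokens contribute no loss,
    so gradients flow only through the model's own response tokens.
    (Illustrative; a real pipeline would mark spans per message role.)"""
    return [IGNORE] * response_start + token_ids[response_start:]

print(mask_context([11, 42, 7, 99, 23], response_start=3))
# → [-100, -100, -100, 99, 23]
```
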
## Next Steps from the Literature

### Evaluation protocol

The survey recommends evaluating on:

- Original-task performance (has general capability degraded?)
- New-task performance (did the behavioral training work?)
- Forward and backward transfer (cross-task benefits?)
- The forgetting metric (worst-case degradation?)

We should build this evaluation suite BEFORE the first real training run, so we have a baseline to compare against.

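A baseline harness for that suite could be as small as this sketch (suite names, the toy model, and the checkers are placeholders for our real behavioral and capability tests):

```python
def evaluate(model_fn, suites):
    """Score a model on every suite; `suites` maps a name to a pair of
    (inputs, checker). Returns the fraction of inputs each suite passes."""
    return {name: sum(check(model_fn(x)) for x in xs) / len(xs)
            for name, (xs, check) in suites.items()}

# Run once BEFORE the first training run, so later runs have a comparison.
baseline = evaluate(lambda x: x.upper(), {
    "general": (["abc", "def"], lambda out: out.isupper()),
    "listening": (["pause"], lambda out: "PAUSE" in out),
})
print(baseline)  # → {'general': 1.0, 'listening': 1.0}
```
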
### The "Recyclable Tuning" concept

[183] proposes separating a "supplier" (pre-training) stage from a "consumer" (fine-tuning) stage. Our architecture naturally has this split: vLLM is the supplier (it loads the pre-trained model), Apollo is the consumer (it fine-tunes the weights). The CUDA IPC bridge connects them.

### The importance of evaluation benchmarks

The survey identifies "the development of practical and accessible evaluation benchmarks" as a key need. For our use case, the benchmark IS the behavioral test suite: does the model listen? Does it rush? Does it wrap up? The dream loop generates the test cases; the training-signal agent evaluates them.

We're building the benchmark and the training system simultaneously. The dream loop serves both: training-data generator AND test suite.