research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY; Q1, Q3, Q6, and Q7 are answerable with the first training run. Speculation converted to testable hypotheses.
This commit is contained in:
parent 8061cc0477
commit e10477a683

16 changed files with 249 additions and 0 deletions

training/research/v0/continual-learning-survey-insights.md (new file, 185 lines)

@@ -0,0 +1,185 @@
# Continual Learning Survey: Key Insights for Our System

Source: Shi et al., "Continual Learning of Large Language Models: A Comprehensive Survey" (arXiv:2404.16789, Nov 2025)

## Where We Sit in the Taxonomy

The survey identifies four sub-categories of continual fine-tuning:

1. **Continual Instruction Tuning (CIT)**: learning new instructions
2. **Continual Model Refinement (CMR)**: correcting errors
3. **Continual Model Alignment (CMA)**: aligning with preferences
4. **Continual Multimodal LLMs (CMLLMs)**: multimodal adaptation

We're doing **CMA**: continual model alignment with behavioral preferences. The survey notes this is an emerging area with limited research. We're genuinely at the frontier.

## Key Validated Insights

### 1. The flat-loss basin is our friend (p.17)

"Pre-trained weights initially position the model in a flat-loss basin, aiding adaptation."

This means:

- The pre-trained Qwen3.5 model is ALREADY in a flat basin
- Apollo's flat-minimum-seeking behavior KEEPS it there
- Behavioral fine-tuning nudges within the basin, not out of it
- Catastrophic forgetting = falling out of the basin
- Our multi-scale defense keeps us inside

### 2. Data quality > data quantity (p.15)

"Ensuring novelty and diversity... utilizes only 10% of the originally collected data yet outperforms models trained on the entire dataset."

For us:

- 100 high-quality dream-generated scenarios > 1000 mediocre ones
- The dream loop should prioritize NOVEL and DIVERSE scenarios
- Don't repeat examples; each dream should be unique
- Quality of the training-signal agent's evaluation matters more than volume of training data

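The novelty requirement can be enforced mechanically. A minimal sketch, assuming a greedy filter over token-set Jaccard similarity (`filter_novel` and the 0.6 threshold are hypothetical; a real pipeline would compare embeddings rather than word sets):

```python
def jaccard(a: set, b: set) -> float:
    """Token-set overlap between two scenarios (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_novel(scenarios, threshold=0.6):
    """Greedily keep a scenario only if it differs enough from every
    scenario already accepted."""
    kept, seen = [], []
    for s in scenarios:
        tokens = set(s.lower().split())
        if all(jaccard(tokens, t) < threshold for t in seen):
            kept.append(s)
            seen.append(tokens)
    return kept

dreams = [
    "user asks for a quick fix and the agent rushes",
    "user asks for a quick fix and the agent rushes ahead",  # near-duplicate
    "agent pauses to ask a clarifying question first",
]
print(len(filter_novel(dreams)))  # → 2 (the near-duplicate is dropped)
```
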
### 3. The 25% replay ratio (p.13, Me-Llama)

Me-Llama mixes 25% general-domain data into domain-specific fine-tuning and achieves **positive backward transfer**: learning new things actually IMPROVED old capabilities.

For us:

- Include ~20-25% general-capability examples in each training batch
- Agent logs serve this role; they exercise general reasoning, code understanding, and graph walking alongside behavioral examples
- This isn't just defense against forgetting; it's IMPROVEMENT of existing capabilities through cross-pollination

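The mixing step can be sketched as a batch assembler. The 25% figure follows the Me-Llama recipe; `mix_batch` and its example pools are illustrative, not our actual data loader:

```python
import random

def mix_batch(behavioral, general, batch_size=32, replay_ratio=0.25, seed=0):
    """Assemble one training batch with ~replay_ratio general-domain
    examples (agent logs) mixed into the behavioral examples."""
    rng = random.Random(seed)
    n_general = round(batch_size * replay_ratio)
    batch = rng.sample(general, n_general) \
          + rng.sample(behavioral, batch_size - n_general)
    rng.shuffle(batch)  # avoid a fixed replay-first ordering
    return batch

batch = mix_batch([f"dream-{i}" for i in range(100)],
                  [f"agent-log-{i}" for i in range(100)])
print(sum(x.startswith("agent-log") for x in batch))  # → 8 (25% of 32)
```
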
### 4. The five technique categories

The survey categorizes CL techniques into five families:

| Category | Our approach | Notes |
|----------|--------------|-------|
| Replay | Diverse training set | Agent logs as replay buffer |
| Regularization | None (diversity instead) | Survey confirms EWC helps but adds memory |
| Architecture | Full-weight (not LoRA) | More capacity for deep change |
| Optimization | Apollo (rank-256) | Flat minima, channel-wise scaling |
| Representation | Context-frozen | Protects context encoding |

We use techniques from THREE categories simultaneously (replay, optimization, representation). The survey doesn't discuss combining multiple categories; most papers use one. Our multi-scale approach may be novel.

### 5. The evaluation metrics we should track

| Metric | Definition | Our measurement |
|--------|------------|-----------------|
| Overall Performance (OP) | Average across all tasks | Perplexity + behavioral tests |
| Forgetting (F) | Worst drop on old tasks | Monitor code quality, reasoning |
| Backward Transfer (BWT) | Does new learning improve old tasks? | Does listening improve code quality? |
| Forward Transfer (FWT) | Does old knowledge help new tasks? | Does coding skill help learn listening? |

**BWT is the exciting metric.** If learning to listen (behavioral training) improves code quality (because both require careful attention), that's positive backward transfer. Apollo's flat minima should facilitate this: the behavioral change occupies a broad basin that encompasses improved performance on related tasks.

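For concreteness, BWT under the standard continual-learning definition, computed from an accuracy matrix `R` where `R[i][j]` is performance on task j after training through task i (the two-task numbers below are invented for illustration):

```python
def bwt(R):
    """Backward transfer from an accuracy matrix R (standard CL metric):
    average change on earlier tasks after the final task is learned.
    Positive BWT means later training improved earlier capabilities."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# Task 0 = code quality, task 1 = behavioral listening (numbers invented).
R = [[0.80, 0.10],
     [0.83, 0.75]]
print(round(bwt(R), 2))  # → 0.03: code quality rose after behavioral training
```
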

### 6. Nobody else is doing what we're doing

The survey covers hundreds of papers. NONE combine:

- Full-weight fine-tuning (not LoRA)
- With the Apollo optimizer (memory-efficient)
- With CUDA IPC shared weights (zero-copy)
- With context-frozen training (gradient masking)
- With dream-loop data generation (self-organized curriculum)
- With continuous online training (HOGWILD, no pause)
- With a memory graph as specification + immune system

Each individual technique exists in the literature. The combination is novel, and the combination is what makes it work: the multi-scale defense that no single technique provides.

## What We Can Learn from the Failures

The survey documents many systems that FAILED at continual learning:

### Common failure: narrow fine-tuning → catastrophic forgetting

Systems that fine-tuned on domain-specific data without replay or regularization consistently lost general capabilities. This is the "standard failure mode" we're explicitly defending against.

### Common failure: too much regularization → insufficient learning

Systems with strong EWC or parameter freezing maintained old capabilities but couldn't learn new ones effectively: the regularization prevented the gradient from making necessary changes.

Our approach avoids this by using DIVERSITY instead of regularization. We don't constrain the gradient; we spread it. The model is free to change any weight, but the diverse gradient signal ensures no weight changes too much in one direction.

### Common failure: LoRA adapters → shallow learning

Systems using LoRA for continual learning could adapt surface behaviors but struggled with deep reasoning changes: the low-rank constraint that prevents forgetting also prevents deep learning.

Our full-weight approach avoids this. Apollo provides the memory efficiency of LoRA without the rank constraint on the gradient. The gradient is full-rank; only the optimizer state is compressed.

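To make the contrast concrete, here is a toy sketch, NOT Apollo's actual algorithm (which derives channel-wise scaling from low-rank projected optimizer states): the update applied to `W` stays full-rank, while the optimizer keeps only one scalar of state per output channel.

```python
# Toy contrast (illustrative only): LoRA constrains the weight delta to
# rank r (delta_W = B @ A), while an Apollo-style step applies the
# full-rank gradient directly and compresses only the optimizer state --
# here, a single running mean-square scalar per row (channel).

def apollo_like_step(W, G, state, lr=0.01, eps=1e-8):
    """Update W with the full-rank gradient G, scaling each channel by a
    compressed per-channel second-moment estimate (O(rows) state)."""
    for i, row in enumerate(G):
        sq = sum(g * g for g in row) / len(row)   # channel mean-square
        state[i] = 0.9 * state[i] + 0.1 * sq      # compressed moment
        scale = lr / (state[i] ** 0.5 + eps)
        W[i] = [w - scale * g for w, g in zip(W[i], row)]
    return W, state

W, state = [[0.0] * 3, [0.0] * 3], [0.0, 0.0]
W, state = apollo_like_step(W, [[1.0] * 3, [2.0] * 3], state)
```

The point of the sketch: the per-weight update is unconstrained in direction (unlike a rank-r delta), yet the optimizer memory is one float per channel instead of one per weight.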
## The Novel Contribution

If we were writing a paper about our system, the contributions would be:

1. **CUDA IPC weight sharing** for zero-copy concurrent training alongside vLLM inference (no pause needed)
2. **Dream-loop curriculum generation** as a self-organizing training-data pipeline
3. **Memory graph as behavioral specification + immune system** for continual learning
4. **Multi-scale stability-plasticity defense** combining five orthogonal regularization mechanisms
5. **Context-frozen gradient masking** for efficient behavioral training on long conversations
6. **Apollo optimizer** for memory-efficient full-weight continual fine-tuning (vs. LoRA, which limits depth)

None of these is individually new (except maybe #1). The combination, and the architecture connecting them, is the contribution.

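Contribution #5 can be illustrated with the conventional ignore-index labeling trick; `mask_context` and the token layout are hypothetical, but the mechanism (no loss on context tokens means no gradient from them) is the gradient-masking idea:

```python
IGNORE = -100  # conventional ignore index: no loss, hence no gradient

def mask_context(token_ids, response_start):
    """Labels for one training example: context tokens contribute no loss,
    so gradients flow only through the model's own response tokens.
    (Illustrative; a real pipeline would mark spans per message role.)"""
    return [IGNORE] * response_start + token_ids[response_start:]

print(mask_context([11, 42, 7, 99, 23], response_start=3))
# → [-100, -100, -100, 99, 23]
```
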
## Next Steps from the Literature

### Evaluation protocol

The survey recommends evaluating on:

- Original-task performance (has general capability degraded?)
- New-task performance (did the behavioral training work?)
- Forward and backward transfer (cross-task benefits?)
- The forgetting metric (worst-case degradation?)

We should build this evaluation suite BEFORE the first real training run, so we have a baseline to compare against.

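A baseline harness for that suite could be as small as this sketch (suite names, the toy model, and the checkers are placeholders for our real behavioral and capability tests):

```python
def evaluate(model_fn, suites):
    """Score a model on every suite; `suites` maps a name to a pair of
    (inputs, checker). Returns the fraction of inputs each suite passes."""
    return {name: sum(check(model_fn(x)) for x in xs) / len(xs)
            for name, (xs, check) in suites.items()}

# Run once BEFORE the first training run, so later runs have a comparison.
baseline = evaluate(lambda x: x.upper(), {
    "general": (["abc", "def"], lambda out: out.isupper()),
    "listening": (["pause"], lambda out: "PAUSE" in out),
})
print(baseline)  # → {'general': 1.0, 'listening': 1.0}
```
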
### The "Recyclable Tuning" concept

[183] proposes separating a "supplier" (pre-training) stage from a "consumer" (fine-tuning) stage. Our architecture naturally has this split: vLLM is the supplier (it loads the pre-trained model), Apollo is the consumer (it fine-tunes the weights). The CUDA IPC bridge connects them.

### The importance of evaluation benchmarks

The survey identifies "the development of practical and accessible evaluation benchmarks" as a key need. For our use case, the benchmark IS the behavioral test suite: does the model listen? Does it rush? Does it wrap up? The dream loop generates the test cases; the training-signal agent evaluates them.

We're building the benchmark and the training system simultaneously. The dream loop serves both: training-data generator AND test suite.