# Continual Learning Survey: Key Insights for Our System

Source: Shi et al., "Continual Learning of Large Language Models: A Comprehensive Survey" (arXiv:2404.16789, Nov 2025)

## Where We Sit in the Taxonomy

The survey identifies four sub-categories of Continual Fine-Tuning:

1. **Continual Instruction Tuning (CIT)**: learning new instructions
2. **Continual Model Refinement (CMR)**: correcting errors
3. **Continual Model Alignment (CMA)**: aligning with preferences
4. **Continual Multimodal LLMs (CMLLMs)**: multimodal adaptation

We're doing **CMA** — continual model alignment with behavioral preferences. The survey notes this is an emerging area with limited research. We're genuinely at the frontier.

|
## Key Validated Insights

### 1. The flat-loss basin is our friend (p.17)

"Pre-trained weights initially position the model in a flat-loss basin, aiding adaptation."

This means:

- The pre-trained Qwen3.5 model is ALREADY in a flat basin
- Apollo's flat-minimum-seeking behavior KEEPS it there
- Behavioral fine-tuning nudges within the basin, not out of it
- Catastrophic forgetting = falling out of the basin
- Our multi-scale defense keeps us inside

### 2. Data quality > data quantity (p.15)

"Ensuring novelty and diversity... utilizes only 10% of the originally collected data yet outperforms models trained on the entire dataset."

For us:

- 100 high-quality dream-generated scenarios > 1000 mediocre ones
- The dream loop should prioritize NOVEL and DIVERSE scenarios
- Don't repeat examples — each dream should be unique
- Quality of the training-signal agent's evaluation matters more than volume of training data

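To make "novel and diverse" operational, the dream loop needs a cheap novelty gate before a scenario enters the training set. A minimal sketch — the function names, the character-n-gram fingerprint, and the 0.6 threshold are illustrative assumptions, not anything from the survey:

```python
def ngrams(text: str, n: int = 3) -> set:
    """Character n-grams as a cheap similarity fingerprint."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def filter_novel(scenarios: list[str], threshold: float = 0.6) -> list[str]:
    """Keep a scenario only if it is sufficiently dissimilar
    from everything already accepted."""
    kept, fingerprints = [], []
    for s in scenarios:
        fp = ngrams(s)
        if all(jaccard(fp, prev) < threshold for prev in fingerprints):
            kept.append(s)
            fingerprints.append(fp)
    return kept
```

N-grams only catch near-duplicates; a real gate would probably compare embeddings, but the shape of the check — reject before training, not after — is the point.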
### 3. The 25% replay ratio (p.13, Me-Llama)

Me-Llama mixes 25% general-domain data during domain-specific fine-tuning and achieves **positive backward transfer** — learning new things actually IMPROVED old capabilities.

For us:

- Include ~20-25% general capability examples in each training batch
- Agent logs serve this role — they exercise general reasoning, code understanding, graph walking alongside behavioral examples
- This isn't just defense against forgetting — it's IMPROVEMENT of existing capabilities through cross-pollination

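The 20-25% mix can be sketched as a batch-composition step. A hypothetical illustration (the names and signature are assumptions; a real pipeline would stream from the agent-log store rather than sample from in-memory lists):

```python
import random

def mixed_batch(behavioral, general, batch_size=32, replay_ratio=0.25, rng=None):
    """Compose one training batch: ~25% general-capability examples
    (agent logs acting as the replay buffer) and ~75% behavioral
    examples from the dream loop."""
    rng = rng or random.Random()
    n_replay = round(batch_size * replay_ratio)
    batch = (rng.sample(general, n_replay) +
             rng.sample(behavioral, batch_size - n_replay))
    rng.shuffle(batch)  # avoid a fixed replay/behavioral ordering
    return batch
```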
### 4. The five technique categories

The survey categorizes CL techniques into:

| Category | Our approach | Notes |
|----------|--------------|-------|
| Replay | Diverse training set | Agent logs as replay buffer |
| Regularization | None (diversity instead) | Survey confirms EWC helps but adds memory |
| Architecture | Full-weight (not LoRA) | More capacity for deep change |
| Optimization | Apollo (rank-256) | Flat minima, channel-wise scaling |
| Representation | Context-frozen | Protects context encoding |

We use techniques from THREE categories simultaneously (replay, optimization, representation). The survey doesn't discuss combining multiple categories — most papers use one. Our multi-scale approach may be novel.

### 5. The evaluation metrics we should track

| Metric | Definition | Our measurement |
|--------|------------|-----------------|
| Overall Performance (OP) | Average across all tasks | Perplexity + behavioral tests |
| Forgetting (F) | Worst drop on old tasks | Monitor code quality, reasoning |
| Backward Transfer (BWT) | Does new learning improve old? | Does listening improve code quality? |
| Forward Transfer (FWT) | Does old knowledge help new? | Does coding skill help learn listening? |

**BWT is the exciting metric.** If learning to listen (behavioral training) improves code quality (because both require careful attention), that's positive backward transfer. Apollo's flat minima should facilitate this — the behavioral change occupies a broad basin that encompasses improved performance on related tasks.

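All four metrics fall out of one T x T score matrix, following the usual continual-learning bookkeeping: R[i][j] = score on task j measured after training stage i, with tasks trained in order. A sketch under those assumptions — exact definitions vary slightly across papers, and this uses the common BWT/FWT conventions plus a max-drop forgetting score:

```python
def cl_metrics(R, baseline):
    """Continual-learning metrics from a T x T performance matrix.

    R[i][j]     = score on task j after training on task i (tasks 0..T-1 in order)
    baseline[j] = score of the untrained model on task j (needed for FWT)
    """
    T = len(R)
    final = R[T - 1]
    op = sum(final) / T  # Overall Performance: average final score
    # BWT: how final scores on earlier tasks compare to scores right after learning them
    bwt = sum(final[j] - R[j][j] for j in range(T - 1)) / (T - 1)
    # FWT: score on each task just BEFORE training it, relative to the untrained baseline
    fwt = sum(R[j - 1][j] - baseline[j] for j in range(1, T)) / (T - 1)
    # Forgetting: worst drop from any earlier task's best score to its final score
    forgetting = max(max(R[i][j] for i in range(T)) - final[j]
                     for j in range(T - 1))
    return {"OP": op, "BWT": bwt, "FWT": fwt, "F": forgetting}
```

Building this before the first training run gives us the baseline row for free: evaluate the untrained model on every task once, then re-run the suite after each stage.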
### 6. Nobody else is doing what we're doing

The survey covers hundreds of papers. NONE combine:

- Full-weight fine-tuning (not LoRA)
- With Apollo optimizer (memory-efficient)
- With CUDA IPC shared weights (zero-copy)
- With context-frozen training (gradient masking)
- With dream-loop data generation (self-organized curriculum)
- With continuous online training (HOGWILD, no pause)
- With a memory graph as specification + immune system

Each individual technique exists in the literature. The combination is novel. And the combination is what makes it work — the multi-scale defense that no single technique provides.

## What We Can Learn from the Failures

The survey documents many systems that FAILED at continual learning:

### Common failure: narrow fine-tuning → catastrophic forgetting

Systems that fine-tuned on domain-specific data without replay or regularization consistently lost general capabilities. This is the "standard failure mode" we're explicitly defending against.

### Common failure: too much regularization → insufficient learning

Systems with strong EWC or parameter freezing maintained old capabilities but couldn't learn new ones effectively. The regularization prevented the gradient from making necessary changes.

Our approach avoids this by using DIVERSITY instead of regularization. We don't constrain the gradient — we spread it. The model is free to change any weight, but the diverse gradient signal ensures no weight changes too much in one direction.

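For concreteness, the regularizer these systems lean on is the EWC quadratic penalty: each weight is pinned near its old value in proportion to its estimated (Fisher) importance, and a large lambda is exactly the over-constraint described above. A pure-Python sketch of just the penalty term — shown to make the failure mode concrete, not because we use it:

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    """EWC penalty: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta      - current weights
    theta_star - weights after the previous task
    fisher     - per-weight importance estimates
    lam        - regularization strength; large values freeze the model
    """
    return 0.5 * lam * sum(f * (t - ts) ** 2
                           for f, t, ts in zip(fisher, theta, theta_star))
```

The penalty grows quadratically with any movement of an "important" weight, so with a large lambda the task loss can never pull the model far enough to learn the new behavior — which is the documented failure.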
### Common failure: LoRA adapters → shallow learning

Systems using LoRA for continual learning could adapt surface behaviors but struggled with deep reasoning changes. The low-rank constraint that prevents forgetting also prevents deep learning.

Our full-weight approach avoids this. Apollo provides the memory efficiency of LoRA without the rank constraint on the gradient. The gradient is full-rank; only the optimizer state is compressed.

## The Novel Contribution

If we were writing a paper about our system, the contributions would be:

1. **CUDA IPC weight sharing** for zero-copy concurrent training alongside vLLM inference (no pause needed)
2. **Dream-loop curriculum generation** as a self-organizing training data pipeline
3. **Memory graph as behavioral specification + immune system** for continual learning
4. **Multi-scale stability-plasticity defense** combining five orthogonal regularization mechanisms
5. **Context-frozen gradient masking** for efficient behavioral training on long conversations
6. **Apollo optimizer** for memory-efficient full-weight continual fine-tuning (vs LoRA, which limits depth)

None of these are individually new (except maybe #1). The combination and the architecture connecting them is the contribution.

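Contribution #5 reduces to standard loss masking: context tokens get the ignore label, so they produce no loss and therefore no gradient, while still conditioning the response tokens through attention. A minimal sketch, assuming the usual -100 ignore-index convention for cross-entropy losses (the function name and exact wiring into our trainer are hypothetical):

```python
IGNORE_INDEX = -100  # conventional ignore label for cross-entropy losses

def mask_context_labels(token_ids, context_len):
    """Build a label sequence from token_ids, replacing every token in
    the (frozen) context prefix with IGNORE_INDEX so it contributes
    no loss -- and therefore no gradient -- during training."""
    return [IGNORE_INDEX if i < context_len else tok
            for i, tok in enumerate(token_ids)]
```

On long conversations this is where the efficiency comes from: a 30k-token context with a 500-token response backpropagates loss from only the 500 response positions.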
## Next Steps from the Literature

### Evaluation protocol

The survey recommends evaluating on:

- Original task performance (has general capability degraded?)
- New task performance (did the behavioral training work?)
- Forward and backward transfer (cross-task benefits?)
- Forgetting metric (worst-case degradation?)

We should build this evaluation suite BEFORE the first real training run, so we have a baseline to compare against.

### The "Recyclable Tuning" concept

[183] proposes separating "supplier" (pre-training) from "consumer" (fine-tuning) stages. Our architecture naturally has this: vLLM is the supplier (loads the pre-trained model), Apollo is the consumer (fine-tunes the weights). The CUDA IPC bridge connects them.

### The importance of evaluation benchmarks

The survey identifies "the development of practical and accessible evaluation benchmarks" as a key need. For our use case, the benchmark IS the behavioral test suite: does the model listen? Does it rush? Does it wrap up? The dream loop generates the test cases. The training-signal agent evaluates.

We're building the benchmark and the training system simultaneously. The dream loop serves both: training data generator AND test suite.