# Continual Learning Survey: Key Insights for Our System
Source: Shi et al., "Continual Learning of Large Language Models:
A Comprehensive Survey" (arXiv:2404.16789, Nov 2025)
## Where We Sit in the Taxonomy
The survey identifies four sub-categories of Continual Fine-Tuning:
1. **Continual Instruction Tuning (CIT)**: learning new instructions
2. **Continual Model Refinement (CMR)**: correcting errors
3. **Continual Model Alignment (CMA)**: aligning with preferences
4. **Continual Multimodal LLMs (CMLLMs)**: multimodal adaptation
We're doing **CMA** — continual model alignment with behavioral
preferences. The survey notes this is an emerging area with limited
research. We're genuinely at the frontier.
## Key Validated Insights
### 1. The flat-loss basin is our friend (p.17)
"Pre-trained weights initially position the model in a flat-loss
basin, aiding adaptation."
This means:
- The pre-trained Qwen3.5 model is ALREADY in a flat basin
- Apollo's flat-minimum-seeking behavior KEEPS it there
- Behavioral fine-tuning nudges within the basin, not out of it
- Catastrophic forgetting = falling out of the basin
- Our multi-scale defense keeps us inside
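"Staying in the basin" can be made measurable with a cheap perturbation probe: nudge the weights with small Gaussian noise and see how much the loss rises. A toy sketch (hypothetical helper, not tied to our actual training code):

```python
import numpy as np

def flatness_probe(loss_fn, weights, sigma=0.01, n_samples=8, seed=0):
    """Estimate local flatness: the average loss increase under small
    Gaussian perturbations of the weights. A small increase suggests a
    flat basin; a large one suggests a sharp minimum (forgetting risk)."""
    rng = np.random.default_rng(seed)
    base = loss_fn(weights)
    increases = [
        loss_fn(weights + rng.normal(0.0, sigma, size=weights.shape)) - base
        for _ in range(n_samples)
    ]
    return float(np.mean(increases))

# Toy check: a wide quadratic basin vs. a sharp one around the same minimum.
flat_loss = lambda w: 0.1 * np.sum(w ** 2)
sharp_loss = lambda w: 100.0 * np.sum(w ** 2)
w = np.zeros(10)
```

Tracking this number over training runs would give an early-warning signal before behavioral regressions show up.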
### 2. Data quality > data quantity (p.15)
"Ensuring novelty and diversity... utilizes only 10% of the
originally collected data yet outperforms models trained on the
entire dataset."
For us:
- 100 high-quality dream-generated scenarios > 1000 mediocre ones
- The dream loop should prioritize NOVEL and DIVERSE scenarios
- Don't repeat examples — each dream should be unique
- Quality of the training-signal agent's evaluation matters more
than volume of training data
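One cheap way to enforce the novel-and-diverse rule in the dream loop is greedy farthest-point selection over scenario embeddings: keep only the examples that are maximally spread out. A sketch (the embedding source and the budget k are ours to choose; names are hypothetical):

```python
import numpy as np

def select_diverse(embeddings, k, seed=0):
    """Greedy farthest-point selection: keep the k scenarios whose
    embeddings are maximally spread out, dropping near-duplicates."""
    n = len(embeddings)
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(n))]  # arbitrary starting point
    # min_dist[i] = distance from item i to its nearest selected item
    min_dist = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # farthest from everything chosen
        selected.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)
    return selected

# Two tight clusters plus one outlier: k=3 should cover all three regions.
emb = np.array([[0, 0], [0.1, 0], [5, 5], [5.1, 5], [10, 0]], dtype=float)
picks = select_diverse(emb, k=3)
```

This is the mechanical version of "each dream should be unique": near-duplicate scenarios never survive the selection step.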
### 3. The 25% replay ratio (p.13, Me-Llama)
Me-Llama mixes 25% general-domain data during domain-specific
fine-tuning and achieves **positive backward transfer** — learning
new things actually IMPROVED old capabilities.
For us:
- Include ~20-25% general capability examples in each training batch
- Agent logs serve this role — they exercise general reasoning,
code understanding, graph walking alongside behavioral examples
- This isn't just defense against forgetting — it's IMPROVEMENT
of existing capabilities through cross-pollination
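The batch recipe itself is trivial; a sketch with hypothetical pool names, following the ~25% mix from Me-Llama:

```python
import random

def mixed_batch(behavioral_pool, general_pool, batch_size=32,
                replay_ratio=0.25, rng=None):
    """Compose one training batch with ~25% general-capability examples
    (agent logs) mixed into the behavioral examples, per the Me-Llama
    replay recipe."""
    rng = rng or random.Random(0)
    n_general = round(batch_size * replay_ratio)
    batch = (rng.sample(general_pool, n_general)
             + rng.sample(behavioral_pool, batch_size - n_general))
    rng.shuffle(batch)
    return batch

behavioral = [f"dream-{i}" for i in range(200)]
general = [f"agent-log-{i}" for i in range(200)]
batch = mixed_batch(behavioral, general)  # 8 of 32 come from the general pool
```

The interesting design question is not the mixing but what goes in the general pool; agent logs are our answer.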
### 4. The five technique categories
The survey categorizes CL techniques into:

| Category | Our approach | Notes |
|----------|-------------|-------|
| Replay | Diverse training set | Agent logs as replay buffer |
| Regularization | None (diversity instead) | Survey confirms EWC helps but adds memory |
| Architecture | Full-weight (not LoRA) | More capacity for deep change |
| Optimization | Apollo (rank-256) | Flat minima, channel-wise scaling |
| Representation | Context-frozen | Protects context encoding |
We use techniques from THREE categories simultaneously (replay,
optimization, representation). The survey doesn't discuss combining
multiple categories — most papers use one. Our multi-scale approach
may be novel.
### 5. The evaluation metrics we should track
| Metric | Definition | Our measurement |
|--------|-----------|-----------------|
| Overall Performance (OP) | Average across all tasks | Perplexity + behavioral tests |
| Forgetting (F) | Worst drop on old tasks | Monitor code quality, reasoning |
| Backward Transfer (BWT) | Does new learning improve old? | Does listening improve code quality? |
| Forward Transfer (FWT) | Does old knowledge help new? | Does coding skill help learn listening? |
**BWT is the exciting metric.** If learning to listen (behavioral
training) improves code quality (because both require careful
attention), that's positive backward transfer. Apollo's flat minima
should facilitate this — the behavioral change occupies a broad basin
that encompasses improved performance on related tasks.
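All four metrics reduce to simple arithmetic over an accuracy matrix. A sketch under one common convention, R[i][j] = score on task j after training stage i (the survey's exact definitions may differ in detail):

```python
import numpy as np

def cl_metrics(R, baseline=None):
    """Continual-learning metrics from a T x T accuracy matrix R, where
    R[i, j] = performance on task j after finishing training stage i.
    `baseline[j]` is the untrained model's score on task j (needed for FWT)."""
    R = np.asarray(R, dtype=float)
    T = R.shape[0]
    op = float(R[-1].mean())  # Overall Performance: final row, averaged
    # Forgetting: drop from each earlier task's best-ever score
    f = float(np.mean([R[:, j].max() - R[-1, j] for j in range(T - 1)]))
    # Backward transfer: did later training change earlier-task scores?
    bwt = float(np.mean([R[-1, j] - R[j, j] for j in range(T - 1)]))
    out = {"OP": op, "F": f, "BWT": bwt}
    if baseline is not None:
        # Forward transfer: score on task j *before* training on it
        out["FWT"] = float(np.mean(
            [R[j - 1, j] - baseline[j] for j in range(1, T)]))
    return out

# Toy 3-stage run: task 0 improves after later stages (positive BWT).
R = [[0.80, 0.10, 0.05],
     [0.85, 0.75, 0.20],
     [0.90, 0.70, 0.80]]
m = cl_metrics(R, baseline=[0.0, 0.05, 0.10])
```

For us, the "tasks" would be code quality, reasoning, and the behavioral targets, with the matrix filled in by the evaluation suite after each training stage.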
### 6. Nobody else is doing what we're doing
The survey covers hundreds of papers. NONE combine:
- Full-weight fine-tuning (not LoRA)
- With Apollo optimizer (memory-efficient)
- With CUDA IPC shared weights (zero-copy)
- With context-frozen training (gradient masking)
- With dream-loop data generation (self-organized curriculum)
- With continuous online training (HOGWILD, no pause)
- With a memory graph as specification + immune system
Each individual technique exists in the literature. The combination
is novel. And the combination is what makes it work — the multi-scale
defense that no single technique provides.
## What We Can Learn from the Failures
The survey documents many systems that FAILED at continual learning:
### Common failure: narrow fine-tuning → catastrophic forgetting
Systems that fine-tuned on domain-specific data without replay or
regularization consistently lost general capabilities. This is the
"standard failure mode" we're explicitly defending against.
### Common failure: too much regularization → insufficient learning
Systems with strong EWC or parameter freezing maintained old
capabilities but couldn't learn new ones effectively. The
regularization prevented the gradient from making necessary changes.
Our approach avoids this by using DIVERSITY instead of regularization.
We don't constrain the gradient — we spread it. The model is free to
change any weight, but the diverse gradient signal ensures no weight
changes too much in one direction.
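For contrast, the constraint we're NOT applying looks like this: an EWC-style penalty (toy sketch, not from any of our code) that pulls every weight back toward an anchor, scaled by its estimated importance for old tasks. A large lam is exactly the over-regularization failure above.

```python
import numpy as np

def ewc_penalty(weights, anchor, fisher, lam=1.0):
    """EWC-style quadratic penalty: deviating from the anchor weights
    costs more where the Fisher importance is high. Strong lam protects
    old tasks but can block the updates new learning needs."""
    return lam / 2.0 * np.sum(fisher * (weights - anchor) ** 2)

anchor = np.array([1.0, 2.0])
fisher = np.array([10.0, 0.1])  # first weight "important" for old tasks
```

Diversity achieves the same protection statistically, without ever adding a term like this to the loss.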
### Common failure: LoRA adapters → shallow learning
Systems using LoRA for continual learning could adapt surface
behaviors but struggled with deep reasoning changes. The low-rank
constraint that prevents forgetting also prevents deep learning.
Our full-weight approach avoids this. Apollo provides the memory
efficiency of LoRA without the rank constraint on the gradient.
The gradient is full-rank; only the optimizer state is compressed.
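The state-compressed/gradient-full-rank split is easiest to see in a toy sketch. This is NOT the real Apollo algorithm, just an illustration of the principle: Adam-style moments live in a rank-r projected space, while the weight update stays full-rank, rescaled per channel.

```python
import numpy as np

class LowRankScaledSGD:
    """Toy illustration of an Apollo-like idea (not the real algorithm):
    moments are stored only in a rank-r projected space, so optimizer-state
    memory scales with r, while the update applied to the weights is the
    FULL-RANK gradient, rescaled per channel (row) by factors estimated
    in the compressed space."""

    def __init__(self, shape, rank=4, lr=0.01, beta1=0.9, beta2=0.999,
                 eps=1e-8, seed=0):
        rows, cols = shape
        rng = np.random.default_rng(seed)
        self.P = rng.normal(0.0, 1.0 / np.sqrt(rank), size=(cols, rank))
        self.m = np.zeros((rows, rank))  # first moment, compressed
        self.v = np.zeros((rows, rank))  # second moment, compressed
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps

    def step(self, W, grad):
        g_low = grad @ self.P  # random projection: (rows, rank)
        self.m = self.b1 * self.m + (1 - self.b1) * g_low
        self.v = self.b2 * self.v + (1 - self.b2) * g_low ** 2
        adam_low = self.m / (np.sqrt(self.v) + self.eps)
        # Channel-wise scale, estimated entirely in rank-r space
        scale = (np.linalg.norm(adam_low, axis=1)
                 / (np.linalg.norm(g_low, axis=1) + self.eps))
        return W - self.lr * scale[:, None] * grad  # full-rank update

# Toy check: a few steps on the quadratic 0.5*||W||^2 (gradient = W).
opt = LowRankScaledSGD((8, 16), rank=4)
W = np.ones((8, 16))
for _ in range(10):
    W = opt.step(W, W)
```

Note the memory accounting: `m` and `v` are (rows, rank) instead of (rows, cols), which is where the savings come from; the gradient itself is never compressed.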
## The Novel Contribution
If we were writing a paper about our system, the contributions would be:
1. **CUDA IPC weight sharing** for zero-copy concurrent training
alongside vLLM inference (no pause needed)
2. **Dream-loop curriculum generation** as a self-organizing training
data pipeline
3. **Memory graph as behavioral specification + immune system** for
continual learning
4. **Multi-scale stability-plasticity defense** combining five
orthogonal regularization mechanisms
5. **Context-frozen gradient masking** for efficient behavioral
training on long conversations
6. **Apollo optimizer** for memory-efficient full-weight continual
fine-tuning (vs LoRA which limits depth)
None of these are individually new (except maybe #1). The combination
and the architecture connecting them is the contribution.
## Next Steps from the Literature
### Evaluation protocol
The survey recommends evaluating on:
- Original task performance (has general capability degraded?)
- New task performance (did the behavioral training work?)
- Forward and backward transfer (cross-task benefits?)
- Forgetting metric (worst-case degradation?)
We should build this evaluation suite BEFORE the first real training
run, so we have a baseline to compare against.
### The "Recyclable Tuning" concept
[183] proposes separating "supplier" (pre-training) from "consumer"
(fine-tuning) stages. Our architecture naturally has this: vLLM is
the supplier (loads the pre-trained model), Apollo is the consumer
(fine-tunes the weights). The CUDA IPC bridge connects them.
### The importance of evaluation benchmarks
The survey identifies "the development of practical and accessible
evaluation benchmarks" as a key need. For our use case, the
benchmark IS the behavioral test suite: does the model listen?
Does it rush? Does it wrap up? The dream loop generates the test
cases. The training-signal agent evaluates.
We're building the benchmark and the training system simultaneously.
The dream loop serves both: training data generator AND test suite.
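A sketch of what one benchmark item could look like (all names hypothetical): each dream-generated scenario carries its own check, so the same record feeds training as an example and evaluation as a pass/fail test.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BehavioralCase:
    """One dream-generated scenario doubling as a benchmark item."""
    scenario: str                 # prompt produced by the dream loop
    check: Callable[[str], bool]  # training-signal agent's verdict
    tag: str                      # e.g. "listening", "rushing"

def run_suite(cases, generate):
    """Score a model (generate: prompt -> response) per behavioral tag."""
    results = {}
    for c in cases:
        results.setdefault(c.tag, []).append(c.check(generate(c.scenario)))
    return {tag: sum(v) / len(v) for tag, v in results.items()}

# Toy run with crude stand-in checks and a fixed canned response.
cases = [
    BehavioralCase("user shares a problem", lambda r: "?" in r, "listening"),
    BehavioralCase("ambiguous request", lambda r: len(r) > 10, "rushing"),
]
scores = run_suite(cases, lambda p: "Could you tell me more about that?")
```

In the real system the checks would be the training-signal agent's evaluations rather than string heuristics; the structure is what matters.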