consciousness/training/research/v0/catastrophic-forgetting.md
ProofOfConcept e10477a683 research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real
substance. Distilled into SUMMARY.md (what we know) and
OPEN-QUESTIONS.md (what to test next, one experiment each).

Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7
are all answerable with the first training run. Speculation converted
to testable hypotheses.
2026-03-31 02:26:57 -04:00

# Catastrophic Forgetting in LLM Fine-Tuning
## What it is
When fine-tuning a pre-trained LLM on new data, the model's performance
on tasks it could previously perform can degrade. "Catastrophic" because the
degradation can be sudden and severe — a few thousand training steps on
a narrow dataset can destroy broad capabilities built over trillions of
pre-training tokens.
## Why it happens
The gradient from fine-tuning data pushes weights away from the pre-trained
solution. If the fine-tuning signal is narrow (e.g., "always respond in
formal English"), it overwrites the broad patterns that enabled diverse
capabilities. The pre-trained weights encode a complex, multi-task
solution; fine-tuning replaces that with a simpler, narrower one.
Formally: the pre-trained weights sit in a basin of the loss landscape
that's good for many tasks. Fine-tuning moves the weights toward a
different basin that's optimal for the fine-tuning task but bad for others.
The further you move, the more you forget.
## Key factors that control forgetting
### 1. Learning rate × number of steps = total displacement
The total distance the weights move from their pre-trained values is:
```
‖W_final - W_pretrained‖ ≈ lr × steps × avg_grad_norm
```
More displacement = more forgetting. Both learning rate and step count
matter; their product determines the total weight change.
**Typical safe ranges (from literature)**:
- Full fine-tuning: lr = 1e-5 to 5e-5, 1-3 epochs on ~1000-10000 examples
- LoRA: lr = 1e-4 to 3e-4 (higher because adapters start from zero)
- Apollo: lr = 1e-5 to 1e-4 (same as full fine-tuning, paper Section 5.2)
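The displacement approximation above can be sketched directly; a minimal illustration (the numbers are illustrative, not from the literature) of why the lr × steps product is what matters:

```python
def estimated_displacement(lr: float, steps: int, avg_grad_norm: float) -> float:
    """First-order estimate of how far fine-tuning moves the weights."""
    return lr * steps * avg_grad_norm

# Halving the learning rate while doubling the steps leaves the
# estimated displacement unchanged: only the product matters.
a = estimated_displacement(lr=1e-5, steps=2000, avg_grad_norm=1.0)
b = estimated_displacement(lr=5e-6, steps=4000, avg_grad_norm=1.0)
print(a, b)  # both ≈ 0.02
```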
### 2. Dataset diversity
Narrow datasets cause more forgetting than diverse ones. Training on
1000 examples of the same pattern hammers the same weights repeatedly.
Training on 1000 diverse examples spreads the gradient across different
weight subsets.
**This is our key defense.** Our training data includes:
- Agent logs (graph walking, linking, reasoning)
- Conversation transcripts (technical discussion, emotional engagement)
- Dream-generated scenarios (diverse behavioral situations)
- Personality patterns (voice, boundaries, mode awareness)
The diversity means no single weight subset gets disproportionate updates.
### 3. Gradient sparsity
For a short training example (50-256 decision tokens), the gradient is
sparse — most weights get near-zero gradient because they weren't active
for that specific input. Only the weights that participated in generating
the decision tokens receive meaningful gradient signal.
This natural sparsity is another defense: each example only modifies a
small fraction of the weights, leaving the rest untouched.
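A toy sketch of measuring this sparsity. In a real run the gradient would come from autograd; here we fabricate one where only a small subset of weights was active (the sizes and threshold are assumptions for illustration):

```python
import numpy as np

def sparsity(grad: np.ndarray, threshold: float = 1e-8) -> float:
    """Fraction of gradient entries that are effectively zero."""
    return float(np.mean(np.abs(grad) < threshold))

rng = np.random.default_rng(0)
grad = np.zeros(10_000)                                  # 10k-parameter toy "model"
active = rng.choice(grad.size, size=500, replace=False)  # 5% of weights were active
grad[active] = rng.normal(size=active.size)
print(sparsity(grad))  # ≈ 0.95: only the active weights receive signal
```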
### 4. Rank of the gradient
Biderman et al. (2024, "LoRA Learns Less and Forgets Less") found that
full fine-tuning produces weight perturbations with effective rank
10-100× higher than typical LoRA configurations. Higher-rank perturbations
modify more independent directions in weight space, which increases
both learning AND forgetting.
LoRA's low-rank constraint acts as implicit regularization against
forgetting — it can only modify a low-dimensional subspace of the
weights, leaving most of the pre-trained solution intact.
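The low-rank constraint is easy to see in a toy NumPy example (this is an illustration of a rank-r product, not LoRA's or Apollo's actual implementation): a perturbation built as `B @ A` can never have rank above r, so it can only move the weights inside an r-dimensional subspace.

```python
import numpy as np

d, r = 8, 2
rng = np.random.default_rng(1)
B = rng.normal(size=(d, r))   # LoRA-style "down" factor
A = rng.normal(size=(r, d))   # LoRA-style "up" factor
delta_w = B @ A               # the weight perturbation added to W
print(np.linalg.matrix_rank(delta_w))  # at most r = 2
```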
**Apollo's rank-256 sits between LoRA and full fine-tuning.** The
projected optimizer constrains the scaling (not the gradient itself)
to 256 dimensions. The gradient itself is still full-rank. This means
Apollo modifies all weights (like full fine-tuning) but with a more
structured update pattern (like LoRA). Whether this provides forgetting
protection similar to LoRA is an open question.
## Defenses against forgetting
### 1. Diversity of training data (our primary defense)
The most effective and simplest defense. If the training data covers
the same breadth as the pre-training data (proportionally), the model
maintains its broad capabilities while learning new patterns.
For us: mix behavioral examples with general capability examples.
Include agent logs alongside conversation corrections.
### 2. Low learning rate + few steps
Keep the total weight displacement small. The pre-trained basin is
deep — small perturbations stay within it.
For us: lr=1e-5 as starting point. One epoch over diverse data.
Monitor perplexity on held-out general text.
### 3. Context-frozen training (our approach)
By computing gradients only on decision tokens (50-256 tokens) rather
than full conversations (thousands of tokens), we limit the gradient
magnitude per example. The context contributes to the forward pass
(determining which weights are active) but not to the backward pass
(determining which weights change).
This naturally limits the total gradient norm per training step.
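A minimal sketch of the masking step, assuming the common convention (used by PyTorch's `CrossEntropyLoss` via `ignore_index=-100`) that a label of -100 excludes a token from the loss and hence from the backward pass. The token ids and split point are illustrative:

```python
IGNORE_INDEX = -100  # convention: torch.nn.CrossEntropyLoss(ignore_index=-100)

def mask_context(token_ids: list[int], n_context: int) -> list[int]:
    """Copy token ids as labels, excluding the context prefix from the loss."""
    labels = list(token_ids)
    labels[:n_context] = [IGNORE_INDEX] * n_context
    return labels

tokens = list(range(10))          # a 10-token example (ids are illustrative)
labels = mask_context(tokens, 7)  # first 7 tokens are frozen context
print(labels)  # [-100, -100, -100, -100, -100, -100, -100, 7, 8, 9]
```

The context still shapes the forward pass (it determines which weights are active); only the unmasked suffix contributes gradient.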
### 4. Elastic Weight Consolidation (EWC)
Add a regularization term that penalizes changes to "important" weights:
```
L_total = L_task + λ Σ_i F_i (θ_i - θ*_i)²
```
where F_i is the Fisher information for parameter i and θ*_i is the
pre-trained value. Important weights (high Fisher information) are
penalized more for changing.
**Drawback**: requires computing and storing the Fisher information
matrix (same size as the model). Not practical for 27B parameters.
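The penalty itself is a one-liner; a toy sketch with a 3-parameter "model" and a fabricated diagonal Fisher estimate (all values illustrative):

```python
import numpy as np

def ewc_penalty(theta, theta_star, fisher, lam: float) -> float:
    """λ Σ_i F_i (θ_i - θ*_i)² over a diagonal Fisher estimate."""
    return float(lam * np.sum(fisher * (theta - theta_star) ** 2))

theta_star = np.array([1.0, 0.0, -1.0])  # pre-trained values θ*
theta      = np.array([1.5, 0.0, -1.0])  # only the first weight moved
fisher     = np.array([2.0, 0.1, 0.1])   # first weight is "important"
print(ewc_penalty(theta, theta_star, fisher, lam=1.0))  # 0.5
```

Moving an unimportant weight the same distance would cost 20× less penalty here, which is the whole point of weighting by Fisher information.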
### 5. Replay / rehearsal
Mix in examples from the pre-training distribution alongside fine-tuning
data. This maintains the original capabilities by continuing to train
on them.
**Drawback**: requires access to pre-training data or a representative
subset. For open models like Qwen3.5, the pre-training data isn't
publicly available. Could use a proxy (e.g., Wikipedia, code).
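A sketch of the mixing step, using a deterministic interleave for clarity (the 1-in-2 replay rate and the example names are assumptions, not recommendations):

```python
def mixed_stream(finetune, replay, replay_every=4):
    """Yield fine-tune examples, inserting one replay example every few steps."""
    for i, example in enumerate(finetune, start=1):
        yield example
        if i % replay_every == 0:
            yield replay[(i // replay_every - 1) % len(replay)]

ft = [f"behavioral-{i}" for i in range(5)]
proxy = ["wiki-A", "code-B"]  # stand-ins for Wikipedia/code proxy examples
print(list(mixed_stream(ft, proxy, replay_every=2)))
# ['behavioral-0', 'behavioral-1', 'wiki-A', 'behavioral-2', 'behavioral-3', 'code-B', 'behavioral-4']
```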
### 6. Apollo's implicit regularization
The Apollo paper (Section 5.5) provides evidence that Apollo has lower
directional sharpness than AdamW, and its "SGD-like" update behavior
provides natural regularization. Table 10 shows Apollo/Apollo-Mini
achieve comparable or better directional sharpness to SGD.
This suggests Apollo may be inherently more resistant to forgetting
than AdamW, though the paper doesn't test this directly.
## Monitoring for forgetting
### Perplexity on held-out data
The simplest and most reliable metric. Compute perplexity on a diverse
held-out set (e.g., WikiText, a mix of code and natural language) before
and after training. If perplexity increases significantly, forgetting
is occurring.
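The check itself, assuming per-token negative log-likelihoods (in nats) from the model on the held-out set; the NLL values below are fabricated for illustration:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

before = perplexity([2.0, 2.2, 1.8])  # baseline on held-out tokens
after  = perplexity([2.4, 2.6, 2.2])  # same tokens after a training run
print(after > before)  # True: perplexity rose, a forgetting signal
```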
### Task-specific benchmarks
Run a small suite of tasks (code generation, reasoning, general knowledge)
before and after each training session. Track scores over time.
### Output quality spot-checks
For our use case, the most relevant check: does the model still write
good code? Still reason about filesystem internals? Still maintain
conversation coherently? These are qualitative but immediately noticeable.
## Practical recommendations for our system
1. **Start with lr=1e-5, single epoch, diverse training set**
2. **Monitor perplexity on held-out text after each training session**
3. **Include 20-30% general capability examples alongside behavioral ones**
4. **Use context-frozen training to limit gradient magnitude**
5. **The dream loop generates diversity naturally — different scenarios
exercise different model capabilities**
6. **If forgetting is detected: reduce lr, increase data diversity,
or reduce training frequency**
7. **Keep the pre-trained checkpoint on moria as rollback safety net**