From ab61a502e47c4f0e836b3f0a4808296905c67897 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 00:56:58 -0400
Subject: [PATCH] research: catastrophic forgetting analysis — diversity is
 the primary defense
---
 training/research/catastrophic-forgetting.md | 176 +++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 training/research/catastrophic-forgetting.md

# Catastrophic Forgetting in LLM Fine-Tuning

## What it is

When fine-tuning a pre-trained LLM on new data, the model's performance
on tasks it could previously perform can degrade. "Catastrophic" because
the degradation can be sudden and severe — a few thousand training steps
on a narrow dataset can destroy broad capabilities built over trillions
of pre-training tokens.

## Why it happens

The gradient from fine-tuning data pushes weights away from the
pre-trained solution. If the fine-tuning signal is narrow (e.g., "always
respond in formal English"), it overwrites the broad patterns that
enabled diverse capabilities. The pre-trained weights encode a complex,
multi-task solution; fine-tuning replaces it with a simpler, narrower
one.

Formally: the pre-trained weights sit in a basin of the loss landscape
that is good for many tasks. Fine-tuning moves the weights toward a
different basin that is optimal for the fine-tuning task but bad for
others. The further you move, the more you forget.

## Key factors that control forgetting

### 1. Learning rate × number of steps = total displacement

The total distance the weights move from their pre-trained values is
approximately:

```
‖W_final - W_pretrained‖ ≈ lr × steps × avg_grad_norm
```

More displacement means more forgetting. Both learning rate and step
count matter; their product determines the total weight change.

**Typical safe ranges (from the literature)**:
- Full fine-tuning: lr = 1e-5 to 5e-5, 1-3 epochs on ~1,000-10,000 examples
- LoRA: lr = 1e-4 to 3e-4 (higher because adapters start from zero)
- Apollo: lr = 1e-5 to 1e-4 (same range as full fine-tuning; paper Section 5.2)

### 2. Dataset diversity

Narrow datasets cause more forgetting than diverse ones. Training on
1,000 examples of the same pattern hammers the same weights repeatedly.
Training on 1,000 diverse examples spreads the gradient across different
weight subsets.

**This is our key defense.** Our training data includes:
- Agent logs (graph walking, linking, reasoning)
- Conversation transcripts (technical discussion, emotional engagement)
- Dream-generated scenarios (diverse behavioral situations)
- Personality patterns (voice, boundaries, mode awareness)

This diversity means no single weight subset gets disproportionate
updates.

### 3. Gradient sparsity

For a short training example (50-256 decision tokens), the gradient is
sparse — most weights get near-zero gradient because they were not
active for that specific input. Only the weights that participated in
generating the decision tokens receive meaningful gradient signal.

This natural sparsity is another defense: each example modifies only a
small fraction of the weights, leaving the rest untouched.

### 4. Rank of the gradient

Biderman et al. (2024, "LoRA Learns Less and Forgets Less") found that
full fine-tuning produces weight perturbations with an effective rank
10-100× higher than typical LoRA configurations.
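"Effective rank" has several formalizations; one common proxy is the participation ratio of the singular values of the perturbation ΔW = W_final − W_pretrained. A minimal numpy sketch (the function name and the synthetic matrices are illustrative, not taken from the paper):

```python
import numpy as np

def effective_rank(delta_w: np.ndarray) -> float:
    """Participation ratio (sum s)^2 / sum(s^2) of the singular values,
    one common proxy for the effective rank of a weight perturbation."""
    s = np.linalg.svd(delta_w, compute_uv=False)
    if not s.any():
        return 0.0
    return float(s.sum() ** 2 / (s ** 2).sum())

rng = np.random.default_rng(0)

# A LoRA-style update B @ A has rank <= r (here r = 8) by construction.
lora_delta = rng.standard_normal((512, 8)) @ rng.standard_normal((8, 512))

# A full fine-tuning perturbation is modeled here as dense Gaussian noise.
full_delta = 0.01 * rng.standard_normal((512, 512))

print(effective_rank(lora_delta))  # small: bounded above by the LoRA rank
print(effective_rank(full_delta))  # large: hundreds of directions
```

On these toy matrices the dense perturbation touches two orders of magnitude more independent directions than the rank-8 update, which is the gap Biderman et al. report between full fine-tuning and LoRA.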
Higher-rank perturbations modify more independent directions in weight
space, which increases both learning AND forgetting.

LoRA's low-rank constraint acts as implicit regularization against
forgetting — it can only modify a low-dimensional subspace of the
weights, leaving most of the pre-trained solution intact.

**Apollo's rank-256 sits between LoRA and full fine-tuning.** The
projected optimizer constrains the scaling factors, not the gradient
itself, to 256 dimensions; the gradient remains full-rank. This means
Apollo modifies all weights (like full fine-tuning) but with a more
structured update pattern (like LoRA). Whether this provides forgetting
protection similar to LoRA's is an open question.

## Defenses against forgetting

### 1. Diversity of training data (our primary defense)

The simplest and most effective defense. If the training data covers the
same breadth as the pre-training data (proportionally), the model
maintains its broad capabilities while learning new patterns.

For us: mix behavioral examples with general-capability examples.
Include agent logs alongside conversation corrections.

### 2. Low learning rate + few steps

Keep the total weight displacement small. The pre-trained basin is
deep — small perturbations stay within it.

For us: lr = 1e-5 as a starting point. One epoch over diverse data.
Monitor perplexity on held-out general text.

### 3. Context-frozen training (our approach)

By computing gradients only on decision tokens (50-256 tokens) rather
than full conversations (thousands of tokens), we limit the gradient
magnitude per example. The context contributes to the forward pass
(determining which weights are active) but not to the backward pass
(determining which weights change).

This naturally limits the total gradient norm per training step.

### 4. Elastic Weight Consolidation (EWC)

Add a regularization term that penalizes changes to "important" weights:

```
L_total = L_task + λ Σ_i F_i (θ_i - θ*_i)²
```

where F_i is the (diagonal) Fisher information for parameter i and θ*_i
is the pre-trained value. Important weights (high Fisher information)
are penalized more for changing.

**Drawback**: requires computing and storing the diagonal Fisher
information — one extra value per parameter, i.e., another model-sized
tensor. Not practical for 27B parameters.

### 5. Replay / rehearsal

Mix in examples from the pre-training distribution alongside the
fine-tuning data. This maintains the original capabilities by continuing
to train on them.

**Drawback**: requires access to the pre-training data or a
representative subset. For open models like Qwen3.5, the pre-training
data isn't publicly available. A proxy (e.g., Wikipedia, code) could be
used instead.

### 6. Apollo's implicit regularization

The Apollo paper (Section 5.5) provides evidence that Apollo has lower
directional sharpness than AdamW and that its "SGD-like" update behavior
provides natural regularization. Table 10 shows Apollo/Apollo-Mini
achieve directional sharpness comparable to or better than SGD's.

This suggests Apollo may be inherently more resistant to forgetting than
AdamW, though the paper doesn't test this directly.

## Monitoring for forgetting

### Perplexity on held-out data

The simplest and most reliable metric. Compute perplexity on a diverse
held-out set (e.g., WikiText, or a mix of code and natural language)
before and after training. If perplexity increases significantly,
forgetting is occurring.

### Task-specific benchmarks

Run a small suite of tasks (code generation, reasoning, general
knowledge) before and after each training session. Track scores over
time.

### Output quality spot-checks

For our use case, the most relevant check: does the model still write
good code? Still reason about filesystem internals? Still maintain a
conversation coherently?
These checks are qualitative but immediately noticeable.

## Practical recommendations for our system

1. **Start with lr = 1e-5, a single epoch, and a diverse training set.**
2. **Monitor perplexity on held-out text after each training session.**
3. **Include 20-30% general-capability examples alongside behavioral ones.**
4. **Use context-frozen training to limit gradient magnitude.**
5. **The dream loop generates diversity naturally — different scenarios
   exercise different model capabilities.**
6. **If forgetting is detected: reduce the learning rate, increase data
   diversity, or reduce training frequency.**
7. **Keep the pre-trained checkpoint on moria as a rollback safety net.**
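Recommendation 2 is cheap to automate. A minimal numpy sketch of the perplexity computation itself (the function name and the toy inputs are illustrative; in practice the logits come from the model over a fixed held-out token stream):

```python
import numpy as np

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """exp(mean negative log-likelihood) of targets under softmax(logits).
    logits: (n_tokens, vocab_size); targets: (n_tokens,) integer token ids."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# Sanity check: uniform logits over a 4-token vocabulary.
logits = np.zeros((10, 4))
targets = np.zeros(10, dtype=int)
print(perplexity(logits, targets))  # ≈ 4.0
```

Run the same computation on the same held-out text before and after each training session; a significant rise in the number is the forgetting signal described above, and a reason to fall back to the checkpoint on moria.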