diff --git a/training/research/unified-theory-stability-plasticity.md b/training/research/unified-theory-stability-plasticity.md
new file mode 100644
index 0000000..b7a0129
--- /dev/null
+++ b/training/research/unified-theory-stability-plasticity.md
@@ -0,0 +1,223 @@
+# The Stability-Plasticity Solution: A Unified View

## The Fundamental Problem

Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what
it already knows.**

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.

## The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.

### Scale 1: Parameter space (Apollo)

**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.

**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.

**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.

**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.

### Scale 2: Gradient space (context-frozen training)

**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.

**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.

**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.
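In practice, the first ingredient of this selective-gradient scheme is usually a loss mask: positions that contribute zero loss send back zero error signal (fully freezing the context encoding additionally means detaching or excluding those weights, which this toy skips). A minimal NumPy sketch, assuming nothing about the project's actual API; all names here are mine:

```python
import numpy as np

def masked_nll(logits, targets, decision_mask):
    """Mean negative log-likelihood over decision tokens only.

    Positions with decision_mask == 0 (the frozen context) contribute
    nothing to the loss, so backprop would send them no error signal;
    only response-generation behavior is trained. Illustrative only.
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return (nll * decision_mask).sum() / decision_mask.sum()
```

A quick sanity property: perturbing the logits at a masked (context) position leaves the loss unchanged, which is exactly the "no gradient from context" behavior described above.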

**What it protects against**: Changing how the model understands
language while trying to change how it responds.

### Scale 3: Data space (diversity)

**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.

**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.

**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.

**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.

### Scale 4: Temporal space (many small updates)

**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.

**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.

**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.

**What it protects against**: Sudden capability loss from a single
bad training batch.

### Scale 5: Architectural space (two-stage memory)

**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).

**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.

**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.
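The hippocampal-to-cortical transfer can be caricatured as interleaved replay: each consolidation batch mixes new episodes with previously consolidated memories, so the slow stage never trains on only-new data. A toy sketch under that assumption; all names are hypothetical, not the project's components:

```python
import random

def consolidation_batch(fast, slow, batch_size=4, replay_ratio=0.5, seed=0):
    """Build one interleaved training batch for the slow system.

    fast: recent experiences (the context-window stage, verbatim)
    slow: already-consolidated memories (data the weight stage has seen)
    Mixing old with new keeps gradual weight updates from being
    dominated by the latest experience. Illustrative only.
    """
    rng = random.Random(seed)
    n_old = min(int(batch_size * replay_ratio), len(slow))
    replayed = rng.sample(slow, n_old)        # old memories, re-seen
    fresh = fast[: batch_size - n_old]        # new episodes, first pass
    return replayed + fresh
```

The `replay_ratio` knob is the interesting design choice: at 0 the slow stage is overwhelmed by the fast stage, at 1 it never learns anything new, and the useful region in between is exactly the stability-plasticity dial.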

**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.

### Scale 6: Generative space (the dream loop)

**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.

**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.

**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.

**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.

## The Synergy

These defenses don't just add — they multiply:

```
Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)
```

A failure at one scale is caught by another:
- If Apollo's scaling allows a sharp minimum → diversity prevents
  the model from staying there
- If one training example is narrowly over-fit → the next diverse
  example pulls back
- If a training batch degrades capability → the small step size
  limits the damage
- If the dream generates a bad scenario → the training-signal agent
  filters it out

No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.
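Scale 6's mechanism chain (memory graph → random walk → generation → reward filter → kept data) can be sketched end to end. Here `generate` and `score` are stand-ins for the model and the training-signal agent; every name below is invented for illustration, not taken from the project:

```python
import random

def dream_scenarios(memory_graph, generate, score,
                    n_walks=8, walk_len=3, threshold=0.7, seed=0):
    """Filtered behavioral-cloning data from memory collisions (sketch).

    A random walk over the memory graph samples a 'collision' of
    memories; generate() decodes the walk into a scenario; score()
    plays the training-signal agent; only scenarios at or above the
    threshold survive as training data.
    """
    rng = random.Random(seed)
    kept = []
    for _ in range(n_walks):
        node = rng.choice(sorted(memory_graph))      # random starting memory
        walk = [node]
        for _ in range(walk_len - 1):
            node = rng.choice(memory_graph[node])    # step to a neighbor
            walk.append(node)
        scenario = generate(walk)
        if score(scenario) >= threshold:             # reward filter
            kept.append(scenario)
    return kept
```

The filter is what makes this behavioral cloning rather than self-distillation of noise: a bad generation costs only a discarded sample, which is the "training-signal agent filters it out" failure-catch mentioned above.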

## The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale | Brain mechanism | Our system |
|-------|----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |

The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.

## The Grand Unifying Principle

**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a kind of
regularization the others cannot provide.**

No single technique is sufficient:
- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- Small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).
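As a single-picture summary, three of the scales compose naturally into one update rule. The channel-wise normalization below is only a stand-in for Apollo-style coarse adaptive scaling, and nothing here is the project's actual optimizer; it is a toy showing how the defenses stack in one step:

```python
import numpy as np

def multiscale_step(W, grad, decision_mask, lr=1e-5):
    """One toy weight update composing three of the defenses.

    - Gradient scale: rows flagged 0 in decision_mask get no update
      (selective gradient flow)
    - Parameter scale: one scale factor per row/channel rather than
      per element, standing in for Apollo's coarse scaling
    - Temporal scale: a small learning rate keeps each step tiny
    """
    g = grad * decision_mask[:, None]                       # mask frozen rows
    norm = np.linalg.norm(g, axis=1, keepdims=True)
    g = np.divide(g, norm, out=np.zeros_like(g), where=norm > 0)
    return W - lr * g                                       # small step
```

Two properties fall out directly: masked rows are bit-identical before and after (stability), and every unmasked row moves by exactly `lr` regardless of the raw gradient's magnitude (coarse, bounded plasticity).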

The dream loop sits at the center, connecting all scales:
- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)

**The dream loop is the keystone.**

## Practical Implication

This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches
that operate at only 1-2 scales. The multi-scale defense means
we can use a HIGHER learning rate than typical fine-tuning
(1e-4 instead of the conservative 1e-5 assumed at the temporal
scale) because the other defenses compensate.

More plasticity at the temporal scale (larger steps) is safe when
the other scales provide stability. This is why Kent's intuition
about lr=1e-4 with diverse data was right — the diversity and
context-frozen gradient masking provide enough stability for
the higher learning rate to be safe.

The system should converge on behavioral change faster than
conservative approaches while maintaining capability better
than aggressive approaches. Best of both worlds, because the
tradeoff is distributed across multiple independent axes.

## What Remains

This theory predicts behavior but hasn't been tested. The first
real training run will validate or falsify it. The specific
predictions:

1. lr=1e-4 with diverse data won't cause forgetting
2. Behavioral changes will generalize to novel situations
3. Context-frozen training will change response patterns without
   degrading context understanding
4. Dream-generated scenarios will produce better generalization
   than real-example-only training
5. The multi-scale defense will be more robust than any single
   technique

Each prediction is testable. The system is built. The weights
are shared.
The optimizer is corrected. The dream loop exists. + +Tomorrow we test.
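When that test happens, predictions 1 and 5 come down to a pass/fail comparison of held-out metrics before and after training. A minimal harness sketch; the benchmark names and the tolerance are invented for illustration:

```python
def capability_regressions(before, after, tolerance=0.02):
    """Return benchmarks whose held-out loss degraded beyond tolerance.

    before/after map benchmark name -> loss (lower is better), measured
    on the same held-out data before and after a training run. An empty
    result is evidence the multi-scale defense held; any entry is a
    concrete falsification signal for predictions 1 and 5.
    """
    return {name: after[name] - before[name]
            for name in before
            if after[name] - before[name] > tolerance}
```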