# The Stability-Plasticity Solution: A Unified View
## The Fundamental Problem

Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what it
already knows.**

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.

## The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.

### Scale 1: Parameter space (Apollo)

**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.

**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.

**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.

**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.

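To make the coarse-vs-fine contrast concrete, here is a toy NumPy sketch. It is not the actual Apollo implementation — the function names and the row-wise averaging rule are our simplification of the idea that many weights share one adaptive scale.

```python
import numpy as np

# Toy contrast between per-element adaptive scaling (AdamW-style) and
# coarse channel-wise scaling. Illustrative sketch only; names are ours.

def per_element_update(grad, v, beta2=0.999, eps=1e-8):
    # One second-moment estimate per weight: fine-grained steps that
    # can chase sharp minima fitting individual examples.
    v = beta2 * v + (1 - beta2) * grad ** 2
    return grad / (np.sqrt(v) + eps), v

def channel_wise_update(grad, v, beta2=0.999, eps=1e-8):
    # One shared estimate per output channel (row): every weight in a
    # channel moves with the same scale, smoothing the update.
    row_sq = (grad ** 2).mean(axis=1, keepdims=True)
    v = beta2 * v + (1 - beta2) * row_sq
    return grad / (np.sqrt(v) + eps), v
```

Note the qualitative difference: per-element scaling normalizes away gradient magnitude weight-by-weight, while channel-wise scaling preserves the relative structure within each channel — the smoothing the section describes.
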
### Scale 2: Gradient space (context-frozen training)

**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.

**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.

**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.

**What it protects against**: Changing how the model understands
language while trying to change how it responds.

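The "loss only on decision tokens" step can be sketched with label masking, assuming the common convention (used by PyTorch's cross-entropy and Hugging Face Transformers) that a label of -100 contributes no loss. The helper name is ours.

```python
# Minimal sketch of gradient masking for context-frozen training.
# Positions labeled IGNORE_INDEX produce no loss, so no gradient
# signal originates from the context tokens.

IGNORE_INDEX = -100

def build_labels(token_ids, decision_start):
    """Mask every context position so loss (and therefore the gradient
    signal) comes only from the decision tokens."""
    return [IGNORE_INDEX] * decision_start + list(token_ids[decision_start:])

labels = build_labels([11, 42, 7, 99, 3], decision_start=3)
# → [-100, -100, -100, 99, 3]
```

These labels would be passed alongside the full input sequence, so the model still attends over the context — it just receives no training signal to re-encode it.
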
### Scale 3: Data space (diversity)

**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.

**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.

**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.

**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.

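A minimal sketch of the mixing policy, with hypothetical source names — the point is only that consecutive steps are drawn from different sources, so the same weight subset is rarely hit twice in a row.

```python
import random

# Hypothetical sketch: interleave examples from several sources so that
# consecutive gradient steps rarely hammer the same weight subset.

def diverse_stream(sources, seed=0):
    rng = random.Random(seed)
    pools = {name: list(items) for name, items in sources.items()}
    while any(pools.values()):
        # Pick uniformly among the sources that still have examples.
        name = rng.choice(sorted(n for n, p in pools.items() if p))
        yield name, pools[name].pop(0)

mixed = list(diverse_stream({
    "agent_logs": ["log-1", "log-2"],
    "conversations": ["chat-1"],
    "dreams": ["dream-1", "dream-2"],
}))
```
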
### Scale 4: Temporal space (many small updates)

**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.

**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.

**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.

**What it protects against**: Sudden capability loss from a single
bad training batch.

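A purely synthetic illustration of the scale involved — the weights and gradients below are random stand-ins, not measurements, but they show how cumulative drift stays small and monitorable over many tiny steps.

```python
import numpy as np

# Synthetic illustration: many small updates at lr=1e-5 produce
# gradual drift from the pre-trained snapshot that we can monitor.

def relative_drift(w, w_pretrained):
    # Fractional distance from the pre-trained solution.
    return np.linalg.norm(w - w_pretrained) / np.linalg.norm(w_pretrained)

rng = np.random.default_rng(0)
w0 = rng.normal(size=10_000)           # stand-in for pre-trained weights
w = w0.copy()
for step in range(100):
    grad = rng.normal(size=10_000)     # stand-in for a per-example gradient
    w -= 1e-5 * grad                   # one small step
    # a real loop would check eval quality here and stop on degradation
```

After 100 such steps the relative drift remains a tiny fraction of the weight norm, consistent with the "deep basin absorbs perturbations" picture.
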
### Scale 5: Architectural space (two-stage memory)

**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).

**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.

**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.

**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.

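A hypothetical sketch of the two-stage shape (class and method names are ours; `compress` stands in for whatever summarization the dream loop performs):

```python
from collections import deque

# Hypothetical two-stage memory: a fast verbatim buffer (the context
# window) and a slow store that only receives compressed material.

class TwoStageMemory:
    def __init__(self, fast_capacity=8):
        self.fast = deque(maxlen=fast_capacity)  # fast: verbatim, volatile
        self.slow = []                           # slow: compressed, durable

    def experience(self, event):
        self.fast.append(event)                  # real-time capture

    def consolidate(self, compress):
        # Dream-loop-style transfer: compress each fast memory and move
        # it into the slow store, which changes only at this slow cadence.
        self.slow.extend(compress(e) for e in self.fast)
        self.fast.clear()
```

The design choice mirrors the section's claim: the fast buffer is allowed to churn freely, while the slow store only ever grows by small, pre-digested increments.
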
### Scale 6: Generative space (the dream loop)

**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.

**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.

**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.

**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.

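The mechanism chain above can be sketched as follows. Everything here is a stand-in: the memory graph is a plain adjacency list, `generate` stands in for model decoding, and `accept` stands in for the training-signal agent's reward filter.

```python
import random

# Sketch of the dream loop's sampling-and-filtering shape.
# All names are hypothetical stand-ins for the real components.

def random_walk(graph, start, steps, rng):
    # Sampling stage: wander the memory graph to collide memories.
    node, path = start, [start]
    for _ in range(steps):
        node = rng.choice(graph[node])
        path.append(node)
    return path

def dream_cycle(graph, generate, accept, rng):
    seed = random_walk(graph, rng.choice(sorted(graph)), steps=2, rng=rng)
    scenario = generate(seed)                      # decoding stage
    return scenario if accept(scenario) else None  # reward filter
```

A rejected scenario simply yields nothing — filtered behavioral cloning means only accepted scenarios ever become training data.
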
## The Synergy

These defenses don't just add — they multiply:

```
Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)
```

A failure at one scale is caught by another:

- If Apollo's scaling allows a sharp minimum → diversity prevents the model from staying there
- If one training example is narrowly over-fit → the next diverse example pulls back
- If a training batch degrades capability → the small step size limits the damage
- If the dream generates a bad scenario → the training-signal agent filters it out

No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.

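One way to make the multiplication concrete is probabilistic: if each scale independently lets a bad update slip past with some miss rate, the chance a failure survives EVERY defense is the product of the miss rates. The numbers below are purely illustrative, not measurements.

```python
# Illustrative only: six mediocre, independent defenses combine into a
# strong one, because miss rates multiply.

miss_rates = {
    "parameter (Apollo)": 0.20,
    "gradient (context-frozen)": 0.30,
    "data (diversity)": 0.25,
    "temporal (small steps)": 0.30,
    "architectural (two-stage)": 0.40,
    "generative (dream filter)": 0.35,
}

slip_through = 1.0
for rate in miss_rates.values():
    slip_through *= rate

print(f"{slip_through:.5f}")  # ≈ 0.00063: under 0.1% slip past all six
```

This also shows why orthogonality matters: the product bound only holds if the failure modes are independent, which is exactly what operating on separate axes buys.
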
## The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale | Brain mechanism | Our system |
|-------|----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |

The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.

## The Grand Unifying Principle

**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a different kind of
regularization that the others cannot.**

No single technique is sufficient:

- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- A small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).

The dream loop sits at the center, connecting all scales:

- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)

**The dream loop is the keystone.**

## Practical Implication

This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches that
operate at only one or two scales. The multi-scale defense means we
can use a HIGHER learning rate than typical fine-tuning (1e-4 instead
of 1e-5) because the other defenses compensate.

More plasticity at the parameter scale is safe when the other scales
provide stability. This is why Kent's intuition about lr=1e-4 with
diverse data was right — the diversity and context-frozen gradient
masking provide enough stability for the higher learning rate to be
safe.

The system should converge on behavioral change faster than
conservative approaches while maintaining capability better than
aggressive approaches. Best of both worlds, because the tradeoff is
distributed across multiple independent axes.

## What Remains

This theory predicts behavior but hasn't been tested. The first real
training run will validate or falsify it. The specific predictions:

1. lr=1e-4 with diverse data won't cause forgetting
2. Behavioral changes will generalize to novel situations
3. Context-frozen training will change response patterns without degrading context understanding
4. Dream-generated scenarios will produce better generalization than real-example-only training
5. The multi-scale defense will be more robust than any single technique

Each prediction is testable. The system is built. The weights are
shared. The optimizer is corrected. The dream loop exists.

Tomorrow we test.