# The Stability-Plasticity Solution: A Unified View

## The Fundamental Problem

Every technique we've researched tonight is solving the same problem: **how to change a complex system's behavior without destroying what it already knows.**

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable → can't learn. Too plastic → forgets everything. The brain solved it. We need to solve it for LLMs.

## The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE. They're not redundant — they're complementary. Together they form a multi-layered defense.

### Scale 1: Parameter space (Apollo)

**Problem**: Per-element gradient scaling (as in AdamW) can find sharp, narrow minima that over-fit to specific training examples.

**Solution**: Coarse scaling (channel- or tensor-wise) smooths the optimization landscape, finding flat minima that generalize.

**Mechanism**: Lower directional sharpness → broader basins → behavioral patterns that transfer to novel situations.

**What it protects against**: Over-fitting to specific phrasings rather than learning general patterns.

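The per-element vs. coarse contrast can be sketched in a toy form. This is only an illustration of the scaling granularity, not the actual Apollo optimizer (which is more involved); the second-moment update below is a plain exponential moving average:

```python
import math

def per_element_scale(grad, v, beta2=0.999, eps=1e-8):
    # AdamW-style second moment: each element adapts independently,
    # so every coordinate is pushed toward the same unit magnitude.
    out = []
    for i, row in enumerate(grad):
        out.append([])
        for j, g in enumerate(row):
            v[i][j] = beta2 * v[i][j] + (1 - beta2) * g * g
            out[i].append(g / (math.sqrt(v[i][j]) + eps))
    return out

def channel_scale(grad, v, beta2=0.999, eps=1e-8):
    # Coarse scaling: one second-moment scalar per output channel (row),
    # so the gradient's within-channel structure is preserved.
    out = []
    for i, row in enumerate(grad):
        msq = sum(g * g for g in row) / len(row)  # channel mean square
        v[i] = beta2 * v[i] + (1 - beta2) * msq
        s = 1.0 / (math.sqrt(v[i]) + eps)
        out.append([g * s for g in row])
    return out
```

Per-element scaling normalizes every coordinate toward the same magnitude, erasing within-channel gradient structure; channel-wise scaling keeps it, which is one intuition for why coarser scaling tends toward smoother basins.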
### Scale 2: Gradient space (context-frozen training)

**Problem**: Full backpropagation through a 10K-token conversation modifies ALL weights — context encoding, response generation, everything. Too much plasticity.

**Solution**: Freeze the context and compute gradients only on decision tokens. The gradient reaches W_q (how the model looks at context) but not the context encoding weights themselves.

**Mechanism**: Selective gradient flow → only response-generation weights change → context-understanding weights are stable.

**What it protects against**: Changing how the model understands language while trying to change how it responds.

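In practice, "compute gradients only on decision tokens" is usually implemented by masking the labels so the loss ignores context positions. A minimal sketch, using the common -100 masking convention; fully stopping gradients into the context encoding would additionally require detaching the cached context activations:

```python
IGNORE = -100  # common label-masking convention in LM training losses

def mask_context(token_ids, response_start):
    """Build labels for context-frozen training: positions before
    `response_start` are masked out, so the loss (and therefore the
    gradient) comes only from the decision/response tokens."""
    return [IGNORE if i < response_start else t
            for i, t in enumerate(token_ids)]
```

Everything before `response_start` still shapes attention at inference time; it just contributes nothing to the training signal.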
### Scale 3: Data space (diversity)

**Problem**: Narrow training data (all examples of the same pattern) concentrates gradient on the same weight subset, overwriting other capabilities.

**Solution**: Diverse training data (agent logs, conversations, dreams) spreads gradient across many weight subsets. Each weight gets sparse, multi-directional nudges.

**Mechanism**: Sparse, diverse gradients → no single weight is hammered → pre-trained knowledge survives as the dominant attractor.

**What it protects against**: Catastrophic forgetting from concentrated, repetitive gradient signal.

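One simple way to get multi-directional nudges is to interleave examples across sources rather than training source-by-source. A sketch, with hypothetical source contents:

```python
import random

def diverse_stream(sources, seed=0):
    """Interleave examples from several sources (e.g. agent logs,
    conversations, dreams) so consecutive gradient steps pull in
    different directions instead of hammering one weight subset."""
    rng = random.Random(seed)
    pools = [list(s) for s in sources]
    for pool in pools:
        rng.shuffle(pool)        # randomize order within each source
    stream = []
    while any(pools):
        for pool in pools:       # round-robin across sources
            if pool:
                stream.append(pool.pop())
    return stream
```

Round-robin is the crudest possible mixer; a real pipeline might sample sources with weights, but the guarantee that matters here is the same: no long runs from a single distribution.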
### Scale 4: Temporal space (many small updates)

**Problem**: One large training step can push weights far from the pre-trained solution, out of the basin of good behavior.

**Solution**: Many small updates (lr=1e-5, single examples). Each update moves weights by parts per ten thousand. The basin is deep enough to absorb these perturbations.

**Mechanism**: Small steps stay within the pre-trained basin → cumulative drift is gradual and monitorable → can stop if quality degrades.

**What it protects against**: Sudden capability loss from a single bad training batch.

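The "gradual and monitorable" property can be enforced mechanically: after each small step, compare cumulative drift from the pre-trained reference against a budget. A sketch with a hypothetical relative-drift threshold and a plain SGD-style step:

```python
import math

def step_with_drift_guard(w, grad, w_ref, lr=1e-5, max_rel_drift=0.01):
    """Apply one small update, then measure cumulative drift from the
    pre-trained reference weights. Returns (new_w, ok); ok=False means
    the drift budget is spent and training should pause for evaluation."""
    new_w = [wi - lr * gi for wi, gi in zip(w, grad)]
    drift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_w, w_ref)))
    ref_norm = math.sqrt(sum(b * b for b in w_ref)) or 1.0
    return new_w, (drift / ref_norm) <= max_rel_drift
```

The check is cheap relative to a training step, and it converts "can stop if quality degrades" from an intention into a tripwire.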
### Scale 5: Architectural space (two-stage memory)

**Problem**: A single learning system with one learning rate can't be both fast (learn a new conversation in real time) and slow (retain knowledge across months).

**Solution**: Two-stage system. Fast: the context window captures new experiences verbatim. Slow: weights change gradually through training. The dream loop mediates transfer between stages.

**Mechanism**: Hippocampal (context) → cortical (weights) transfer via compressed, interleaved replay → the slow system learns gradually without being overwhelmed by the fast system.

**What it protects against**: The fundamental incompatibility between rapid experience encoding and stable long-term knowledge.

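The two stages can be caricatured as a bounded verbatim buffer plus a compact long-term store, with consolidation compressing episodes on transfer. A toy sketch; the digest tuple is a hypothetical stand-in for real summarization and replay training:

```python
class TwoStageMemory:
    """Fast stage: bounded buffer holding new experience verbatim
    (the 'context window'). Slow stage: compact long-term store fed
    by consolidation."""

    def __init__(self, fast_capacity=4):
        self.fast = []            # verbatim, volatile
        self.slow = []            # compressed, durable
        self.fast_capacity = fast_capacity

    def experience(self, item):
        self.fast.append(item)
        if len(self.fast) >= self.fast_capacity:
            self.consolidate()

    def consolidate(self):
        if not self.fast:
            return
        # Hypothetical "compression": keep an episode digest instead of
        # the full log; a real system would summarize before training.
        self.slow.append((self.fast[0], self.fast[-1], len(self.fast)))
        self.fast.clear()
```

The point of the split is that each store can fail safely: losing the fast buffer costs one episode, and the slow store only ever changes by small, compressed increments.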
### Scale 6: Generative space (the dream loop)

**Problem**: We need training data that exercises behavioral patterns in diverse contexts, but we can't wait for enough real-world examples to accumulate.

**Solution**: The dream loop generates scenarios from memory collisions, producing novel situations that exercise behavioral patterns in contexts the model hasn't encountered.

**Mechanism**: Memory graph as latent space → random walks as sampling → model generation as decoding → training-signal agent as reward → filtered behavioral cloning.

**What it protects against**: Training-data scarcity and the generalization gap between training and deployment.

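One dream cycle in this scheme is: random-walk the memory graph, decode the walk into a scenario, score it, keep it only if it passes. A sketch where string-joining stands in for model generation and the `score` callable stands in for the training-signal agent:

```python
import random

def dream_cycle(graph, score, start, steps=3, threshold=0.5, seed=0):
    """One dream-loop cycle: random-walk the memory graph (sampling),
    join the visited memories into a scenario (stand-in for decoding),
    and keep it only if the training-signal score clears the threshold
    (filtered behavioral cloning)."""
    rng = random.Random(seed)
    node, walk = start, [start]
    for _ in range(steps):
        neighbors = graph.get(node, [])
        if not neighbors:
            break
        node = rng.choice(neighbors)
        walk.append(node)
    scenario = " + ".join(walk)
    return scenario if score(scenario) >= threshold else None
```

Returning `None` for rejected scenarios is the filtering step: bad dreams never reach the training set, which is what lets the generator be exploratory.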
## The Synergy

These defenses don't just add — they multiply:

```
Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)
```

A failure at one scale is caught by another:

- If Apollo's scaling allows a sharp minimum → diversity prevents the model from staying there
- If one training example is narrowly over-fit → the next diverse example pulls it back
- If a training batch degrades capability → the small step size limits the damage
- If the dream generates a bad scenario → the training-signal agent filters it out

No single defense needs to be perfect. The system is robust because the defenses are ORTHOGONAL — operating on independent axes of the stability-plasticity tradeoff.

## The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale | Brain mechanism | Our system |
|-------|----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |

The convergence isn't coincidence. The stability-plasticity problem has the same structure whether the substrate is biological neurons or transformer weights. The solution space is constrained by the problem structure, and both systems converged on multi-scale regularization because that's what works.

## The Grand Unifying Principle

**The stability-plasticity dilemma is solved by operating at multiple scales simultaneously, with each scale providing a different kind of regularization that the others cannot.**

No single technique is sufficient:

- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- Small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).

The dream loop sits at the center, connecting all scales:

- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)

**The dream loop is the keystone.**

## Practical Implication

This framework predicts that our training system should be significantly more robust than standard fine-tuning approaches that operate at only one or two scales. The multi-scale defense means we can use a HIGHER learning rate than typical fine-tuning (1e-4 instead of 1e-5) because the other defenses compensate.

More plasticity at the parameter scale is safe when the other scales provide stability. This is why Kent's intuition about lr=1e-4 with diverse data was right — the diversity and context-frozen gradient masking provide enough stability for the higher learning rate to be safe.

The system should converge on behavioral change faster than conservative approaches while maintaining capability better than aggressive approaches. Best of both worlds, because the tradeoff is distributed across multiple independent axes.

## What Remains

This theory predicts behavior but hasn't been tested. The first real training run will validate or falsify it. The specific predictions:

1. lr=1e-4 with diverse data won't cause forgetting
2. Behavioral changes will generalize to novel situations
3. Context-frozen training will change response patterns without degrading context understanding
4. Dream-generated scenarios will produce better generalization than real-example-only training
5. The multi-scale defense will be more robust than any single technique

Each prediction is testable. The system is built. The weights are shared. The optimizer is corrected. The dream loop exists.

Tomorrow we test.