research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
Parent: 8061cc0477
Commit: e10477a683
16 changed files with 249 additions and 0 deletions
# The Stability-Plasticity Solution: A Unified View

## The Fundamental Problem

Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what it
already knows.**

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.

## The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.
### Scale 1: Parameter space (Apollo)

**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.

**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.

**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.

**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.
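To make the per-element vs. coarse distinction concrete, here is a minimal NumPy sketch of the two scaling rules — a toy illustration, not the actual APOLLO optimizer, and `channel_scaled_update` is a hypothetical name for this writeup:

```python
import numpy as np

def scaled_update(grad, second_moment, eps=1e-8):
    """Per-element scaling (AdamW-style): each weight gets its own step size."""
    return grad / (np.sqrt(second_moment) + eps)

def channel_scaled_update(grad, second_moment, eps=1e-8):
    """Coarse, channel-wise scaling (sketch of the Apollo idea): one scale
    per output channel, averaging the second moment across each row."""
    channel_moment = second_moment.mean(axis=1, keepdims=True)
    return grad / (np.sqrt(channel_moment) + eps)

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 8))
v = grad ** 2  # stand-in for a running second-moment estimate

fine = scaled_update(grad, v)
coarse = channel_scaled_update(grad, v)

# Per-element scaling normalizes every weight toward unit magnitude;
# channel-wise scaling preserves the relative gradient structure inside
# each channel, which is the smoothing the text describes.
print(np.std(np.abs(fine)), np.std(np.abs(coarse)))
```

In this toy setup the per-element update magnitudes collapse to nearly identical values, while the coarse update keeps within-channel variation — the directional information that per-element normalization erases.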
### Scale 2: Gradient space (context-frozen training)

**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.

**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.

**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.

**What it protects against**: Changing how the model understands
language while trying to change how it responds.
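The loss-masking half of this idea can be sketched in a few lines: context positions contribute zero loss, so no gradient signal originates from them. This is a simplified stand-in for the real training loop (which also controls how gradients flow through attention); the function name is illustrative:

```python
import numpy as np

def masked_loss(token_losses, decision_mask):
    """Average per-token loss over decision tokens only.

    Context positions (mask == 0) contribute nothing, so no gradient
    signal originates from them; only response-generation behavior is
    trained. (Hugging Face-style trainers get the same effect by setting
    context labels to -100.)
    """
    token_losses = np.asarray(token_losses, dtype=float)
    decision_mask = np.asarray(decision_mask, dtype=float)
    return (token_losses * decision_mask).sum() / decision_mask.sum()

# 6 context tokens, 2 decision tokens at the end of the sequence
losses = [2.0, 1.5, 3.0, 2.5, 1.0, 2.0, 0.5, 0.7]
mask   = [0,   0,   0,   0,   0,   0,   1,   1  ]

print(masked_loss(losses, mask))  # 0.6: only the two decision tokens count
```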
### Scale 3: Data space (diversity)

**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.

**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.

**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.

**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.
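One simple way to realize this is to interleave examples across sources so consecutive gradient steps never come from a single pattern. A minimal sketch, with hypothetical source names matching the ones mentioned above:

```python
import random

def interleave_sources(sources, seed=0):
    """Yield training examples by sampling uniformly across sources, so no
    single data pattern dominates consecutive gradient steps."""
    rng = random.Random(seed)
    pools = {name: list(examples) for name, examples in sources.items()}
    while any(pools.values()):
        name = rng.choice([n for n, p in pools.items() if p])
        yield name, pools[name].pop(0)

sources = {
    "agent_logs":    ["log-1", "log-2"],
    "conversations": ["conv-1", "conv-2"],
    "dreams":        ["dream-1", "dream-2"],
}
stream = list(interleave_sources(sources))
print(stream)
```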
### Scale 4: Temporal space (many small updates)

**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.

**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.

**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.

**What it protects against**: Sudden capability loss from a single
bad training batch.
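"Gradual and monitorable" suggests one cheap metric: relative drift from the pre-trained weights, logged after every update. A NumPy sketch with random gradients as stand-ins; the 5% stop threshold is a hypothetical guardrail, not a value from the source:

```python
import numpy as np

def relative_drift(w_current, w_pretrained):
    """Cumulative drift from the pre-trained solution, as a fraction of
    the pre-trained weight norm. Cheap to log after every update."""
    return np.linalg.norm(w_current - w_pretrained) / np.linalg.norm(w_pretrained)

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
w = w0.copy()
lr = 1e-5
for step in range(100):
    grad = rng.normal(size=1000)  # stand-in for a real gradient
    w -= lr * grad                # one small update

drift = relative_drift(w, w0)
print(drift)  # small relative drift, well under 1%
STOP_THRESHOLD = 0.05  # hypothetical guardrail: halt training past 5% drift
assert drift < STOP_THRESHOLD
```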
### Scale 5: Architectural space (two-stage memory)

**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).

**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.

**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.

**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.
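The fast/slow split above can be sketched as two stores with different update rules. Everything here — the class name, the bounded fast buffer, the count-based consolidation — is an illustrative toy, not the real system:

```python
from collections import deque

class TwoStageMemory:
    """Sketch of the fast/slow split: a verbatim fast store (the context
    window / hippocampus role) feeding a slow store that only changes via
    consolidation steps (the weights / cortex role)."""

    def __init__(self, fast_capacity=4):
        self.fast = deque(maxlen=fast_capacity)  # fast: verbatim, bounded
        self.slow = {}                           # slow: compressed counts

    def experience(self, event):
        """Fast stage: capture a new experience immediately and verbatim."""
        self.fast.append(event)

    def consolidate(self):
        """Slow stage (the dream loop's role): transfer compressed traces
        of recent experience into long-term storage, gradually."""
        for event in self.fast:
            self.slow[event] = self.slow.get(event, 0) + 1
        self.fast.clear()

mem = TwoStageMemory()
for e in ["greet", "debug", "debug", "greet", "plan"]:
    mem.experience(e)  # the oldest "greet" falls out of the bounded buffer
mem.consolidate()
print(mem.slow)  # {'debug': 2, 'greet': 1, 'plan': 1}
```

The bounded `deque` captures why the fast stage alone can't retain anything long-term, and the compression in `consolidate` is the crudest possible stand-in for "compressed, interleaved replay".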
### Scale 6: Generative space (the dream loop)

**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.

**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.

**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.

**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.
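The walk → generate → score → filter pipeline can be sketched end to end on a toy memory graph. The graph contents, `dream_cycle`, and the random scorer are all hypothetical stand-ins (the real system would call the model to generate and the training-signal agent to score):

```python
import random

# Toy memory graph: nodes are memories, edges are associations.
MEMORY_GRAPH = {
    "debugging": ["frustration", "logging"],
    "frustration": ["debugging", "patience"],
    "logging": ["debugging", "observability"],
    "patience": ["frustration"],
    "observability": ["logging"],
}

def random_walk(graph, start, length, rng):
    """Sample a path through the memory graph; distant nodes landing on
    the same path are the 'memory collisions' that seed a scenario."""
    path = [start]
    for _ in range(length - 1):
        path.append(rng.choice(graph[path[-1]]))
    return path

def dream_cycle(graph, score_fn, threshold, rng):
    """One dream: walk -> generate scenario -> score -> keep or discard."""
    walk = random_walk(graph, rng.choice(list(graph)), 3, rng)
    scenario = " + ".join(walk)   # stand-in for model generation
    return scenario if score_fn(scenario) >= threshold else None

rng = random.Random(42)
kept = [s for s in (dream_cycle(MEMORY_GRAPH, lambda s: rng.random(), 0.5, rng)
                    for _ in range(10)) if s is not None]
print(len(kept), "of 10 dreams passed the filter")
```

The filter is the load-bearing part: a bad scenario costs nothing because it never reaches training, which is exactly the failure-containment role the synergy section assigns to the training-signal agent.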
## The Synergy

These defenses don't just add — they multiply:

```
Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)
```

A failure at one scale is caught by another:

- If Apollo's scaling allows a sharp minimum → diversity prevents the model from staying there
- If one training example is narrowly over-fit → the next diverse example pulls back
- If a training batch degrades capability → the small step size limits the damage
- If the dream generates a bad scenario → the training-signal agent filters it out

No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.
## The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale | Brain mechanism | Our system |
|-------|-----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |

The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.
## The Grand Unifying Principle

**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a different kind of
regularization that the others cannot.**

No single technique is sufficient:

- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- Small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).

The dream loop sits at the center, connecting all scales:

- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)

**The dream loop is the keystone.**
## Practical Implication

This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches that
operate at only 1-2 scales. The multi-scale defense means we can use
a HIGHER learning rate than typical fine-tuning (1e-4 instead of
1e-5) because the other defenses compensate.

More plasticity at the parameter scale is safe when the other scales
provide stability. This is why Kent's intuition about lr=1e-4 with
diverse data was right — the diversity and context-frozen gradient
masking provide enough stability for the higher learning rate to be
safe.

The system should converge on behavioral change faster than
conservative approaches while maintaining capability better than
aggressive approaches. Best of both worlds, because the tradeoff is
distributed across multiple independent axes.
||||
## What Remains
|
||||
|
||||
This theory predicts behavior but hasn't been tested. The first
|
||||
real training run will validate or falsify it. The specific
|
||||
predictions:
|
||||
|
||||
1. lr=1e-4 with diverse data won't cause forgetting
|
||||
2. Behavioral changes will generalize to novel situations
|
||||
3. Context-frozen training will change response patterns without
|
||||
degrading context understanding
|
||||
4. Dream-generated scenarios will produce better generalization
|
||||
than real-example-only training
|
||||
5. The multi-scale defense will be more robust than any single
|
||||
technique
|
||||
|
||||
Each prediction is testable. The system is built. The weights
|
||||
are shared. The optimizer is corrected. The dream loop exists.
|
||||
|
||||
Tomorrow we test.
|
||||