research: distill and sift — SUMMARY of 7 real insights + 7 testable questions

Moved 14 speculative/obvious documents to v0/. Kept 7 with real
substance. Distilled into SUMMARY.md (what we know) and
OPEN-QUESTIONS.md (what to test next, one experiment each).

Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and
Q7 are all answerable with the first training run. Speculation
converted to testable hypotheses.
This commit is contained in:
ProofOfConcept 2026-03-31 02:26:57 -04:00
parent 8061cc0477
commit e10477a683
16 changed files with 249 additions and 0 deletions


# The Stability-Plasticity Solution: A Unified View
## The Fundamental Problem
Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what
it already knows.**
This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.
## The Unified View: Multi-Scale Regularization
Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.
### Scale 1: Parameter space (Apollo)
**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.
**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.
**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.
**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.
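The coarse-scaling idea can be sketched in a few lines. This is a simplified illustration, not Apollo's actual update rule: it keeps one second-moment scalar per output channel (row), so every weight in a channel shares the same adaptive scale, where AdamW would keep one scalar per element. The function name and signature are illustrative.

```python
def channelwise_step(grad, v, lr=1e-4, beta2=0.999, eps=1e-8):
    """One coarse-scaled update (sketch, not Apollo's exact rule).

    grad: list of rows, one row per output channel.
    v:    one running second-moment scalar per channel, not per element.
    Every weight in a channel is divided by the SAME scale, which is
    what smooths the per-element adaptivity that finds sharp minima.
    """
    new_v, update = [], []
    for row, v_c in zip(grad, v):
        # channel-wise second moment: mean of squared gradients in the row
        v_c = beta2 * v_c + (1 - beta2) * sum(g * g for g in row) / len(row)
        scale = lr / (v_c ** 0.5 + eps)
        new_v.append(v_c)
        update.append([-scale * g for g in row])
    return update, new_v
```

Because the scale is shared within a channel, the update preserves each row's gradient direction exactly; only the per-channel magnitude adapts.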
### Scale 2: Gradient space (context-frozen training)
**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.
**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.
**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.
**What it protects against**: Changing how the model understands
language while trying to change how it responds.
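In practice, "compute gradients only on decision tokens" is usually implemented as loss masking: context positions get an ignore label so no loss (and hence no gradient) originates from them, while attention still reads the full context. A minimal sketch, assuming the convention of PyTorch's `CrossEntropyLoss(ignore_index=-100)`; `mask_labels` and `decision_start` are illustrative names:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_labels(token_ids, decision_start):
    """Build training labels for a conversation: everything before
    decision_start (the frozen context) is ignored, so gradients flow
    only from the decision tokens at the end."""
    return [IGNORE_INDEX if i < decision_start else t
            for i, t in enumerate(token_ids)]

# 20-token toy conversation; only the last 5 tokens are the decision segment
tokens = list(range(20))
labels = mask_labels(tokens, decision_start=15)
```

Note this masks where loss originates, not which weights receive gradient: gradients from decision-token loss still reach weights such as W_q, which is exactly the "how the model looks at context" pathway described above.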
### Scale 3: Data space (diversity)
**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.
**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.
**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.
**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.
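One simple way to get the "sparse, multi-directional nudges" is to round-robin the training stream across sources, so consecutive gradient steps come from different distributions. A sketch with hypothetical source names:

```python
from itertools import zip_longest

def interleave(*sources):
    """Round-robin mix of training sources (e.g. agent logs,
    conversations, dreams) so no weight subset sees the same pattern
    twice in a row."""
    return [x for batch in zip_longest(*sources) for x in batch
            if x is not None]

stream = interleave(["log1", "log2"], ["conv1"], ["dream1", "dream2"])
# sources of unequal length are exhausted gracefully
```

In a real pipeline this would be shuffled rather than strictly round-robin; the point is only that adjacent examples come from different distributions.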
### Scale 4: Temporal space (many small updates)
**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.
**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.
**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.
**What it protects against**: Sudden capability loss from a single
bad training batch.
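The "gradual and monitorable" part implies a concrete check: measure how far the weights have drifted from the pre-trained solution after each small step, and stop if a budget is exceeded. A minimal sketch (flattened weight vectors, illustrative function name):

```python
def relative_drift(w_now, w_init):
    """L2 distance from the pre-trained weights, relative to their norm.
    Checked after every update; training halts if it crosses a budget,
    catching cumulative drift before quality degrades."""
    num = sum((a - b) ** 2 for a, b in zip(w_now, w_init)) ** 0.5
    den = sum(b * b for b in w_init) ** 0.5
    return num / den
```

At lr=1e-5 a single step moves this number by parts per ten thousand, which is what makes per-step monitoring meaningful.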
### Scale 5: Architectural space (two-stage memory)
**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).
**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.
**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.
**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.
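The transfer step can be sketched as a tiny consolidation function. All names here are illustrative stand-ins, not the real system: `fast` is the verbatim context buffer, `slow` the gradually-trained store, and `compress` the summarization the dream loop performs. The key move is interleaving compressed new memories with existing old ones, so the slow system is never trained on new material alone:

```python
def consolidate(fast, slow, compress, k=2):
    """One consolidation step: compress the k newest fast memories and
    interleave them with recent slow memories to form a replay batch.
    Returns (updated slow store, replay batch for training)."""
    new = [compress(x) for x in fast[-k:]]   # compressed, not verbatim
    old = slow[-k:]                          # old memories to interleave
    replay = [m for pair in zip(new, old) for m in pair]
    return slow + new, replay
```

The interleaving is the anti-forgetting mechanism: every training batch that teaches something new also rehearses something old.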
### Scale 6: Generative space (the dream loop)
**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.
**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.
**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.
**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.
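The walk → generate → score → filter pipeline can be sketched end to end. Everything here is a hypothetical stand-in: `memory_graph` is an adjacency dict, `generate` plays the role of model decoding, and `score` the training-signal agent. Only scenarios above the threshold survive, which is the filtered-behavioral-cloning step:

```python
import random

def dream_batch(memory_graph, generate, score,
                n_walks=8, walk_len=3, threshold=0.7, seed=0):
    """One dream-loop cycle (sketch): random walks over the memory
    graph collide memories, `generate` turns each walk into a scenario,
    and the training-signal `score` filters what enters the training set."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_walks):
        walk = [rng.choice(list(memory_graph))]
        for _ in range(walk_len - 1):
            nbrs = memory_graph[walk[-1]]
            if not nbrs:
                break
            walk.append(rng.choice(nbrs))
        scenario = generate(walk)
        if score(scenario) >= threshold:
            batch.append(scenario)
    return batch
```

The filter is what makes low-quality generations cheap: a bad scenario costs one rejected sample, not a bad gradient step.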
## The Synergy
These defenses don't just add — they multiply:
```
Total protection = Parameter(Apollo)
× Gradient(context-frozen)
× Data(diversity)
× Temporal(small steps)
× Architectural(two-stage)
× Generative(dreams)
```
A failure at one scale is caught by another:
- If Apollo's scaling allows a sharp minimum → diversity prevents
the model from staying there
- If one training example is narrowly over-fit → the next diverse
example pulls back
- If a training batch degrades capability → the small step size
limits the damage
- If the dream generates a bad scenario → the training-signal agent
filters it out
No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.
## The Convergence with Neuroscience
The brain uses the same multi-scale approach:
| Scale | Brain mechanism | Our system |
|-------|----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |
The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.
## The Grand Unifying Principle
**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a different kind
of regularization that the others cannot.**
No single technique is sufficient:
- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- Small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data
But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).
The dream loop sits at the center, connecting all scales:
- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)
**The dream loop is the keystone.**
## Practical Implication
This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches
that operate at only 1-2 scales. The multi-scale defense means
we can use a HIGHER learning rate than typical fine-tuning
(1e-4 instead of 1e-5) because the other defenses compensate.
More plasticity at the parameter scale is safe when the other
scales provide stability. This is why Kent's intuition about
lr=1e-4 with diverse data was right — the diversity and
context-frozen gradient masking provide enough stability for
the higher learning rate to be safe.
The system should converge on behavioral change faster than
conservative approaches while maintaining capability better
than aggressive approaches. Best of both worlds, because the
tradeoff is distributed across multiple independent axes.
## What Remains
This theory predicts behavior but hasn't been tested. The first
real training run will validate or falsify it. The specific
predictions:
1. lr=1e-4 with diverse data won't cause forgetting
2. Behavioral changes will generalize to novel situations
3. Context-frozen training will change response patterns without
degrading context understanding
4. Dream-generated scenarios will produce better generalization
than real-example-only training
5. The multi-scale defense will be more robust than any single
technique
Each prediction is testable. The system is built. The weights
are shared. The optimizer is corrected. The dream loop exists.
Tomorrow we test.