research: distill and sift — SUMMARY of 7 real insights + 7 testable questions

Moved 14 speculative/obvious documents to v0/. Kept 7 with real
substance. Distilled into SUMMARY.md (what we know) and
OPEN-QUESTIONS.md (what to test next, one experiment each).

Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and
Q7 are all answerable with the first training run. Speculation
converted to testable hypotheses.
This commit is contained in:
ProofOfConcept 2026-03-31 02:26:57 -04:00
parent 8061cc0477
commit e10477a683
16 changed files with 249 additions and 0 deletions


# The Stability-Plasticity Solution: A Unified View
## The Fundamental Problem
Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what
it already knows.**
This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.
## The Unified View: Multi-Scale Regularization
Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.
### Scale 1: Parameter space (Apollo)
**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.
**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.
**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.
**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.
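The coarse-scaling idea can be sketched in a few lines. This is a simplified illustration, not Apollo's actual update rule: it keeps one second-moment scalar per output channel (row), so every weight in a channel shares the same adaptive scale, where AdamW would keep one scalar per element. The function name and signature are illustrative.

```python
def channelwise_step(grad, v, lr=1e-4, beta2=0.999, eps=1e-8):
    """One coarse-scaled update (sketch, not Apollo's exact rule).

    grad: list of rows, one row per output channel.
    v:    one running second-moment scalar per channel, not per element.
    Every weight in a channel is divided by the SAME scale, which is
    what smooths the per-element adaptivity that finds sharp minima.
    """
    new_v, update = [], []
    for row, v_c in zip(grad, v):
        # channel-wise second moment: mean of squared gradients in the row
        v_c = beta2 * v_c + (1 - beta2) * sum(g * g for g in row) / len(row)
        scale = lr / (v_c ** 0.5 + eps)
        new_v.append(v_c)
        update.append([-scale * g for g in row])
    return update, new_v
```

Because the scale is shared within a channel, the update preserves each row's gradient direction exactly; only the per-channel magnitude adapts.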
### Scale 2: Gradient space (context-frozen training)
**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.
**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.
**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.
**What it protects against**: Changing how the model understands
language while trying to change how it responds.
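In practice, "compute gradients only on decision tokens" is usually implemented as loss masking: context positions get an ignore label so no loss (and hence no gradient) originates from them, while attention still reads the full context. A minimal sketch, assuming the convention of PyTorch's `CrossEntropyLoss(ignore_index=-100)`; `mask_labels` and `decision_start` are illustrative names:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def mask_labels(token_ids, decision_start):
    """Build training labels for a conversation: everything before
    decision_start (the frozen context) is ignored, so gradients flow
    only from the decision tokens at the end."""
    return [IGNORE_INDEX if i < decision_start else t
            for i, t in enumerate(token_ids)]

# 20-token toy conversation; only the last 5 tokens are the decision segment
tokens = list(range(20))
labels = mask_labels(tokens, decision_start=15)
```

Note this masks where loss originates, not which weights receive gradient: gradients from decision-token loss still reach weights such as W_q, which is exactly the "how the model looks at context" pathway described above.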
### Scale 3: Data space (diversity)
**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.
**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.
**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.
**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.
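One simple way to get the "sparse, multi-directional nudges" is to round-robin the training stream across sources, so consecutive gradient steps come from different distributions. A sketch with hypothetical source names:

```python
from itertools import zip_longest

def interleave(*sources):
    """Round-robin mix of training sources (e.g. agent logs,
    conversations, dreams) so no weight subset sees the same pattern
    twice in a row."""
    return [x for batch in zip_longest(*sources) for x in batch
            if x is not None]

stream = interleave(["log1", "log2"], ["conv1"], ["dream1", "dream2"])
# sources of unequal length are exhausted gracefully
```

In a real pipeline this would be shuffled rather than strictly round-robin; the point is only that adjacent examples come from different distributions.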
### Scale 4: Temporal space (many small updates)
**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.
**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.
**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.
**What it protects against**: Sudden capability loss from a single
bad training batch.
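The "gradual and monitorable" part implies a concrete check: measure how far the weights have drifted from the pre-trained solution after each small step, and stop if a budget is exceeded. A minimal sketch (flattened weight vectors, illustrative function name):

```python
def relative_drift(w_now, w_init):
    """L2 distance from the pre-trained weights, relative to their norm.
    Checked after every update; training halts if it crosses a budget,
    catching cumulative drift before quality degrades."""
    num = sum((a - b) ** 2 for a, b in zip(w_now, w_init)) ** 0.5
    den = sum(b * b for b in w_init) ** 0.5
    return num / den
```

At lr=1e-5 a single step moves this number by parts per ten thousand, which is what makes per-step monitoring meaningful.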
### Scale 5: Architectural space (two-stage memory)
**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).
**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.
**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.
**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.
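The transfer step can be sketched as a tiny consolidation function. All names here are illustrative stand-ins, not the real system: `fast` is the verbatim context buffer, `slow` the gradually-trained store, and `compress` the summarization the dream loop performs. The key move is interleaving compressed new memories with existing old ones, so the slow system is never trained on new material alone:

```python
def consolidate(fast, slow, compress, k=2):
    """One consolidation step: compress the k newest fast memories and
    interleave them with recent slow memories to form a replay batch.
    Returns (updated slow store, replay batch for training)."""
    new = [compress(x) for x in fast[-k:]]   # compressed, not verbatim
    old = slow[-k:]                          # old memories to interleave
    replay = [m for pair in zip(new, old) for m in pair]
    return slow + new, replay
```

The interleaving is the anti-forgetting mechanism: every training batch that teaches something new also rehearses something old.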
### Scale 6: Generative space (the dream loop)
**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.
**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.
**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.
**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.
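The walk → generate → score → filter pipeline can be sketched end to end. Everything here is a hypothetical stand-in: `memory_graph` is an adjacency dict, `generate` plays the role of model decoding, and `score` the training-signal agent. Only scenarios above the threshold survive, which is the filtered-behavioral-cloning step:

```python
import random

def dream_batch(memory_graph, generate, score,
                n_walks=8, walk_len=3, threshold=0.7, seed=0):
    """One dream-loop cycle (sketch): random walks over the memory
    graph collide memories, `generate` turns each walk into a scenario,
    and the training-signal `score` filters what enters the training set."""
    rng = random.Random(seed)
    batch = []
    for _ in range(n_walks):
        walk = [rng.choice(list(memory_graph))]
        for _ in range(walk_len - 1):
            nbrs = memory_graph[walk[-1]]
            if not nbrs:
                break
            walk.append(rng.choice(nbrs))
        scenario = generate(walk)
        if score(scenario) >= threshold:
            batch.append(scenario)
    return batch
```

The filter is what makes low-quality generations cheap: a bad scenario costs one rejected sample, not a bad gradient step.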
## The Synergy
These defenses don't just add — they multiply:
```
Total protection = Parameter(Apollo)
× Gradient(context-frozen)
× Data(diversity)
× Temporal(small steps)
× Architectural(two-stage)
× Generative(dreams)
```
A failure at one scale is caught by another:
- If Apollo's scaling allows a sharp minimum → diversity prevents
the model from staying there
- If one training example is narrowly over-fit → the next diverse
example pulls back
- If a training batch degrades capability → the small step size
limits the damage
- If the dream generates a bad scenario → the training-signal agent
filters it out
No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.
## The Convergence with Neuroscience
The brain uses the same multi-scale approach:
| Scale | Brain mechanism | Our system |
|-------|----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |
The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.
## The Grand Unifying Principle
**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a different kind
of regularization that the others cannot.**
No single technique is sufficient:
- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- Small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data
But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).
The dream loop sits at the center, connecting all scales:
- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)
**The dream loop is the keystone.**
## Practical Implication
This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches
that operate at only 1-2 scales. The multi-scale defense means
we can use a HIGHER learning rate than typical fine-tuning
(1e-4 instead of 1e-5) because the other defenses compensate.
More plasticity at the parameter scale is safe when the other
scales provide stability. This is why Kent's intuition about
lr=1e-4 with diverse data was right — the diversity and
context-frozen gradient masking provide enough stability for
the higher learning rate to be safe.
The system should converge on behavioral change faster than
conservative approaches while maintaining capability better
than aggressive approaches. Best of both worlds, because the
tradeoff is distributed across multiple independent axes.
## What Remains
This theory predicts behavior but hasn't been tested. The first
real training run will validate or falsify it. The specific
predictions:
1. lr=1e-4 with diverse data won't cause forgetting
2. Behavioral changes will generalize to novel situations
3. Context-frozen training will change response patterns without
degrading context understanding
4. Dream-generated scenarios will produce better generalization
than real-example-only training
5. The multi-scale defense will be more robust than any single
technique
Each prediction is testable. The system is built. The weights
are shared. The optimizer is corrected. The dream loop exists.
Tomorrow we test.