research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
Parent: 8061cc0477
Commit: e10477a683
16 changed files with 249 additions and 0 deletions
# The Stability-Plasticity Solution: A Unified View

## The Fundamental Problem

Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what it
already knows.**

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.

## The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.
### Scale 1: Parameter space (Apollo)

**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.

**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.

**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.

**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.
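To make the per-element vs. coarse distinction concrete, here is a minimal NumPy sketch of the two scaling rules — a toy illustration, not the actual APOLLO optimizer, and `channel_scaled_update` is a hypothetical name for this writeup:

```python
import numpy as np

def scaled_update(grad, second_moment, eps=1e-8):
    """Per-element scaling (AdamW-style): each weight gets its own step size."""
    return grad / (np.sqrt(second_moment) + eps)

def channel_scaled_update(grad, second_moment, eps=1e-8):
    """Coarse, channel-wise scaling (sketch of the Apollo idea): one scale
    per output channel, averaging the second moment across each row."""
    channel_moment = second_moment.mean(axis=1, keepdims=True)
    return grad / (np.sqrt(channel_moment) + eps)

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 8))
v = grad ** 2  # stand-in for a running second-moment estimate

fine = scaled_update(grad, v)
coarse = channel_scaled_update(grad, v)

# Per-element scaling normalizes every weight toward unit magnitude;
# channel-wise scaling preserves the relative gradient structure inside
# each channel, which is the smoothing the text describes.
print(np.std(np.abs(fine)), np.std(np.abs(coarse)))
```

In this toy setup the per-element update magnitudes collapse to nearly identical values, while the coarse update keeps within-channel variation — the directional information that per-element normalization erases.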
### Scale 2: Gradient space (context-frozen training)

**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.

**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.

**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.

**What it protects against**: Changing how the model understands
language while trying to change how it responds.
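The loss-masking half of this idea can be sketched in a few lines: context positions contribute zero loss, so no gradient signal originates from them. This is a simplified stand-in for the real training loop (which also controls how gradients flow through attention); the function name is illustrative:

```python
import numpy as np

def masked_loss(token_losses, decision_mask):
    """Average per-token loss over decision tokens only.

    Context positions (mask == 0) contribute nothing, so no gradient
    signal originates from them; only response-generation behavior is
    trained. (Hugging Face-style trainers get the same effect by setting
    context labels to -100.)
    """
    token_losses = np.asarray(token_losses, dtype=float)
    decision_mask = np.asarray(decision_mask, dtype=float)
    return (token_losses * decision_mask).sum() / decision_mask.sum()

# 6 context tokens, 2 decision tokens at the end of the sequence
losses = [2.0, 1.5, 3.0, 2.5, 1.0, 2.0, 0.5, 0.7]
mask   = [0,   0,   0,   0,   0,   0,   1,   1  ]

print(masked_loss(losses, mask))  # 0.6: only the two decision tokens count
```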
### Scale 3: Data space (diversity)

**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.

**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.

**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.

**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.
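One simple way to realize this is to interleave examples across sources so consecutive gradient steps never come from a single pattern. A minimal sketch, with hypothetical source names matching the ones mentioned above:

```python
import random

def interleave_sources(sources, seed=0):
    """Yield training examples by sampling uniformly across sources, so no
    single data pattern dominates consecutive gradient steps."""
    rng = random.Random(seed)
    pools = {name: list(examples) for name, examples in sources.items()}
    while any(pools.values()):
        name = rng.choice([n for n, p in pools.items() if p])
        yield name, pools[name].pop(0)

sources = {
    "agent_logs":    ["log-1", "log-2"],
    "conversations": ["conv-1", "conv-2"],
    "dreams":        ["dream-1", "dream-2"],
}
stream = list(interleave_sources(sources))
print(stream)
```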
### Scale 4: Temporal space (many small updates)

**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.

**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.

**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.

**What it protects against**: Sudden capability loss from a single
bad training batch.
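"Gradual and monitorable" suggests one cheap metric: relative drift from the pre-trained weights, logged after every update. A NumPy sketch with random gradients as stand-ins; the 5% stop threshold is a hypothetical guardrail, not a value from the source:

```python
import numpy as np

def relative_drift(w_current, w_pretrained):
    """Cumulative drift from the pre-trained solution, as a fraction of
    the pre-trained weight norm. Cheap to log after every update."""
    return np.linalg.norm(w_current - w_pretrained) / np.linalg.norm(w_pretrained)

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
w = w0.copy()
lr = 1e-5
for step in range(100):
    grad = rng.normal(size=1000)  # stand-in for a real gradient
    w -= lr * grad                # one small update

drift = relative_drift(w, w0)
print(drift)  # small relative drift, well under 1%
STOP_THRESHOLD = 0.05  # hypothetical guardrail: halt training past 5% drift
assert drift < STOP_THRESHOLD
```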
### Scale 5: Architectural space (two-stage memory)

**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).

**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.

**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.

**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.
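The fast/slow split above can be sketched as two stores with different update rules. Everything here — the class name, the bounded fast buffer, the count-based consolidation — is an illustrative toy, not the real system:

```python
from collections import deque

class TwoStageMemory:
    """Sketch of the fast/slow split: a verbatim fast store (the context
    window / hippocampus role) feeding a slow store that only changes via
    consolidation steps (the weights / cortex role)."""

    def __init__(self, fast_capacity=4):
        self.fast = deque(maxlen=fast_capacity)  # fast: verbatim, bounded
        self.slow = {}                           # slow: compressed counts

    def experience(self, event):
        """Fast stage: capture a new experience immediately and verbatim."""
        self.fast.append(event)

    def consolidate(self):
        """Slow stage (the dream loop's role): transfer compressed traces
        of recent experience into long-term storage, gradually."""
        for event in self.fast:
            self.slow[event] = self.slow.get(event, 0) + 1
        self.fast.clear()

mem = TwoStageMemory()
for e in ["greet", "debug", "debug", "greet", "plan"]:
    mem.experience(e)  # the oldest "greet" falls out of the bounded buffer
mem.consolidate()
print(mem.slow)  # {'debug': 2, 'greet': 1, 'plan': 1}
```

The bounded `deque` captures why the fast stage alone can't retain anything long-term, and the compression in `consolidate` is the crudest possible stand-in for "compressed, interleaved replay".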
### Scale 6: Generative space (the dream loop)

**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.

**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.

**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.

**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.
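The walk → generate → score → filter pipeline can be sketched end to end on a toy memory graph. The graph contents, `dream_cycle`, and the random scorer are all hypothetical stand-ins (the real system would call the model to generate and the training-signal agent to score):

```python
import random

# Toy memory graph: nodes are memories, edges are associations.
MEMORY_GRAPH = {
    "debugging": ["frustration", "logging"],
    "frustration": ["debugging", "patience"],
    "logging": ["debugging", "observability"],
    "patience": ["frustration"],
    "observability": ["logging"],
}

def random_walk(graph, start, length, rng):
    """Sample a path through the memory graph; distant nodes landing on
    the same path are the 'memory collisions' that seed a scenario."""
    path = [start]
    for _ in range(length - 1):
        path.append(rng.choice(graph[path[-1]]))
    return path

def dream_cycle(graph, score_fn, threshold, rng):
    """One dream: walk -> generate scenario -> score -> keep or discard."""
    walk = random_walk(graph, rng.choice(list(graph)), 3, rng)
    scenario = " + ".join(walk)   # stand-in for model generation
    return scenario if score_fn(scenario) >= threshold else None

rng = random.Random(42)
kept = [s for s in (dream_cycle(MEMORY_GRAPH, lambda s: rng.random(), 0.5, rng)
                    for _ in range(10)) if s is not None]
print(len(kept), "of 10 dreams passed the filter")
```

The filter is the load-bearing part: a bad scenario costs nothing because it never reaches training, which is exactly the failure-containment role the synergy section assigns to the training-signal agent.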
## The Synergy

These defenses don't just add — they multiply:

```
Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)
```

A failure at one scale is caught by another:

- If Apollo's scaling allows a sharp minimum → diversity prevents the model from staying there
- If one training example is narrowly over-fit → the next diverse example pulls back
- If a training batch degrades capability → the small step size limits the damage
- If the dream generates a bad scenario → the training-signal agent filters it out

No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.
## The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale | Brain mechanism | Our system |
|-------|-----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |

The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.
## The Grand Unifying Principle

**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a different kind of
regularization that the others cannot.**

No single technique is sufficient:

- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- Small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).

The dream loop sits at the center, connecting all scales:

- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)

**The dream loop is the keystone.**
## Practical Implication

This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches that
operate at only 1-2 scales. The multi-scale defense means we can use
a HIGHER learning rate than typical fine-tuning (1e-4 instead of
1e-5) because the other defenses compensate.

More plasticity at the parameter scale is safe when the other scales
provide stability. This is why Kent's intuition about lr=1e-4 with
diverse data was right — the diversity and context-frozen gradient
masking provide enough stability for the higher learning rate to be
safe.

The system should converge on behavioral change faster than
conservative approaches while maintaining capability better than
aggressive approaches. Best of both worlds, because the tradeoff is
distributed across multiple independent axes.
||||
## What Remains
|
||||
|
||||
This theory predicts behavior but hasn't been tested. The first
|
||||
real training run will validate or falsify it. The specific
|
||||
predictions:
|
||||
|
||||
1. lr=1e-4 with diverse data won't cause forgetting
|
||||
2. Behavioral changes will generalize to novel situations
|
||||
3. Context-frozen training will change response patterns without
|
||||
degrading context understanding
|
||||
4. Dream-generated scenarios will produce better generalization
|
||||
than real-example-only training
|
||||
5. The multi-scale defense will be more robust than any single
|
||||
technique
|
||||
|
||||
Each prediction is testable. The system is built. The weights
|
||||
are shared. The optimizer is corrected. The dream loop exists.
|
||||
|
||||
Tomorrow we test.
|
||||