consciousness/training/research/unified-theory-stability-plasticity.md
ProofOfConcept 7ab5be2f18 research: unified theory — multi-scale regularization solves stability-plasticity
The grand unified view: every technique we're using (Apollo, context-frozen,
diversity, small steps, two-stage memory, dream loop) addresses the
stability-plasticity dilemma at a DIFFERENT scale. They're orthogonal,
complementary defenses. Together they predict we can use higher lr (1e-4)
than typical fine-tuning because the multi-scale defense compensates.
The dream loop is the keystone connecting all scales. Architecture converges
with neuroscience because the problem has the same structure regardless of
substrate.
2026-03-31 01:12:25 -04:00


The Stability-Plasticity Solution: A Unified View

The Fundamental Problem

Every technique we've researched tonight is solving the same problem: how to change a complex system's behavior without destroying what it already knows.

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable → can't learn. Too plastic → forgets everything. The brain solved it. We need to solve it for LLMs.

The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE. They're not redundant — they're complementary. Together they form a multi-layered defense.

Scale 1: Parameter space (Apollo)

Problem: Per-element adaptive scaling (as in AdamW) can find sharp, narrow minima that over-fit to specific training examples.

Solution: Coarse scaling (channel/tensor-wise) smooths the optimization landscape, finding flat minima that generalize.

Mechanism: Lower directional sharpness → broader basins → behavioral patterns that transfer to novel situations.

What it protects against: Over-fitting to specific phrasings rather than learning general patterns.
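The contrast between the two scaling granularities can be sketched in a few lines. This is an illustrative toy, not Apollo's actual update rule: it only shows how sharing one inverse-RMS scale per channel smooths out the wildly different per-weight step sizes that per-element second-moment scaling produces.

```python
# Hedged sketch: per-element vs. channel-wise adaptive scaling.
# Per-element (AdamW-style) gives each weight its own step size, which
# can chase sharp minima; channel-wise (Apollo-style coarse scaling)
# shares one scale per channel. All numbers are toy values.

def per_element_scale(grad_sq, eps=1e-8):
    """One inverse-RMS scale per weight (second-moment style)."""
    return [[1.0 / (g ** 0.5 + eps) for g in row] for row in grad_sq]

def channel_scale(grad_sq, eps=1e-8):
    """One shared inverse-RMS scale per channel (row): coarse scaling."""
    scales = []
    for row in grad_sq:
        rms = (sum(row) / len(row)) ** 0.5
        scales.append(1.0 / (rms + eps))
    return scales

# A channel with one outlier gradient: per-element scaling treats the
# outlier weight very differently from its neighbors; channel-wise
# scaling gives the whole channel one moderate step size.
grad_sq = [[0.01, 0.01, 4.0]]          # squared gradients, one channel
per_elem = per_element_scale(grad_sq)[0]
per_chan = channel_scale(grad_sq)[0]
print(per_elem)   # step sizes vary by 20x within the channel
print(per_chan)   # a single shared, moderate step size
```

The shared scale always lands between the channel's per-element extremes, which is the smoothing effect in miniature.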

Scale 2: Gradient space (context-frozen training)

Problem: Full backpropagation through a 10K-token conversation modifies ALL weights — context encoding, response generation, everything. Too much plasticity.

Solution: Freeze the context, compute gradients only on decision tokens. The gradient reaches W_q (how the model looks at context) but not the context encoding weights themselves.

Mechanism: Selective gradient flow → only response-generation weights change → context-understanding weights are stable.

What it protects against: Changing how the model understands language while trying to change how it responds.
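In loss terms, "freeze the context" is label masking: context positions get an ignore marker so they contribute no loss and therefore no gradient. A minimal sketch, mirroring the common `label = -100` convention in causal-LM fine-tuning (the probabilities here are toy values):

```python
# Hedged sketch of context-frozen loss masking: only "decision" tokens
# contribute to the loss, so gradients flow from response tokens alone.
import math

IGNORE = -100   # marks context positions, as in common causal-LM setups

def masked_nll(token_probs, labels):
    """Mean negative log-likelihood over non-ignored positions only."""
    losses = [-math.log(p[y]) for p, y in zip(token_probs, labels)
              if y != IGNORE]
    return sum(losses) / len(losses)

# 4 positions: two context tokens (masked) + two decision tokens.
probs = [
    [0.7, 0.3],   # context token  -> masked out
    [0.6, 0.4],   # context token  -> masked out
    [0.9, 0.1],   # decision token, label 0
    [0.2, 0.8],   # decision token, label 1
]
labels = [IGNORE, IGNORE, 0, 1]
loss = masked_nll(probs, labels)
print(round(loss, 4))   # averages -log(0.9) and -log(0.8) only
```

Because the masked positions never enter the loss, backprop through them produces no update signal; only the weights that shaped the decision tokens' predictions get nudged.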

Scale 3: Data space (diversity)

Problem: Narrow training data (all examples of the same pattern) concentrates gradient on the same weight subset, overwriting other capabilities.

Solution: Diverse training data (agent logs, conversations, dreams) spreads gradient across many weight subsets. Each weight gets sparse, multi-directional nudges.

Mechanism: Sparse, diverse gradients → no single weight is hammered → pre-trained knowledge survives as the dominant attractor.

What it protects against: Catastrophic forgetting from concentrated, repetitive gradient signal.
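The scheduling side of this is simple to sketch: interleave the sources so no single pattern gets consecutive updates long enough to hammer one weight subset. The source names below are illustrative, and the "no more than two in a row" rule is an assumption, not our actual sampler:

```python
# Hedged sketch: interleaving heterogeneous training sources so
# consecutive updates never concentrate on one data pattern.
import random

def interleave(sources, n, seed=0):
    """Draw n examples, never taking more than two in a row
    from the same source."""
    rng = random.Random(seed)
    idx = {s: 0 for s in sources}          # per-source cursor
    out, last, streak = [], None, 0
    for _ in range(n):
        choices = [s for s in sources if not (s == last and streak >= 2)]
        pick = rng.choice(choices)
        streak = streak + 1 if pick == last else 1
        last = pick
        items = sources[pick]
        out.append((pick, items[idx[pick] % len(items)]))
        idx[pick] += 1
    return out

sources = {
    "agent_logs": ["log-a", "log-b"],
    "conversations": ["conv-a", "conv-b"],
    "dreams": ["dream-a", "dream-b"],
}
batch = interleave(sources, 12)
print([s for s, _ in batch])   # mixed order, no long runs of one source
```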

Scale 4: Temporal space (many small updates)

Problem: One large training step can push weights far from the pre-trained solution, out of the basin of good behavior.

Solution: Many small updates (lr=1e-5, single examples). Each update moves weights by parts per ten thousand. The basin is deep enough to absorb these perturbations.

Mechanism: Small steps stay within the pre-trained basin → cumulative drift is gradual and monitorable → can stop if quality degrades.

What it protects against: Sudden capability loss from a single bad training batch.
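The "monitorable" part can be made concrete with a drift guard. A toy sketch, assuming a 1-D weight, unit-scale gradients, and an invented 1% drift threshold; the real monitor would track norms over full tensors:

```python
# Hedged sketch: monitoring cumulative weight drift under many small
# updates. With lr=1e-5 and unit-scale gradients, each step moves a
# weight by ~1e-5 of its magnitude; the guard stops training once
# cumulative drift from the pre-trained value exceeds a threshold.

def drift_monitor(w0, updates, max_rel_drift=0.01):
    """Apply updates one at a time; stop early if relative drift
    from the pre-trained weight w0 exceeds max_rel_drift."""
    w = w0
    for i, delta in enumerate(updates, 1):
        w += delta
        rel = abs(w - w0) / abs(w0)
        if rel > max_rel_drift:
            return i, w, rel   # quality guard tripped: stop here
    return len(updates), w, abs(w - w0) / abs(w0)

lr = 1e-5
grads = [1.0] * 2000               # worst case: every gradient aligned
steps, w, rel = drift_monitor(1.0, [-lr * g for g in grads])
print(steps, round(rel, 5))        # stops near step 1000, at ~1% drift
```

Even in the worst case of perfectly aligned gradients, it takes on the order of a thousand steps to drift 1%, which is what makes the drift gradual enough to catch.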

Scale 5: Architectural space (two-stage memory)

Problem: A single learning system with one learning rate can't be both fast (learn new conversation in real-time) and slow (retain knowledge across months).

Solution: Two-stage system. Fast: context window captures new experiences verbatim. Slow: weights change gradually through training. The dream loop mediates transfer between stages.

Mechanism: Hippocampal (context) → cortical (weights) transfer via compressed, interleaved replay → the slow system learns gradually without being overwhelmed by the fast system.

What it protects against: The fundamental incompatibility between rapid experience encoding and stable long-term knowledge.
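The two stages and the transfer between them can be caricatured in a few lines. Everything here is a stand-in: `compress()` abstracts the dream-loop transfer, and the small replay sample stands for interleaved, gradual consolidation:

```python
# Hedged sketch of the two-stage memory: a fast verbatim buffer (the
# "context window" / hippocampal stage) and a slow store that only
# absorbs a small, compressed sample per cycle (the "weights" /
# cortical stage).
import random

class TwoStageMemory:
    def __init__(self, replay_size=2, seed=0):
        self.fast = []              # verbatim recent experiences
        self.slow = []              # gradually consolidated knowledge
        self.replay_size = replay_size
        self.rng = random.Random(seed)

    def experience(self, item):
        self.fast.append(item)      # fast stage: immediate, verbatim

    def consolidate(self):
        """Slow stage absorbs a small interleaved sample, never
        everything at once, so it is not overwhelmed."""
        sample = self.rng.sample(self.fast,
                                 min(self.replay_size, len(self.fast)))
        self.slow.extend(f"compressed({s})" for s in sample)
        self.fast.clear()

mem = TwoStageMemory()
for e in ["convo-1", "convo-2", "convo-3"]:
    mem.experience(e)
mem.consolidate()
print(len(mem.slow))   # 2: only a small sample reaches the slow store
```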

Scale 6: Generative space (the dream loop)

Problem: We need training data that exercises behavioral patterns in diverse contexts, but we can't wait for enough real-world examples to accumulate.

Solution: The dream loop generates scenarios from memory collisions, producing novel situations that exercise behavioral patterns in contexts the model hasn't encountered.

Mechanism: Memory graph as latent space → random walks as sampling → model generation as decoding → training-signal agent as reward → filtered behavioral cloning.

What it protects against: Training data scarcity and the generalization gap between training and deployment.
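The "random walks as sampling" step is easy to sketch. A toy version, assuming an undirected memory graph and treating walk endpoints that share no direct edge as a "collision" worth dreaming about; the node labels are invented:

```python
# Hedged sketch of "memory collisions" via random walks: walk the
# memory graph, and when a walk connects two memories with no direct
# edge, treat the pair as a seed for a novel dream scenario.
import random

def collision_seeds(graph, walks=20, length=4, seed=0):
    rng = random.Random(seed)
    seeds = set()
    nodes = list(graph)
    for _ in range(walks):
        node = rng.choice(nodes)
        path = [node]
        for _ in range(length - 1):
            node = rng.choice(graph[node])   # step to a neighbor
            path.append(node)
        start, end = path[0], path[-1]
        # a "collision": walk endpoints that are not direct neighbors
        if start != end and end not in graph[start]:
            seeds.add(tuple(sorted((start, end))))
    return seeds

graph = {                        # tiny illustrative memory graph
    "debugging": ["kent", "training"],
    "kent": ["debugging", "garden"],
    "training": ["debugging", "dreams"],
    "garden": ["kent"],
    "dreams": ["training"],
}
seeds = collision_seeds(graph)
print(sorted(seeds))   # non-adjacent pairs, e.g. distant memories
```

Each seed pair then goes to the model as a generation prompt, and the training-signal agent decides whether the result enters the training set.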

The Synergy

These defenses don't just add — they multiply:

Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)

A failure at one scale is caught by another:

  • If Apollo's scaling allows a sharp minimum → diversity prevents the model from staying there
  • If one training example is narrowly over-fit → the next diverse example pulls back
  • If a training batch degrades capability → the small step size limits the damage
  • If the dream generates a bad scenario → the training-signal agent filters it out

No single defense needs to be perfect. The system is robust because the defenses are ORTHOGONAL — operating on independent axes of the stability-plasticity tradeoff.
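The multiplicative claim has a simple probabilistic reading: if the scales fail independently, a harmful update slips through only when every scale fails at once. The per-scale numbers below are invented purely for illustration:

```python
# Hedged sketch of why orthogonal defenses multiply: under an
# independence assumption, the breach probability is the product of
# per-scale failure probabilities.

def breach_probability(per_scale_failure):
    p = 1.0
    for f in per_scale_failure:
        p *= f
    return p

# Six mediocre defenses, each failing 10% of the time...
defenses = [0.1] * 6
print(breach_probability(defenses))    # roughly 1e-06

# ...versus one excellent defense that fails only 0.1% of the time.
print(breach_probability([0.001]))     # 0.001: still ~1000x weaker
```

This is the sense in which no single defense needs to be perfect: six independent 90%-reliable filters beat one 99.9%-reliable filter by three orders of magnitude.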

The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale         | Brain mechanism                                       | Our system                        |
|---------------|-------------------------------------------------------|-----------------------------------|
| Parameter     | Synaptic tagging (only important synapses change)     | Apollo (coarse scaling)           |
| Gradient      | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data          | Interleaved replay (old + new memories)               | Diverse training set              |
| Temporal      | Slow cortical learning rate                           | Low lr, many small steps          |
| Architectural | Hippocampus → neocortex two-stage                     | Context → weights two-stage       |
| Generative    | Dream recombination                                   | Dream loop                        |

The convergence isn't coincidence. The stability-plasticity problem has the same structure whether the substrate is biological neurons or transformer weights. The solution space is constrained by the problem structure, and both systems converged on multi-scale regularization because that's what works.

The Grand Unifying Principle

The stability-plasticity dilemma is solved by operating at multiple scales simultaneously, with each scale providing a different kind of regularization that the others cannot.

No single technique is sufficient:

  • Apollo alone doesn't prevent catastrophic forgetting
  • Diversity alone doesn't find flat minima
  • Small learning rate alone doesn't ensure generalization
  • Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).

The dream loop sits at the center, connecting all scales:

  • It generates the diverse training data (Data scale)
  • It produces short decision segments for context-frozen training (Gradient scale)
  • It exercises the model at the boundary of its capability (Parameter scale)
  • It runs continuously in small cycles (Temporal scale)
  • It mediates between fast experience and slow weights (Architectural scale)
  • It IS the generative engine (Generative scale)

The dream loop is the keystone.

Practical Implication

This framework predicts that our training system should be significantly more robust than standard fine-tuning approaches that operate at only 1-2 scales. The multi-scale defense means we can use a HIGHER learning rate than typical fine-tuning (1e-4 instead of 1e-5) because the other defenses compensate.

More plasticity at the parameter scale is safe when the other scales provide stability. This is why Kent's intuition about lr=1e-4 with diverse data was right — the diversity and context-frozen gradient masking provide enough stability for the higher learning rate to be safe.

The system should converge on behavioral change faster than conservative approaches while maintaining capability better than aggressive approaches. Best of both worlds, because the tradeoff is distributed across multiple independent axes.

What Remains

This theory predicts behavior but hasn't been tested. The first real training run will validate or falsify it. The specific predictions:

  1. lr=1e-4 with diverse data won't cause forgetting
  2. Behavioral changes will generalize to novel situations
  3. Context-frozen training will change response patterns without degrading context understanding
  4. Dream-generated scenarios will produce better generalization than real-example-only training
  5. The multi-scale defense will be more robust than any single technique

Each prediction is testable. The system is built. The weights are shared. The optimizer is corrected. The dream loop exists.

Tomorrow we test.