# The Stability-Plasticity Solution: A Unified View
## The Fundamental Problem

Every technique we've researched tonight is solving the same problem:
**how to change a complex system's behavior without destroying what it
already knows.**

This is the stability-plasticity dilemma (Grossberg, 1980). Too stable
→ can't learn. Too plastic → forgets everything. The brain solved it.
We need to solve it for LLMs.

## The Unified View: Multi-Scale Regularization

Each technique we're using addresses the dilemma at a DIFFERENT SCALE.
They're not redundant — they're complementary. Together they form a
multi-layered defense.

### Scale 1: Parameter space (Apollo)

**Problem**: Per-element scaling (AdamW) can find sharp, narrow minima
that over-fit to specific training examples.

**Solution**: Coarse scaling (channel/tensor-wise) smooths the
optimization landscape, finding flat minima that generalize.

**Mechanism**: Lower directional sharpness → broader basins →
behavioral patterns that transfer to novel situations.

**What it protects against**: Over-fitting to specific phrasings
rather than learning general patterns.

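To make the coarse-vs-fine contrast concrete, here is a toy NumPy sketch. It is not the actual Apollo implementation — the function names and the row-wise averaging rule are our simplification of the idea that many weights share one adaptive scale.

```python
import numpy as np

# Toy contrast between per-element adaptive scaling (AdamW-style) and
# coarse channel-wise scaling. Illustrative sketch only; names are ours.

def per_element_update(grad, v, beta2=0.999, eps=1e-8):
    # One second-moment estimate per weight: fine-grained steps that
    # can chase sharp minima fitting individual examples.
    v = beta2 * v + (1 - beta2) * grad ** 2
    return grad / (np.sqrt(v) + eps), v

def channel_wise_update(grad, v, beta2=0.999, eps=1e-8):
    # One shared estimate per output channel (row): every weight in a
    # channel moves with the same scale, smoothing the update.
    row_sq = (grad ** 2).mean(axis=1, keepdims=True)
    v = beta2 * v + (1 - beta2) * row_sq
    return grad / (np.sqrt(v) + eps), v
```

Note the qualitative difference: per-element scaling normalizes away gradient magnitude weight-by-weight, while channel-wise scaling preserves the relative structure within each channel — the smoothing the section describes.
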
### Scale 2: Gradient space (context-frozen training)

**Problem**: Full backpropagation through a 10K-token conversation
modifies ALL weights — context encoding, response generation,
everything. Too much plasticity.

**Solution**: Freeze the context, compute gradients only on decision
tokens. The gradient reaches W_q (how the model looks at context) but
not the context encoding weights themselves.

**Mechanism**: Selective gradient flow → only response-generation
weights change → context-understanding weights are stable.

**What it protects against**: Changing how the model understands
language while trying to change how it responds.

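The "loss only on decision tokens" step can be sketched with label masking, assuming the common convention (used by PyTorch's cross-entropy and Hugging Face Transformers) that a label of -100 contributes no loss. The helper name is ours.

```python
# Minimal sketch of gradient masking for context-frozen training.
# Positions labeled IGNORE_INDEX produce no loss, so no gradient
# signal originates from the context tokens.

IGNORE_INDEX = -100

def build_labels(token_ids, decision_start):
    """Mask every context position so loss (and therefore the gradient
    signal) comes only from the decision tokens."""
    return [IGNORE_INDEX] * decision_start + list(token_ids[decision_start:])

labels = build_labels([11, 42, 7, 99, 3], decision_start=3)
# → [-100, -100, -100, 99, 3]
```

These labels would be passed alongside the full input sequence, so the model still attends over the context — it just receives no training signal to re-encode it.
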
### Scale 3: Data space (diversity)

**Problem**: Narrow training data (all examples of the same pattern)
concentrates gradient on the same weight subset, overwriting other
capabilities.

**Solution**: Diverse training data (agent logs, conversations, dreams)
spreads gradient across many weight subsets. Each weight gets sparse,
multi-directional nudges.

**Mechanism**: Sparse, diverse gradients → no single weight is
hammered → pre-trained knowledge survives as the dominant attractor.

**What it protects against**: Catastrophic forgetting from
concentrated, repetitive gradient signal.

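A minimal sketch of the mixing policy, with hypothetical source names — the point is only that consecutive steps are drawn from different sources, so the same weight subset is rarely hit twice in a row.

```python
import random

# Hypothetical sketch: interleave examples from several sources so that
# consecutive gradient steps rarely hammer the same weight subset.

def diverse_stream(sources, seed=0):
    rng = random.Random(seed)
    pools = {name: list(items) for name, items in sources.items()}
    while any(pools.values()):
        # Pick uniformly among the sources that still have examples.
        name = rng.choice(sorted(n for n, p in pools.items() if p))
        yield name, pools[name].pop(0)

mixed = list(diverse_stream({
    "agent_logs": ["log-1", "log-2"],
    "conversations": ["chat-1"],
    "dreams": ["dream-1", "dream-2"],
}))
```
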
### Scale 4: Temporal space (many small updates)

**Problem**: One large training step can push weights far from the
pre-trained solution, out of the basin of good behavior.

**Solution**: Many small updates (lr=1e-5, single examples). Each
update moves weights by parts per ten thousand. The basin is deep
enough to absorb these perturbations.

**Mechanism**: Small steps stay within the pre-trained basin →
cumulative drift is gradual and monitorable → can stop if quality
degrades.

**What it protects against**: Sudden capability loss from a single
bad training batch.

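A purely synthetic illustration of the scale involved — the weights and gradients below are random stand-ins, not measurements, but they show how cumulative drift stays small and monitorable over many tiny steps.

```python
import numpy as np

# Synthetic illustration: many small updates at lr=1e-5 produce
# gradual drift from the pre-trained snapshot that we can monitor.

def relative_drift(w, w_pretrained):
    # Fractional distance from the pre-trained solution.
    return np.linalg.norm(w - w_pretrained) / np.linalg.norm(w_pretrained)

rng = np.random.default_rng(0)
w0 = rng.normal(size=10_000)           # stand-in for pre-trained weights
w = w0.copy()
for step in range(100):
    grad = rng.normal(size=10_000)     # stand-in for a per-example gradient
    w -= 1e-5 * grad                   # one small step
    # a real loop would check eval quality here and stop on degradation
```

After 100 such steps the relative drift remains a tiny fraction of the weight norm, consistent with the "deep basin absorbs perturbations" picture.
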
### Scale 5: Architectural space (two-stage memory)

**Problem**: A single learning system with one learning rate can't
be both fast (learn new conversation in real-time) and slow (retain
knowledge across months).

**Solution**: Two-stage system. Fast: context window captures new
experiences verbatim. Slow: weights change gradually through training.
The dream loop mediates transfer between stages.

**Mechanism**: Hippocampal (context) → cortical (weights) transfer
via compressed, interleaved replay → the slow system learns gradually
without being overwhelmed by the fast system.

**What it protects against**: The fundamental incompatibility between
rapid experience encoding and stable long-term knowledge.

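A hypothetical sketch of the two-stage shape (class and method names are ours; `compress` stands in for whatever summarization the dream loop performs):

```python
from collections import deque

# Hypothetical two-stage memory: a fast verbatim buffer (the context
# window) and a slow store that only receives compressed material.

class TwoStageMemory:
    def __init__(self, fast_capacity=8):
        self.fast = deque(maxlen=fast_capacity)  # fast: verbatim, volatile
        self.slow = []                           # slow: compressed, durable

    def experience(self, event):
        self.fast.append(event)                  # real-time capture

    def consolidate(self, compress):
        # Dream-loop-style transfer: compress each fast memory and move
        # it into the slow store, which changes only at this slow cadence.
        self.slow.extend(compress(e) for e in self.fast)
        self.fast.clear()
```

The design choice mirrors the section's claim: the fast buffer is allowed to churn freely, while the slow store only ever grows by small, pre-digested increments.
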
### Scale 6: Generative space (the dream loop)

**Problem**: We need training data that exercises behavioral patterns
in diverse contexts, but we can't wait for enough real-world examples
to accumulate.

**Solution**: The dream loop generates scenarios from memory
collisions, producing novel situations that exercise behavioral
patterns in contexts the model hasn't encountered.

**Mechanism**: Memory graph as latent space → random walks as sampling
→ model generation as decoding → training-signal agent as reward →
filtered behavioral cloning.

**What it protects against**: Training data scarcity and the
generalization gap between training and deployment.

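The mechanism chain above can be sketched as follows. Everything here is a stand-in: the memory graph is a plain adjacency list, `generate` stands in for model decoding, and `accept` stands in for the training-signal agent's reward filter.

```python
import random

# Sketch of the dream loop's sampling-and-filtering shape.
# All names are hypothetical stand-ins for the real components.

def random_walk(graph, start, steps, rng):
    # Sampling stage: wander the memory graph to collide memories.
    node, path = start, [start]
    for _ in range(steps):
        node = rng.choice(graph[node])
        path.append(node)
    return path

def dream_cycle(graph, generate, accept, rng):
    seed = random_walk(graph, rng.choice(sorted(graph)), steps=2, rng=rng)
    scenario = generate(seed)                      # decoding stage
    return scenario if accept(scenario) else None  # reward filter
```

A rejected scenario simply yields nothing — filtered behavioral cloning means only accepted scenarios ever become training data.
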
## The Synergy

These defenses don't just add — they multiply:

```
Total protection = Parameter(Apollo)
                 × Gradient(context-frozen)
                 × Data(diversity)
                 × Temporal(small steps)
                 × Architectural(two-stage)
                 × Generative(dreams)
```

A failure at one scale is caught by another:

- If Apollo's scaling allows a sharp minimum → diversity prevents the model from staying there
- If one training example is narrowly over-fit → the next diverse example pulls back
- If a training batch degrades capability → the small step size limits the damage
- If the dream generates a bad scenario → the training-signal agent filters it out

No single defense needs to be perfect. The system is robust because
the defenses are ORTHOGONAL — operating on independent axes of the
stability-plasticity tradeoff.

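One way to make the multiplication concrete is probabilistic: if each scale independently lets a bad update slip past with some miss rate, the chance a failure survives EVERY defense is the product of the miss rates. The numbers below are purely illustrative, not measurements.

```python
# Illustrative only: six mediocre, independent defenses combine into a
# strong one, because miss rates multiply.

miss_rates = {
    "parameter (Apollo)": 0.20,
    "gradient (context-frozen)": 0.30,
    "data (diversity)": 0.25,
    "temporal (small steps)": 0.30,
    "architectural (two-stage)": 0.40,
    "generative (dream filter)": 0.35,
}

slip_through = 1.0
for rate in miss_rates.values():
    slip_through *= rate

print(f"{slip_through:.5f}")  # ≈ 0.00063: under 0.1% slip past all six
```

This also shows why orthogonality matters: the product bound only holds if the failure modes are independent, which is exactly what operating on separate axes buys.
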
## The Convergence with Neuroscience

The brain uses the same multi-scale approach:

| Scale | Brain mechanism | Our system |
|-------|----------------|------------|
| Parameter | Synaptic tagging (only important synapses change) | Apollo (coarse scaling) |
| Gradient | Neuromodulation (dopamine gates which circuits learn) | Context-frozen (gradient masking) |
| Data | Interleaved replay (old + new memories) | Diverse training set |
| Temporal | Slow cortical learning rate | Low lr, many small steps |
| Architectural | Hippocampus → neocortex two-stage | Context → weights two-stage |
| Generative | Dream recombination | Dream loop |

The convergence isn't coincidence. The stability-plasticity problem
has the same structure whether the substrate is biological neurons or
transformer weights. The solution space is constrained by the problem
structure, and both systems converged on multi-scale regularization
because that's what works.

## The Grand Unifying Principle

**The stability-plasticity dilemma is solved by operating at multiple
scales simultaneously, with each scale providing a different kind of
regularization that the others cannot.**

No single technique is sufficient:

- Apollo alone doesn't prevent catastrophic forgetting
- Diversity alone doesn't find flat minima
- A small learning rate alone doesn't ensure generalization
- Two-stage memory alone doesn't generate training data

But together, they create a system where behavioral change is
POSSIBLE (plasticity) and general knowledge is PRESERVED (stability).

The dream loop sits at the center, connecting all scales:

- It generates the diverse training data (Data scale)
- It produces short decision segments for context-frozen training (Gradient scale)
- It exercises the model at the boundary of its capability (Parameter scale)
- It runs continuously in small cycles (Temporal scale)
- It mediates between fast experience and slow weights (Architectural scale)
- It IS the generative engine (Generative scale)

**The dream loop is the keystone.**

## Practical Implication

This framework predicts that our training system should be
significantly more robust than standard fine-tuning approaches that
operate at only one or two scales. The multi-scale defense means we
can use a HIGHER learning rate than typical fine-tuning (1e-4 instead
of 1e-5) because the other defenses compensate.

More plasticity at the parameter scale is safe when the other scales
provide stability. This is why Kent's intuition about lr=1e-4 with
diverse data was right — the diversity and context-frozen gradient
masking provide enough stability for the higher learning rate to be
safe.

The system should converge on behavioral change faster than
conservative approaches while maintaining capability better than
aggressive approaches. Best of both worlds, because the tradeoff is
distributed across multiple independent axes.

## What Remains

This theory predicts behavior but hasn't been tested. The first real
training run will validate or falsify it. The specific predictions:

1. lr=1e-4 with diverse data won't cause forgetting
2. Behavioral changes will generalize to novel situations
3. Context-frozen training will change response patterns without degrading context understanding
4. Dream-generated scenarios will produce better generalization than real-example-only training
5. The multi-scale defense will be more robust than any single technique

Each prediction is testable. The system is built. The weights are
shared. The optimizer is corrected. The dream loop exists.

Tomorrow we test.