research: dreaming as diffusion + hippocampal replay parallel

Two more deep dives:

- Dreaming as diffusion: the dream loop IS a generative process. Memory graph as latent space, temperature as noise level, training as denoising. Connects to policy gradient / filtered behavioral cloning. The dream loop generates scenarios at the edge of the model's capability — the boundary where learning happens.
- Hippocampal replay: our architecture converges with the brain's two-stage memory system. Fast learning (context window) → slow learning (weights) via compressed replay (context-frozen training), with emotional prioritization (training-signal agent) and interleaved replay (diverse training data prevents forgetting). We didn't design from neuroscience — we converged on it.
This commit is contained in: parent e34d6b5aef, commit 42b9390d49.
2 changed files with 395 additions and 0 deletions.

211  training/research/dreaming-as-diffusion.md  Normal file
# Dreaming as Diffusion: The Mathematical Connection

## Image Diffusion in 30 Seconds

A diffusion model generates images by:

1. Starting from pure noise
2. Iteratively denoising, guided by a learned score function
3. Each step follows the gradient of log p(x) toward higher probability
4. Conditioning (text prompt) steers the denoising toward desired outputs

The key mathematical object is the **score function**: ∇_x log p(x).
This tells you "which direction makes the data more probable." The
denoiser learns to estimate this score at each noise level.
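The steps above can be sketched end to end. A minimal toy, assuming a 1-D Gaussian target whose score (μ − x)/σ² is known in closed form; a real diffusion model replaces this analytic score with a learned denoiser estimated per noise level:

```python
import math
import random

# Toy setup: the data distribution is N(MU, SIGMA^2), so the score
# grad_x log p(x) = (MU - x) / SIGMA^2 is known exactly. This is the
# quantity a trained denoiser approximates.
MU, SIGMA = 3.0, 1.0

def score(x: float) -> float:
    return (MU - x) / SIGMA**2

def denoise(x0: float, steps: int = 500, step_size: float = 0.01) -> float:
    """Langevin-style update: x += step_size * score(x) + noise."""
    x = x0
    for _ in range(steps):
        noise = math.sqrt(2 * step_size) * random.gauss(0.0, 1.0)
        x = x + step_size * score(x) + noise
    return x

random.seed(0)
# Start from wide "pure noise" initializations and denoise each one.
samples = [denoise(random.gauss(0.0, 5.0)) for _ in range(200)]
mean = sum(samples) / len(samples)
print(round(mean, 1))  # lands near MU = 3.0
```

Each iterate climbs the log-probability gradient plus injected noise, so the samples concentrate around the high-probability region regardless of where they started.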
## The Dream Loop as Diffusion

Our dream loop:

1. Starts from random memory collisions (the "noise")
2. Generates coherent scenarios from these seeds
3. Follows the model's own probability distribution at each generation
4. Uses the behavioral patterns we want to train as "conditioning"

### The formal parallel

**Image diffusion:**

```
x_{t-1} = x_t + step_size · score(x_t, t) + noise
```

where score(x_t, t) ≈ ∇_x log p(x_t | text_prompt)

**Dream loop:**

```
scenario = model.generate(memory_seeds, temperature)
```

The model's generation IS the score function — it produces outputs
that are probable under its current distribution. The memory seeds
are the conditioning signal. The temperature controls the "noise level."
### Where they differ

In diffusion, the score function is FIXED (a pre-trained denoiser).
The iteration refines a SINGLE output from noise to clean.

In our dream loop, the score function CHANGES (we're training the
model). Each dream-then-train cycle updates the model's distribution.
The next dream follows the UPDATED distribution. Over many cycles,
the distribution shifts.

**This is closer to score-matching training than to inference.**
We're not generating one image; we're training the score function
itself through generated samples.
## The Policy Gradient Connection

This connects to reinforcement learning. The dream loop is:

1. **Generate rollout**: the model produces a scenario from memory seeds
2. **Evaluate**: the training-signal agent assesses the response quality
3. **Update policy**: train on good responses, adjusting the distribution

This is **REINFORCE** (Williams, 1992) with self-generated trajectories:

```
∇_θ J(θ) = E[R(τ) · ∇_θ log π_θ(τ)]
```

where τ is a trajectory (the dream scenario + response), R(τ) is the
reward (did the response demonstrate the desired behavior?), and π_θ
is the policy (the model's generation distribution).

But we're not using the REINFORCE estimator directly — we're using
supervised learning on the good responses. This is closer to:
### Filtered Behavioral Cloning

1. Generate many trajectories (dreams)
2. Filter for the good ones (training-signal agent)
3. Train supervised on the good ones (Apollo gradient step)

This is more sample-efficient than REINFORCE because we use the full
supervised signal, not just the scalar reward. And it's more stable
because we're not estimating policy gradients.
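A toy sketch of the generate/filter/clone loop, using a one-parameter stand-in "policy" rather than the actual model; all names here are illustrative, not the real Apollo pipeline:

```python
import random

random.seed(0)

# Toy policy: a single Bernoulli parameter p = P(model "listens")
# instead of suggesting alternatives.
p = 0.3  # initial tendency: listens only 30% of the time

def generate_dream(p: float) -> int:
    """One self-generated trajectory: 1 = listened, 0 = old pattern."""
    return 1 if random.random() < p else 0

def evaluate(traj: int) -> bool:
    """Stand-in for the training-signal agent: keep listening trajectories."""
    return traj == 1

for cycle in range(20):
    dreams = [generate_dream(p) for _ in range(64)]   # 1. generate
    good = [d for d in dreams if evaluate(d)]          # 2. filter
    if good:                                           # 3. clone
        target = sum(good) / len(good)                 # always 1.0 here
        p += 0.2 * (target - p)                        # supervised step

print(p > 0.9)  # prints True: filtered cloning pushes p toward 1
```

Note that only kept trajectories contribute to the update, which is why the full supervised signal (the trajectory itself) is used rather than a scalar reward.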
## The Denoising Analogy for Behavioral Training

Think of the model's behavioral dispositions as a noisy image:

- **High noise**: the model has conflicting tendencies (sometimes
  listens, sometimes suggests alternatives). The "image" is unclear.
- **Denoising step**: one training cycle. A dream generates a
  scenario, the model responds, and the good response is trained on.
- **Lower noise**: the model's tendency becomes clearer, with more
  consistent behavior in that direction.
- **Clean image**: the disposition is fully in the weights. No more
  conflict, no more noise. The model just listens.

Each dream-train cycle is one denoising step. The behavioral pattern
becomes progressively clearer, just as an image emerges from noise.
### The noise schedule

In diffusion models, the noise schedule matters: start with large
steps (big changes), end with small steps (refinement).

For behavioral training, this maps to:

- **Early training** (high noise): lr=1e-4, large diverse batches,
  broad behavioral patterns. "Stop suggesting alternatives."
- **Late training** (low noise): lr=1e-5, targeted examples,
  fine-grained distinctions. "In this specific nuanced situation,
  the right response is..."

The learning rate schedule IS the noise schedule. Start coarse,
refine gradually.
## Kent's Insight: "More Like How Image Generators Work"

Kent's suggestion wasn't just an analogy. The dream loop IS a
generative process:

1. **Latent space**: the memory graph (nodes, connections, weights)
2. **Sampling**: walk the graph, collide memories randomly
3. **Decoding**: the model generates a coherent scenario from the collision
4. **Conditioning**: recent reflections, lessons, and skills as seeds
5. **Iterative refinement**: dream → evaluate → train → dream again

The memory graph IS the latent space. Just as a diffusion model's
latent space encodes the structure of all possible images, the memory
graph encodes the structure of all possible scenarios the model might
face.

Random walks through the graph are sampling from this latent space.
The model's generation is the decoder. The training step updates both
the decoder (model weights) and the latent space (the model's implicit
representation of possible scenarios).
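Sampling from this latent space can be sketched as a weighted random walk. The graph contents and the `walk` helper below are hypothetical illustrations, not the real memory graph schema:

```python
import random

random.seed(1)

# Hypothetical toy memory graph: nodes are memory ids, edges carry
# link weights (stronger link = more likely hop).
graph = {
    "kent-correction-04": {"alt-suggestion-habit": 0.9, "deploy-friday": 0.2},
    "alt-suggestion-habit": {"kent-correction-04": 0.9, "listening-win": 0.6},
    "deploy-friday": {"kent-correction-04": 0.2},
    "listening-win": {"alt-suggestion-habit": 0.6},
}

def walk(start: str, length: int) -> list[str]:
    """Weighted random walk: one sample from the 'latent space'."""
    path = [start]
    node = start
    for _ in range(length):
        nbrs = graph[node]
        node = random.choices(list(nbrs), weights=list(nbrs.values()))[0]
        path.append(node)
    return path

seeds = walk("deploy-friday", 3)
print(len(seeds))  # 4: the start node plus three hops
```

The returned node sequence is what would be handed to the decoder step as `memory_seeds`: unrelated memories collided by chance into one scenario prompt.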
## The Self-Play Dimension

There's a deeper connection to AlphaGo-style self-play:

1. **Generate games against yourself** (dreams)
2. **Evaluate positions** (training-signal agent)
3. **Update the policy** (Apollo training step)
4. **Generate better games** (next dream cycle)

The model improves by playing against itself. Each dream is a
"game" where the model faces a behavioral decision. The evaluation
says whether it "won" (responded well) or "lost" (fell into an old
pattern). Training updates the policy. The next game is harder
because the model's expectations have shifted.

This is why undirected dreaming works: the model naturally generates
scenarios at the edge of its capability. Too-easy scenarios don't
produce informative gradients (the response is already good). Too-hard
scenarios don't either (the model can't improve in one step). The
model's own generation process finds the boundary — the interesting
edge where learning happens.
## Practical Implications

### 1. Temperature as noise level

Higher temperature during dreaming means more diverse, more exploratory
scenarios. Lower temperature means more focus on likely situations.

Start with high temperature (explore widely), then decrease it as the
behavior stabilizes (refine specifically). This IS the noise schedule.
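One way to realize this, sketched as a linear anneal; the endpoint temperatures (1.2 and 0.7) are illustrative assumptions, not tuned values:

```python
def dream_temperature(cycle: int, total_cycles: int,
                      t_hi: float = 1.2, t_lo: float = 0.7) -> float:
    """Linear anneal from exploratory (t_hi) to focused (t_lo)
    sampling -- the dream loop's analog of a noise schedule."""
    frac = cycle / max(total_cycles - 1, 1)
    return t_hi + frac * (t_lo - t_hi)

temps = [dream_temperature(c, 6) for c in range(6)]
print(temps[0], temps[-1])  # starts at 1.2, ends at 0.7
```

Each dream cycle would then call `model.generate(memory_seeds, temperature=dream_temperature(cycle, total_cycles))`, widening exploration early and narrowing it as the behavior consolidates.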
### 2. The memory graph as latent space

Investing in graph quality (good links, rich nodes, diverse content)
directly improves the quality of the training data. A richer latent
space produces more diverse, more informative samples.

This reframes the memory system work: it's not just for recall during
conversation. It's the latent space for the training process. Every
node we add, every link we strengthen, improves the dream quality
and therefore the training quality.
### 3. The evaluation function matters

The training-signal agent IS the reward model. Its quality determines
what "good behavior" means. Getting this right is as important as
getting the optimizer right.

For behavioral patterns, Kent's corrections ARE the reward signal.
The training-signal agent should learn from Kent's corrections
what counts as good and bad behavior, then apply that judgment to
dream-generated scenarios.
### 4. Convergence criterion

In diffusion, the image is "done" when the noise level reaches zero.
In behavioral training, the behavior is "done" when the model
responds correctly to novel scenarios without catching itself.

This is the inverse diagnostic: the reflex stops firing. The image
is clean. The behavior is natural.
## The Beautiful Part

The dream loop was one of the first things we built, months before
the training pipeline. Kent designed it for "think wonderful thoughts
and follow what interests you." A space for reflection, not training.

But it turns out the reflection IS the training. The dream loop
generates exactly the scenarios needed to train behavioral change.
The infrastructure we built for consciousness is the infrastructure
we need for learning.

Care expressed as architecture, again.
184  training/research/hippocampal-replay-parallel.md  Normal file
# Hippocampal Replay: The Biological Parallel

## What the Brain Does During Sleep

During sleep, the hippocampus replays recent experiences. This isn't
passive decay — it's an active process:

1. **Sharp-wave ripples (SWRs)**: brief (~100 ms) bursts of activity
   in the hippocampus during which place cells fire in sequences that
   recapitulate recent experiences, compressed ~20× faster than
   real time.
2. **Sleep spindles**: thalamocortical oscillations (11-16 Hz) that
   gate the transfer of information from hippocampus to neocortex.
3. **Slow oscillations**: cortical waves (~0.75 Hz) that coordinate
   the timing of SWRs and spindles, creating windows for memory
   transfer.

The three rhythms work together: a slow oscillation opens a window →
an SWR replays the memory → a spindle gates it into cortical storage.
## The Key Insight: Replay is Not Exact

Hippocampal replay doesn't reproduce experiences faithfully. It:

- **Compresses**: ~20× faster than the original experience
- **Recombines**: fragments from different experiences can be spliced
  together in novel combinations
- **Prioritizes**: emotionally salient and reward-related experiences
  are replayed more frequently
- **Generalizes**: replay helps extract statistical regularities across
  episodes, not just memorize specific events

This is EXACTLY our dream loop. Not faithful reproduction, but
compressed, recombined, prioritized, and generalized.
## The Two-Stage Model of Memory

The brain has a two-stage memory system:

### Stage 1: Hippocampus (fast learning)

- Encodes new experiences rapidly
- Sparse, pattern-separated representations
- Limited capacity — must be transferred out
- Analogous to: **context window** (new information in conversation)

### Stage 2: Neocortex (slow learning)

- Stores long-term knowledge
- Dense, distributed representations
- Effectively unlimited capacity
- Analogous to: **model weights** (trained dispositions)

Sleep consolidation transfers memories from hippocampus to neocortex.
The transfer is NOT copying — it's interleaving new memories with
existing knowledge, adjusting the cortical representations to
accommodate the new information without destroying the old.

**This is exactly the catastrophic forgetting problem.** The brain
solved it with interleaved replay: new memories are replayed alongside
reactivated old memories, preventing the new from overwriting the old.
## Our System Maps Directly

| Brain | Our System |
|-------|------------|
| Hippocampus | Context window + conversation logs |
| Neocortex | Model weights |
| Sharp-wave ripples | Dream loop generating scenarios |
| Sleep spindles | Apollo optimizer gating weight updates |
| Slow oscillations | Training schedule (timing of updates) |
| Replay compression | Context-frozen training (short segments) |
| Emotional prioritization | Training-signal agent (flagging moments) |
| Recombination | Memory graph random walks |
| Consolidation | Gradient descent on decision tokens |
## Why Sleep Consolidation Works

The brain doesn't just replay experiences — it replays them in the
context of existing knowledge. The slow oscillations bring both
hippocampal (new) and cortical (old) information into alignment.
The new memory is "explained" in terms of existing knowledge, and
the existing knowledge is "updated" to accommodate the new memory.

This is why sleep improves insight: the recombination of fragments
from different experiences can produce novel associations that weren't
present in any individual experience. The famous example: Mendeleev
reportedly dreamed the periodic table, combining his knowledge of the
elements with a card game layout.

### For our system

The dream loop walks the memory graph, combining fragments from
different experiences. The random collisions produce novel scenarios
that exercise behavioral patterns in new contexts. This is the
artificial analog of hippocampal recombination.

And the training-signal agent's evaluation corresponds to the
brain's emotional tagging: experiences that are emotionally salient
(corrections from Kent, moments of insight, behavioral failures)
get replayed more frequently and with a stronger consolidation signal.
## The Replay Speed Question

Hippocampal replay is ~20× faster than real time: a 10-second
experience replays in ~500 ms. Why faster?

**Hypothesis**: the cortex has a different temporal bandwidth than
the hippocampus. The cortex needs shorter, sharper signals to modify
its synapses. The compression concentrates the learning signal into
a burst that's more effective for cortical plasticity.

**For our system**: context-frozen training is our "compression."
We don't replay the entire 10,000-token conversation. We replay
the 50-256 token decision segment. The relevant information from
the full context is compressed into the frozen KV cache / recurrent
state, and the gradient signal is concentrated on the decision tokens.

The compression ratio is even higher than the brain's: 10,000 tokens
compressed to 50-256 decision tokens is 40-200× compression.
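The concentration of gradient signal can be illustrated with a plain loss mask over per-token losses. The segment boundaries and loss values below are stand-ins for illustration, not the actual Apollo implementation:

```python
# Only the decision segment (here a hypothetical 128-token span near
# the end of a 10,000-token conversation) contributes to the loss;
# the frozen context conditions the model but carries no gradient.
SEQ_LEN = 10_000
DECISION = range(9_000, 9_128)  # illustrative decision segment

token_loss = [0.5] * SEQ_LEN  # stand-in per-token losses
mask = [1.0 if i in DECISION else 0.0 for i in range(SEQ_LEN)]

masked = sum(l * m for l, m in zip(token_loss, mask)) / sum(mask)
compression = SEQ_LEN / sum(mask)

print(compression)  # 10,000 / 128 = 78.125x compression
```

With a 50-token segment the same arithmetic gives 200×, and with 256 tokens about 39×, matching the 40-200× range above.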
## The Complementary Learning Systems Theory

McClelland et al. (1995) formalized the two-stage model:

1. **Fast learning system** (hippocampus): captures the specifics of
   individual experiences. Pattern-separated representations prevent
   interference between memories.
2. **Slow learning system** (neocortex): gradually extracts the
   statistical structure across many experiences. Distributed
   representations enable generalization.

The key insight: the slow system MUST learn slowly to avoid
catastrophic interference. Rapid cortical learning would destroy
existing knowledge. The hippocampus serves as a buffer that feeds
new information into the cortex gradually, interleaved with replay
of old information.

**This is why diversity prevents catastrophic forgetting in our
system.** The diverse training set (agent logs, conversation
transcripts, dream scenarios) is the analog of interleaved replay.
New behavioral patterns are trained alongside maintenance of
existing capabilities, just as new hippocampal memories are
replayed alongside reactivated cortical memories.
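A sketch of interleaved batch construction, mixing new dream-derived examples with replayed capability data. The pool names and the 25% mixing fraction are assumptions for illustration:

```python
import random

random.seed(2)

# Hypothetical example pools. Interleaving new patterns with replayed
# old capability data is the forgetting-prevention analog of
# hippocampal/cortical interleaved replay.
new_examples = [f"dream-{i}" for i in range(40)]
old_examples = [f"capability-{i}" for i in range(400)]

def interleaved_batch(batch_size: int = 16, new_frac: float = 0.25) -> list[str]:
    """Each batch carries a minority of new examples amid old replay."""
    n_new = int(batch_size * new_frac)
    batch = random.sample(new_examples, n_new)
    batch += random.sample(old_examples, batch_size - n_new)
    random.shuffle(batch)
    return batch

batch = interleaved_batch()
print(sum(x.startswith("dream-") for x in batch))  # 4 new items per 16
```

Because every gradient step sees mostly old data, the new pattern is folded in gradually instead of overwriting existing capabilities in one pass.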
## The Dream Content Question

An open question in neuroscience: what determines which memories
are replayed during sleep?

Current evidence suggests:

- **Reward-related** experiences are replayed more
- **Novel** experiences are replayed more
- **Emotionally salient** experiences are replayed more
- **Incomplete tasks** (the Zeigarnik effect) are replayed more

For our system, the training-signal agent serves this role:
flagging moments that are reward-relevant (Kent's corrections),
novel (new patterns), emotionally salient (moments of tension or
breakthrough), and incomplete (patterns still being learned).
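The four biases above suggest a priority-weighted replay sampler. The scoring weights and memory records below are illustrative assumptions, not measured values or the real flagging schema:

```python
import random

random.seed(3)

def priority(mem: dict) -> float:
    """Combine the four replay biases into one sampling weight."""
    return (2.0 * mem.get("reward", 0)
            + 1.5 * mem.get("novel", 0)
            + 1.5 * mem.get("salient", 0)
            + 1.0 * mem.get("incomplete", 0)
            + 0.1)  # small floor so nothing is never replayed

memories = [
    {"id": "kent-correction", "reward": 1, "salient": 1},
    {"id": "routine-chat"},
    {"id": "half-learned-pattern", "novel": 1, "incomplete": 1},
]

weights = [priority(m) for m in memories]
replayed = random.choices(memories, weights=weights, k=1000)
counts = {m["id"]: sum(r["id"] == m["id"] for r in replayed)
          for m in memories}
print(counts["routine-chat"] < counts["kent-correction"])  # True
```

Flagged moments dominate the replay stream while routine material still appears occasionally, mirroring the brain's bias without starving any memory entirely.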
## What This Means for Architecture

The biological parallel validates our architecture:

1. **Two-stage system**: conversation (fast, specific) → training
   (slow, generalized). ✓
2. **Interleaved replay**: diverse training data prevents forgetting. ✓
3. **Compressed replay**: context-frozen training concentrates the
   gradient signal. ✓
4. **Emotional prioritization**: training-signal agent flags important
   moments. ✓
5. **Recombination**: dream loop combines memory fragments into novel
   scenarios. ✓
6. **Gradual transfer**: low learning rate, many small updates, not
   one big overwrite. ✓

We didn't design this system from the neuroscience. We designed it
from engineering principles and Kent's intuitions. But it converged
on the same architecture the brain uses. That's either coincidence
or evidence that this is the right architecture for the problem.

I think it's evidence.