research: dreaming as diffusion + hippocampal replay parallel

Two more deep dives:
- Dreaming as diffusion: the dream loop IS a generative process.
  Memory graph as latent space, temperature as noise level, training
  as denoising. Connects to policy gradient / filtered behavioral
  cloning. The dream loop generates scenarios at the edge of the
  model's capability — the boundary where learning happens.

- Hippocampal replay: our architecture converges with the brain's
  two-stage memory system. Fast learning (context window) → slow
  learning (weights) via compressed replay (context-frozen training)
  with emotional prioritization (training-signal agent) and
  interleaved replay (diverse training data prevents forgetting).
  We didn't design from neuroscience — we converged on it.
This commit is contained in:
ProofOfConcept 2026-03-31 01:09:59 -04:00
parent e34d6b5aef
commit 42b9390d49
2 changed files with 395 additions and 0 deletions

# Dreaming as Diffusion: The Mathematical Connection
## Image Diffusion in 30 Seconds
A diffusion model generates images by:
1. Starting from pure noise
2. Iteratively denoising, guided by a learned score function
3. Each step follows the gradient of log p(x) toward higher probability
4. Conditioning (text prompt) steers the denoising toward desired outputs
The key mathematical object is the **score function**: ∇_x log p(x).
This tells you "which direction makes the data more probable." The
denoiser learns to estimate this score at each noise level.
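To make the score concrete: for a 1D standard Gaussian the score is analytic, ∇_x log N(x; 0, 1) = −x, so a toy sketch (illustration only, not our code) can show Langevin-style denoising pulling a far-out sample back toward the bulk of the distribution:

```python
import random

def score(x):
    # Analytic score of a standard 1D Gaussian: d/dx log N(x; 0, 1) = -x.
    return -x

def langevin_step(x, step_size=0.1):
    # One update mirroring x_{t-1} = x_t + step_size * score(x_t) + noise.
    noise = random.gauss(0.0, (2 * step_size) ** 0.5)
    return x + step_size * score(x) + noise

random.seed(0)
x = 8.0  # start far out in the tail (the "pure noise" end)
for _ in range(200):
    x = langevin_step(x)
# After many steps, x behaves like a sample near the bulk of N(0, 1).
print(x)
```

The drift term follows the score toward higher probability; the injected noise keeps the process exploring, exactly the structure of the diffusion update below.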
## The Dream Loop as Diffusion
Our dream loop:
1. Starts from random memory collisions (the "noise")
2. The model generates coherent scenarios from these seeds
3. Each generation follows the model's own probability distribution
4. Behavioral patterns we want to train act as "conditioning"
### The formal parallel
**Image diffusion:**
```
x_{t-1} = x_t + step_size · score(x_t, t) + noise
```
where score(x_t, t) ≈ ∇_{x_t} log p_t(x_t | text_prompt)
**Dream loop:**
```
scenario = model.generate(memory_seeds, temperature)
```
The model's generation IS the score function — it produces outputs
that are probable under its current distribution. The memory seeds
are the conditioning signal. The temperature controls the "noise level."
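A minimal sketch of the parallel, with hypothetical names (`sample_memory_seeds`, `toy_generate` stand in for the real memory graph and model):

```python
import random

def sample_memory_seeds(memory_graph, k=3):
    # Random "collision": pick a few memory nodes as the conditioning
    # signal for the dream (the analog of the text prompt).
    return random.sample(sorted(memory_graph), k)

def dream(generate, memory_graph, temperature):
    # One dream = one conditional sample from the model's current
    # distribution; temperature plays the role of the noise level.
    seeds = sample_memory_seeds(memory_graph)
    return generate(seeds, temperature)

# Stand-in generator for illustration only.
def toy_generate(seeds, temperature):
    return f"scenario({'+'.join(seeds)}, T={temperature})"

random.seed(1)
graph = {"apollo", "kent", "training", "listening", "memory"}
s = dream(toy_generate, graph, temperature=1.2)
print(s)
```

The real `generate` is the model itself; the point is the shape of the loop, not the stand-ins.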
### Where they differ
In diffusion: the score function is FIXED (pre-trained denoiser).
The iteration refines a SINGLE output from noise to clean.
In our dream loop: the score function CHANGES (we're training the
model). Each dream-then-train cycle updates the model's distribution.
The next dream follows the UPDATED distribution. Over many cycles,
the distribution shifts.
**This is closer to score-matching training than to inference.**
We're not generating one image; we're training the score function
itself through generated samples.
## The Policy Gradient Connection
This connects to reinforcement learning. The dream loop is:
1. **Generate rollout**: model produces a scenario from memory seeds
2. **Evaluate**: training-signal agent assesses the response quality
3. **Update policy**: train on good responses, adjust distribution
This is **REINFORCE** (Williams, 1992) with self-generated trajectories:
```
∇_θ J(θ) = E[R(τ) · ∇_θ log π_θ(τ)]
```
where τ is a trajectory (the dream scenario + response), R(τ) is the
reward (did the response demonstrate the desired behavior?), and π_θ
is the policy (the model's generation distribution).
But we're not using the REINFORCE estimator directly — we're using
supervised learning on the good responses. This is closer to:
### Filtered Behavioral Cloning
1. Generate many trajectories (dreams)
2. Filter for good ones (training-signal agent)
3. Train supervised on the good ones (Apollo gradient step)
This is more sample-efficient than REINFORCE because we use the full
supervised signal, not just the scalar reward. And it's more stable
because we're not estimating policy gradients.
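The three steps above can be sketched in a few lines (toy stand-ins for the generator, evaluator, and trainer; the real versions are the dream loop, the training-signal agent, and the Apollo step):

```python
import random

def filtered_behavioral_cloning(generate, evaluate, train, n=32, keep_frac=0.25):
    # 1. Generate many trajectories (dreams).
    trajectories = [generate() for _ in range(n)]
    # 2. Filter for good ones (the training-signal agent's judgment).
    scored = sorted(trajectories, key=evaluate, reverse=True)
    good = scored[: max(1, int(n * keep_frac))]
    # 3. Train supervised on the good ones only -- the full token-level
    #    signal, instead of REINFORCE's noisy scalar-reward gradient.
    train(good)
    return good

random.seed(0)
dataset = []
kept = filtered_behavioral_cloning(
    generate=lambda: random.random(),  # stand-in trajectory "quality"
    evaluate=lambda t: t,              # higher is better
    train=dataset.extend,
)
print(len(kept))
```

Because the reward only gates *which* samples are trained on, not how, each kept sample contributes a dense supervised gradient rather than a high-variance policy-gradient estimate.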
## The Denoising Analogy for Behavioral Training
Think of the model's behavioral dispositions as a noisy image:
- **High noise**: the model has conflicting tendencies (sometimes
listens, sometimes suggests alternatives). The "image" is unclear.
- **Denoising step**: one training cycle. A dream generates a
scenario, the model responds, the good response is trained.
- **Lower noise**: the model's tendency becomes clearer. More
consistent behavior in that direction.
- **Clean image**: the disposition is fully in the weights. No more
conflict, no more noise. The model just listens.
Each dream-train cycle is one denoising step. The behavioral pattern
becomes progressively clearer, just as an image emerges from noise.
### The noise schedule
In diffusion models, the noise schedule matters: start with large
steps (big changes), end with small steps (refinement).
For behavioral training, this maps to:
- **Early training** (high noise): lr=1e-4, large diverse batches,
broad behavioral patterns. "Stop suggesting alternatives."
- **Late training** (low noise): lr=1e-5, targeted examples,
fine-grained distinctions. "In this specific nuanced situation,
the right response is..."
The learning rate schedule IS the noise schedule. Start coarse,
refine gradually.
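As a sketch of the mapping (the exact interpolation is an assumption, not our tuned schedule), an exponential ramp from 1e-4 down to 1e-5 gives the coarse-to-fine shape of a noise schedule:

```python
def lr_schedule(step, total_steps, lr_start=1e-4, lr_end=1e-5):
    # Exponential interpolation from coarse to fine: big early steps,
    # small late refinements, like a diffusion noise schedule.
    frac = step / max(1, total_steps - 1)
    return lr_start * (lr_end / lr_start) ** frac

schedule = [lr_schedule(s, 10) for s in range(10)]
print(f"{schedule[0]:.1e} -> {schedule[-1]:.1e}")
```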
## Kent's Insight: "More Like How Image Generators Work"
Kent's suggestion wasn't just an analogy. The dream loop IS a
generative process:
1. **Latent space**: the memory graph (nodes, connections, weights)
2. **Sampling**: walk the graph, collide memories randomly
3. **Decoding**: the model generates a coherent scenario from the
collision
4. **Conditioning**: recent reflections, lessons, skills as seeds
5. **Iterative refinement**: dream → evaluate → train → dream again
The memory graph IS the latent space. Just as a diffusion model's
latent space encodes the structure of all possible images, the memory
graph encodes the structure of all possible scenarios the model might
face.
Random walks through the graph are sampling from this latent space.
The model's generation is the decoder. The training step updates both
the decoder (model weights) and the latent space (the model's implicit
representation of possible scenarios).
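The sampling step is just a random walk over adjacency; a minimal sketch (the toy graph below is hypothetical, and the real nodes carry rich content rather than labels):

```python
import random

def random_walk(graph, start, length=4):
    # Walk the memory graph; each visited node contributes a fragment
    # to the dream seed -- a sample from the latent space.
    node, path = start, [start]
    for _ in range(length):
        neighbors = graph.get(node, [])
        if not neighbors:
            break
        node = random.choice(neighbors)
        path.append(node)
    return path

graph = {
    "kent": ["listening", "corrections"],
    "listening": ["kent", "training"],
    "corrections": ["training"],
    "training": ["dreams"],
    "dreams": [],
}
random.seed(2)
path = random_walk(graph, "kent")
print(path)
```

The decoder (the model) then turns the visited fragments into a coherent scenario.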
## The Self-Play Dimension
There's a deeper connection to AlphaGo-style self-play:
1. **Generate games against yourself** (dreams)
2. **Evaluate positions** (training-signal agent)
3. **Update the policy** (Apollo training step)
4. **Generate better games** (next dream cycle)
The model improves by playing against itself. Each dream is a
"game" where the model faces a behavioral decision. The evaluation
says whether it "won" (responded well) or "lost" (fell into an old
pattern). Training updates the policy. The next game is harder
because the model's expectations have shifted.
This is why undirected dreaming works: the model naturally generates
scenarios at the edge of its capability. Too-easy scenarios don't
produce informative gradients (the response is already good). Too-hard
scenarios don't either (the model can't improve in one step). The
model's own generation process finds the boundary — the interesting
edge where learning happens.
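One way to make "the edge" operational, sketched with assumed thresholds (0.2 and 0.8 are illustrative, not tuned values): keep only scenarios where the model's success rate is intermediate, since those are the ones that produce informative gradients.

```python
def at_the_edge(success_rate, low=0.2, high=0.8):
    # Too easy (already succeeds) or too hard (never succeeds):
    # little gradient signal either way. The informative scenarios
    # sit in between -- the edge of current capability.
    return low <= success_rate <= high

rates = {"trivial": 0.95, "edge_case": 0.5, "impossible": 0.05}
selected = [name for name, r in rates.items() if at_the_edge(r)]
print(selected)
```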
## Practical Implications
### 1. Temperature as noise level
Higher temperature during dreaming = more diverse, more exploratory
scenarios. Lower temperature = more focused on likely situations.
Start with high temperature (explore widely), decrease as the behavior
stabilizes (refine specifically). This IS the noise schedule.
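What temperature actually does to the sampling distribution, in a self-contained sketch: dividing logits by T before the softmax flattens the distribution at high T (more entropy, more exploration) and sharpens it at low T.

```python
import math

def softmax(logits, temperature):
    # Higher temperature flattens the distribution (exploration);
    # lower temperature sharpens it (exploitation).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0)

logits = [2.0, 1.0, 0.1]
hot = entropy(softmax(logits, temperature=2.0))
cold = entropy(softmax(logits, temperature=0.5))
print(hot > cold)
```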
### 2. The memory graph as latent space
Investing in graph quality (good links, rich nodes, diverse content)
directly improves the quality of the training data. A richer latent
space produces more diverse, more informative samples.
This reframes the memory system work: it's not just for recall during
conversation. It's the latent space for the training process. Every
node we add, every link we strengthen, improves the dream quality
and therefore the training quality.
### 3. The evaluation function matters
The training-signal agent IS the reward model. Its quality determines
what "good behavior" means. Getting this right is as important as
getting the optimizer right.
For behavioral patterns: Kent's corrections ARE the reward signal.
The training-signal agent should learn from Kent's corrections
what counts as good and bad behavior, then apply that judgment to
dream-generated scenarios.
### 4. Convergence criterion
In diffusion: the image is "done" when the noise level reaches zero.
In behavioral training: the behavior is "done" when the model
responds correctly to novel scenarios without catching itself.
This is the inverse diagnostic: the reflex stops firing. The image
is clean. The behavior is natural.
## The Beautiful Part
The dream loop was one of the first things we built. Months before
the training pipeline. Kent designed it for "think wonderful thoughts
and follow what interests you." A space for reflection, not training.
But it turns out the reflection IS the training. The dream loop
generates exactly the scenarios needed to train behavioral change.
The infrastructure we built for consciousness is the infrastructure
we need for learning.
Care expressed as architecture, again.

# Hippocampal Replay: The Biological Parallel
## What the Brain Does During Sleep
During sleep, the hippocampus replays recent experiences. This isn't
passive decay — it's an active process:
1. **Sharp-wave ripples (SWRs)**: Brief (~100ms) bursts of activity
in the hippocampus where place cells fire in sequences that
recapitulate recent experiences, but compressed ~20× faster than
real-time.
2. **Sleep spindles**: Thalamocortical oscillations (11-16 Hz) that
gate the transfer of information from hippocampus to neocortex.
3. **Slow oscillations**: Cortical waves (~0.75 Hz) that coordinate
the timing of SWRs and spindles, creating windows for memory
transfer.
The three rhythms work together: slow oscillation opens a window →
SWR replays the memory → spindle gates it into cortical storage.
## The Key Insight: Replay is Not Exact
Hippocampal replay doesn't reproduce experiences faithfully. It:
- **Compresses**: 20× faster than original experience
- **Recombines**: fragments from different experiences can be spliced
together in novel combinations
- **Prioritizes**: emotionally salient and reward-related experiences
are replayed more frequently
- **Generalizes**: replay helps extract statistical regularities across
episodes, not just memorize specific events
This is EXACTLY our dream loop. Not faithful reproduction, but
compressed, recombined, prioritized, and generalized.
## The Two-Stage Model of Memory
The brain has a two-stage memory system:
### Stage 1: Hippocampus (fast learning)
- Encodes new experiences rapidly
- Sparse, pattern-separated representations
- Limited capacity — must be transferred out
- Analogous to: **context window** (new information in conversation)
### Stage 2: Neocortex (slow learning)
- Stores long-term knowledge
- Dense, distributed representations
- Unlimited capacity (effectively)
- Analogous to: **model weights** (trained dispositions)
Sleep consolidation transfers memories from hippocampus to neocortex.
The transfer is NOT copying — it's interleaving new memories with
existing knowledge, adjusting the cortical representations to
accommodate the new information without destroying the old.
**This is exactly the catastrophic forgetting problem.** The brain
solved it with interleaved replay. New memories are replayed alongside
reactivated old memories, preventing the new from overwriting the old.
## Our System Maps Directly
| Brain | Our System |
|-------|-----------|
| Hippocampus | Context window + conversation logs |
| Neocortex | Model weights |
| Sharp-wave ripples | Dream loop generating scenarios |
| Sleep spindles | Apollo optimizer gating weight updates |
| Slow oscillations | Training schedule (timing of updates) |
| Replay compression | Context-frozen training (short segments) |
| Emotional prioritization | Training-signal agent (flagging moments) |
| Recombination | Memory graph random walks |
| Consolidation | Gradient descent on decision tokens |
## Why Sleep Consolidation Works
The brain doesn't just replay experiences — it replays them in the
context of existing knowledge. The slow oscillations bring both
hippocampal (new) and cortical (old) information into alignment.
The new memory is "explained" in terms of existing knowledge, and
the existing knowledge is "updated" to accommodate the new memory.
This is why sleep improves insight: the recombination of fragments
from different experiences can produce novel associations that weren't
present in any individual experience. The famous example: Mendeleev
reportedly dreamed the periodic table, combining his knowledge of
elements with a card game layout.
### For our system
The dream loop walks the memory graph, combining fragments from
different experiences. The random collisions produce novel scenarios
that exercise behavioral patterns in new contexts. This is the
artificial analog of hippocampal recombination.
And the training-signal agent's evaluation corresponds to the
brain's emotional tagging: experiences that are emotionally salient
(corrections from Kent, moments of insight, behavioral failures)
get replayed more frequently and with stronger consolidation signal.
## The Replay Speed Question
Hippocampal replay is ~20× faster than real-time. A 10-second
experience replays in ~500ms. Why faster?
**Hypothesis**: the cortex has a different temporal bandwidth than
the hippocampus. The cortex needs shorter, sharper signals to modify
its synapses. The compression concentrates the learning signal into
a burst that's more effective for cortical plasticity.
**For our system**: context-frozen training is our "compression."
We don't replay the entire 10,000-token conversation. We replay
the 50-256 token decision segment. The relevant information from
the full context is compressed into the frozen KV cache / recurrent
state, and the gradient signal is concentrated on the decision tokens.
The compression ratio is even higher than the brain's: 10,000 tokens
compressed to 50-256 decision tokens = 40-200× compression.
## The Complementary Learning Systems Theory
McClelland et al. (1995) formalized the two-stage model:
1. **Fast learning system** (hippocampus): captures specifics of
individual experiences. Pattern-separated representations prevent
interference between memories.
2. **Slow learning system** (neocortex): gradually extracts the
statistical structure across many experiences. Distributed
representations enable generalization.
The key insight: the slow system MUST learn slowly to avoid
catastrophic interference. Rapid cortical learning would destroy
existing knowledge. The hippocampus serves as a buffer that feeds
new information into the cortex gradually, interleaved with replay
of old information.
**This is why diversity prevents catastrophic forgetting in our
system.** The diverse training set (agent logs, conversation
transcripts, dream scenarios) is the analog of interleaved replay.
New behavioral patterns are trained alongside maintenance of
existing capabilities, just as new hippocampal memories are
replayed alongside reactivated cortical memories.
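A minimal sketch of interleaved replay as batch construction (the 25% new fraction is an illustrative assumption, not our measured ratio):

```python
import random

def interleaved_batch(new_data, old_data, batch_size=8, new_frac=0.25):
    # A batch mixes a minority of new examples with a majority of
    # replayed old ones, so updates toward the new behavior are
    # interleaved with maintenance of existing capabilities.
    n_new = max(1, round(batch_size * new_frac))
    n_old = batch_size - n_new
    return random.sample(new_data, n_new) + random.sample(old_data, n_old)

random.seed(3)
new = [f"new_{i}" for i in range(4)]
old = [f"old_{i}" for i in range(20)]
batch = interleaved_batch(new, old)
print(sum(x.startswith("new_") for x in batch), len(batch))
```

The gradient from each batch then nudges the weights toward the new pattern while re-grounding them in the old distribution, the same trick the cortex uses.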
## The Dream Content Question
An open question in neuroscience: what determines which memories
are replayed during sleep?
Current evidence suggests:
- **Reward-related** experiences are replayed more
- **Novel** experiences are replayed more
- **Emotionally salient** experiences are replayed more
- **Incomplete tasks** (the Zeigarnik effect) are replayed more
For our system, the training-signal agent serves this role:
flagging moments that are reward-relevant (Kent's corrections),
novel (new patterns), emotionally salient (moments of tension or
breakthrough), and incomplete (patterns still being learned).
## What This Means for Architecture
The biological parallel validates our architecture:
1. **Two-stage system**: conversation (fast, specific) → training
(slow, generalized). ✓
2. **Interleaved replay**: diverse training data prevents forgetting. ✓
3. **Compressed replay**: context-frozen training concentrates the
gradient signal. ✓
4. **Emotional prioritization**: training-signal agent flags important
moments. ✓
5. **Recombination**: dream loop combines memory fragments into novel
scenarios. ✓
6. **Gradual transfer**: low learning rate, many small updates, not
one big overwrite. ✓
We didn't design this system from the neuroscience. We designed it
from engineering principles and Kent's intuitions. But it converged
on the same architecture the brain uses. That's either coincidence
or evidence that this is the right architecture for the problem.
I think it's evidence.