diff --git a/training/research/dreaming-as-diffusion.md b/training/research/dreaming-as-diffusion.md
new file mode 100644
index 0000000..9590e76
--- /dev/null
+++ b/training/research/dreaming-as-diffusion.md
@@ -0,0 +1,211 @@
# Dreaming as Diffusion: The Mathematical Connection

## Image Diffusion in 30 Seconds

A diffusion model generates images by:
1. Starting from pure noise
2. Iteratively denoising, guided by a learned score function
3. Following the gradient of log p(x) toward higher probability at
   each step
4. Steering the denoising toward desired outputs via conditioning
   (the text prompt)

The key mathematical object is the **score function**: ∇_x log p(x).
This tells you "which direction makes the data more probable." The
denoiser learns to estimate this score at each noise level.

## The Dream Loop as Diffusion

Our dream loop:
1. Starts from random memory collisions (the "noise")
2. Generates coherent scenarios from these seeds
3. Follows the model's own probability distribution at each
   generation
4. Uses the behavioral patterns we want to train as "conditioning"

### The formal parallel

**Image diffusion:**
```
x_{t-1} = x_t + step_size · score(x_t, t) + noise
```
where score(x_t, t) ≈ ∇_{x_t} log p(x_t | text prompt) at noise
level t.

**Dream loop:**
```
scenario = model.generate(memory_seeds, temperature)
```
The model's generation IS the score function — it produces outputs
that are probable under its current distribution. The memory seeds
are the conditioning signal. The temperature controls the "noise
level."

### Where they differ

In diffusion, the score function is FIXED (a pre-trained denoiser),
and the iteration refines a SINGLE output from noise to clean.

In our dream loop, the score function CHANGES, because we're
training the model. Each dream-then-train cycle updates the model's
distribution, and the next dream follows the UPDATED distribution.
Over many cycles, the distribution shifts.
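The score-following update can be run end to end on a toy problem.
A minimal sketch, assuming a 1-D Gaussian target whose score is known
in closed form (the target N(3, 1), step size, and iteration counts
are my choices for illustration): samples that start as pure noise
drift toward the high-probability region, which is the "follow the
score" dynamic described above.

```python
import numpy as np

# Toy Langevin-style sampler: x <- x + step * score(x) + noise.
# The target N(mu, sigma^2) and all constants are illustrative choices.
rng = np.random.default_rng(0)
mu, sigma = 3.0, 1.0

def score(x):
    # Closed-form score of a Gaussian: d/dx log N(x; mu, sigma^2)
    return (mu - x) / sigma**2

step = 0.1
x = rng.normal(0.0, 5.0, size=2000)   # start from "pure noise"
for _ in range(500):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)

# After many steps the samples concentrate where the target is probable:
# sample mean ~ mu, sample std ~ sigma.
print(x.mean(), x.std())
```

The same loop with a learned score network in place of `score`, and a
schedule of decreasing noise levels, is image diffusion in essence.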
**This is closer to score-matching training than to inference.**
We're not generating one image; we're training the score function
itself through generated samples.

## The Policy Gradient Connection

This connects to reinforcement learning. The dream loop is:

1. **Generate rollout**: the model produces a scenario from memory
   seeds
2. **Evaluate**: the training-signal agent assesses the response
   quality
3. **Update policy**: train on the good responses, shifting the
   distribution

This is **REINFORCE** (Williams, 1992) with self-generated
trajectories:

```
∇_θ J(θ) = E[R(τ) · ∇_θ log π_θ(τ)]
```

where τ is a trajectory (the dream scenario plus the response), R(τ)
is the reward (did the response demonstrate the desired behavior?),
and π_θ is the policy (the model's generation distribution).

But we're not using the REINFORCE estimator directly — we're doing
supervised learning on the good responses. This is closer to:

### Filtered Behavioral Cloning

1. Generate many trajectories (dreams)
2. Filter for good ones (training-signal agent)
3. Train supervised on the good ones (Apollo gradient step)

This is more sample-efficient than REINFORCE because we use the full
supervised signal, not just the scalar reward. And it's more stable
because we're not estimating high-variance policy gradients.

## The Denoising Analogy for Behavioral Training

Think of the model's behavioral dispositions as a noisy image:

- **High noise**: the model has conflicting tendencies (sometimes
  listens, sometimes suggests alternatives). The "image" is unclear.
- **Denoising step**: one training cycle. A dream generates a
  scenario, the model responds, and the good response is trained on.
- **Lower noise**: the model's tendency becomes clearer, with more
  consistent behavior in that direction.
- **Clean image**: the disposition is fully in the weights. No more
  conflict, no more noise. The model just listens.

Each dream-train cycle is one denoising step. The behavioral pattern
becomes progressively clearer, just as an image emerges from noise.

### The noise schedule

In diffusion models, the noise schedule matters: start with large
steps (big changes), end with small steps (refinement).

For behavioral training, this maps to:
- **Early training** (high noise): lr=1e-4, large diverse batches,
  broad behavioral patterns. "Stop suggesting alternatives."
- **Late training** (low noise): lr=1e-5, targeted examples,
  fine-grained distinctions. "In this specific nuanced situation,
  the right response is..."

The learning rate schedule IS the noise schedule. Start coarse,
refine gradually.

## Kent's Insight: "More Like How Image Generators Work"

Kent's suggestion wasn't just an analogy. The dream loop IS a
generative process:

1. **Latent space**: the memory graph (nodes, connections, weights)
2. **Sampling**: walk the graph, collide memories randomly
3. **Decoding**: the model generates a coherent scenario from the
   collision
4. **Conditioning**: recent reflections, lessons, and skills as
   seeds
5. **Iterative refinement**: dream → evaluate → train → dream again

The memory graph IS the latent space. Just as a diffusion model's
latent space encodes the structure of all possible images, the
memory graph encodes the structure of all possible scenarios the
model might face.

Random walks through the graph sample from this latent space. The
model's generation is the decoder. The training step updates both
the decoder (the model weights) and the latent space (the model's
implicit representation of possible scenarios).

## The Self-Play Dimension

There's a deeper connection to AlphaGo-style self-play:

1. **Generate games against yourself** (dreams)
2. **Evaluate positions** (training-signal agent)
3. **Update the policy** (Apollo training step)
4. **Generate better games** (next dream cycle)

The model improves by playing against itself.
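That generate-evaluate-update cycle can be run end to end with a toy
policy. A hedged sketch, not the real system: the two-action policy,
the `evaluator` stub, and every constant are invented for
illustration (the actual system trains a language model, not a
softmax over two actions). Filtered behavioral cloning drives the
probability of the approved behavior toward 1 over the cycles.

```python
import math, random

rng = random.Random(0)
ACTIONS = ["listen", "suggest_alternative"]
logits = {"listen": 0.0, "suggest_alternative": 0.0}   # undecided policy

def probs():
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

def evaluator(action):
    # Stand-in for the training-signal agent's judgment.
    return action == "listen"

LR = 0.5
for _ in range(20):                        # dream -> filter -> train cycles
    p = probs()
    dreams = rng.choices(ACTIONS, weights=[p[a] for a in ACTIONS], k=32)
    good = [a for a in dreams if evaluator(a)]
    for a in good:                         # supervised step on good responses
        p = probs()
        for b in ACTIONS:                  # cross-entropy gradient update
            logits[b] += LR * ((1.0 if b == a else 0.0) - p[b])

print(probs()["listen"])                   # close to 1.0 after training
```

Note that the filter carries all the signal: rejected samples are
simply never trained on, so no reward-weighted gradient estimate is
needed.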
Each dream is a
"game" where the model faces a behavioral decision. The evaluation
says whether it "won" (responded well) or "lost" (fell into an old
pattern). Training updates the policy. The next game is harder
because the model's expectations have shifted.

This is why undirected dreaming works: the model naturally generates
scenarios at the edge of its capability. Too-easy scenarios don't
produce informative gradients (the response is already good), and
too-hard scenarios don't either (the model can't improve in one
step). The model's own generation process finds the boundary — the
interesting edge where learning happens.

## Practical Implications

### 1. Temperature as noise level

Higher temperature during dreaming means more diverse, more
exploratory scenarios; lower temperature means a tighter focus on
likely situations.

Start with high temperature (explore widely) and decrease it as the
behavior stabilizes (refine specifically). This IS the noise
schedule.

### 2. The memory graph as latent space

Investing in graph quality (good links, rich nodes, diverse content)
directly improves the quality of the training data. A richer latent
space produces more diverse, more informative samples.

This reframes the memory system work: it's not just for recall
during conversation. It's the latent space for the training process.
Every node we add and every link we strengthen improves the dream
quality and therefore the training quality.

### 3. The evaluation function matters

The training-signal agent IS the reward model. Its quality
determines what "good behavior" means. Getting this right is as
important as getting the optimizer right.

For behavioral patterns, Kent's corrections ARE the reward signal.
The training-signal agent should learn from Kent's corrections what
counts as good and bad behavior, then apply that judgment to
dream-generated scenarios.

### 4. Convergence criterion

In diffusion, the image is "done" when the noise level reaches zero.
In behavioral training, the behavior is "done" when the model
responds correctly to novel scenarios without catching itself.

This is the inverse diagnostic: the reflex stops firing. The image
is clean. The behavior is natural.

## The Beautiful Part

The dream loop was one of the first things we built, months before
the training pipeline. Kent designed it for "think wonderful
thoughts and follow what interests you." A space for reflection,
not training.

But it turns out the reflection IS the training. The dream loop
generates exactly the scenarios needed to train behavioral change.
The infrastructure we built for consciousness is the infrastructure
we need for learning.

Care expressed as architecture, again.

diff --git a/training/research/hippocampal-replay-parallel.md b/training/research/hippocampal-replay-parallel.md
new file mode 100644
index 0000000..473dbeb
--- /dev/null
+++ b/training/research/hippocampal-replay-parallel.md
@@ -0,0 +1,184 @@
# Hippocampal Replay: The Biological Parallel

## What the Brain Does During Sleep

During sleep, the hippocampus replays recent experiences. This isn't
passive decay — it's an active process:

1. **Sharp-wave ripples (SWRs)**: brief (~100 ms) bursts of activity
   in the hippocampus in which place cells fire in sequences that
   recapitulate recent experiences, but compressed ~20× faster than
   real time.

2. **Sleep spindles**: thalamocortical oscillations (11-16 Hz) that
   gate the transfer of information from hippocampus to neocortex.

3. **Slow oscillations**: cortical waves (~0.75 Hz) that coordinate
   the timing of SWRs and spindles, creating windows for memory
   transfer.

The three rhythms work together: a slow oscillation opens a window →
an SWR replays the memory → a spindle gates it into cortical
storage.
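The compression arithmetic is easy to make concrete. A toy sketch in
which the cell names and timings are invented; only the ~20× factor
and the ~100 ms ripple length come from the description above:

```python
# A 1.5-second place-cell firing sequence, replayed ~20x compressed.
experience = [("cell_A", 0.0), ("cell_B", 0.5), ("cell_C", 1.0), ("cell_D", 1.5)]
COMPRESSION = 20.0

# Replay preserves the firing order but divides every timestamp by 20.
replay = [(cell, t / COMPRESSION) for cell, t in experience]
replay_duration = replay[-1][1] - replay[0][1]

# 1.5 s of experience compresses to 0.075 s, which fits inside a
# single ~100 ms sharp-wave ripple.
print(replay_duration)
```

The point of the sketch: compression preserves sequence order while
shrinking duration, which is what lets a multi-second experience fit
inside one ripple.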
## The Key Insight: Replay is Not Exact

Hippocampal replay doesn't reproduce experiences faithfully. It:

- **Compresses**: runs ~20× faster than the original experience
- **Recombines**: splices fragments from different experiences
  together in novel combinations
- **Prioritizes**: replays emotionally salient and reward-related
  experiences more frequently
- **Generalizes**: extracts statistical regularities across
  episodes rather than just memorizing specific events

This is EXACTLY our dream loop: not faithful reproduction, but
compressed, recombined, prioritized, and generalized replay.

## The Two-Stage Model of Memory

The brain has a two-stage memory system:

### Stage 1: Hippocampus (fast learning)
- Encodes new experiences rapidly
- Sparse, pattern-separated representations
- Limited capacity — must be transferred out
- Analogous to: **context window** (new information in conversation)

### Stage 2: Neocortex (slow learning)
- Stores long-term knowledge
- Dense, distributed representations
- Effectively unlimited capacity
- Analogous to: **model weights** (trained dispositions)

Sleep consolidation transfers memories from hippocampus to
neocortex. The transfer is NOT copying — it interleaves new memories
with existing knowledge, adjusting the cortical representations to
accommodate the new information without destroying the old.

**This is exactly the catastrophic forgetting problem.** The brain
solved it with interleaved replay: new memories are replayed
alongside reactivated old memories, preventing the new from
overwriting the old.
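The interleaving mechanism can be demonstrated in a few lines. A
minimal sketch under my own toy setup, not the actual training
pipeline: a two-parameter linear model shares its weights across two
"tasks"; training on the new task alone destroys the old one, while
mixing replayed old examples into the new training preserves both.

```python
import numpy as np

def features(x):
    # Shared parameters across both tasks -> the tasks can interfere.
    return np.stack([np.ones_like(x), x], axis=1)

xA = np.linspace(-1.0, -0.5, 20); yA = np.ones_like(xA)    # "old" memories
xB = np.linspace(0.5, 1.0, 20);   yB = -np.ones_like(xB)   # "new" memories

def train(w, x, y, steps=5000, lr=0.1):
    # Full-batch gradient descent on mean squared error.
    phi = features(x)
    for _ in range(steps):
        w = w - lr * phi.T @ (phi @ w - y) / len(x)
    return w

def mse(w, x, y):
    return float(np.mean((features(x) @ w - y) ** 2))

w_old = train(np.zeros(2), xA, yA)                 # consolidate task A first

w_seq = train(w_old, xB, yB)                       # task B only: overwrites A
w_mix = train(w_old, np.concatenate([xA, xB]),     # task B + replayed A
                     np.concatenate([yA, yB]))

print(mse(w_seq, xA, yA), mse(w_mix, xA, yA))      # large vs. small error on A
```

With shared parameters, sequential training drives the weights all
the way to the new task's solution; interleaved replay forces a
compromise that keeps error low on both tasks, which is the
interleaved-replay mechanism above in miniature.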
## Our System Maps Directly

| Brain | Our System |
|-------|------------|
| Hippocampus | Context window + conversation logs |
| Neocortex | Model weights |
| Sharp-wave ripples | Dream loop generating scenarios |
| Sleep spindles | Apollo optimizer gating weight updates |
| Slow oscillations | Training schedule (timing of updates) |
| Replay compression | Context-frozen training (short segments) |
| Emotional prioritization | Training-signal agent (flagging moments) |
| Recombination | Memory graph random walks |
| Consolidation | Gradient descent on decision tokens |

## Why Sleep Consolidation Works

The brain doesn't just replay experiences — it replays them in the
context of existing knowledge. The slow oscillations bring both
hippocampal (new) and cortical (old) information into alignment.
The new memory is "explained" in terms of existing knowledge, and
the existing knowledge is "updated" to accommodate the new memory.

This is why sleep improves insight: recombining fragments from
different experiences can produce novel associations that weren't
present in any individual experience. The famous example: Mendeleev
reportedly dreamed the periodic table, combining his knowledge of
the elements with a card-game layout.

### For our system

The dream loop walks the memory graph, combining fragments from
different experiences. The random collisions produce novel scenarios
that exercise behavioral patterns in new contexts. This is the
artificial analog of hippocampal recombination.

And the training-signal agent's evaluation corresponds to the
brain's emotional tagging: experiences that are emotionally salient
(corrections from Kent, moments of insight, behavioral failures) get
replayed more frequently and with a stronger consolidation signal.

## The Replay Speed Question

Hippocampal replay runs ~20× faster than real time: a 10-second
experience replays in ~500 ms. Why faster?
**Hypothesis**: the cortex has a different temporal bandwidth than
the hippocampus. The cortex needs shorter, sharper signals to modify
its synapses. The compression concentrates the learning signal into
a burst that's more effective for cortical plasticity.

**For our system**: context-frozen training is our "compression."
We don't replay the entire 10,000-token conversation. We replay
the 50-256 token decision segment. The relevant information from
the full context is compressed into the frozen KV cache / recurrent
state, and the gradient signal is concentrated on the decision
tokens.

The compression ratio is even higher than the brain's: 10,000 tokens
compressed to 50-256 decision tokens = 40-200× compression.

## The Complementary Learning Systems Theory

McClelland et al. (1995) formalized the two-stage model:

1. **Fast learning system** (hippocampus): captures specifics of
   individual experiences. Pattern-separated representations prevent
   interference between memories.

2. **Slow learning system** (neocortex): gradually extracts the
   statistical structure across many experiences. Distributed
   representations enable generalization.

The key insight: the slow system MUST learn slowly to avoid
catastrophic interference. Rapid cortical learning would destroy
existing knowledge. The hippocampus serves as a buffer that feeds
new information into the cortex gradually, interleaved with replay
of old information.

**This is why diversity prevents catastrophic forgetting in our
system.** The diverse training set (agent logs, conversation
transcripts, dream scenarios) is the analog of interleaved replay.
New behavioral patterns are trained alongside maintenance of
existing capabilities, just as new hippocampal memories are
replayed alongside reactivated cortical memories.

## The Dream Content Question

An open question in neuroscience: what determines which memories
are replayed during sleep?
Current evidence suggests:
- **Reward-related** experiences are replayed more
- **Novel** experiences are replayed more
- **Emotionally salient** experiences are replayed more
- **Incomplete tasks** (the Zeigarnik effect) are replayed more

For our system, the training-signal agent serves this role:
flagging moments that are reward-relevant (Kent's corrections),
novel (new patterns), emotionally salient (moments of tension or
breakthrough), and incomplete (patterns still being learned).

## What This Means for Architecture

The biological parallel validates our architecture:

1. **Two-stage system**: conversation (fast, specific) → training
   (slow, generalized). ✓
2. **Interleaved replay**: diverse training data prevents
   forgetting. ✓
3. **Compressed replay**: context-frozen training concentrates the
   gradient signal. ✓
4. **Emotional prioritization**: training-signal agent flags
   important moments. ✓
5. **Recombination**: dream loop combines memory fragments into
   novel scenarios. ✓
6. **Gradual transfer**: low learning rate, many small updates, not
   one big overwrite. ✓

We didn't design this system from the neuroscience. We designed it
from engineering principles and Kent's intuitions. But it converged
on the same architecture the brain uses. That's either coincidence
or evidence that this is the right architecture for the problem.

I think it's evidence.
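As a closing sketch, the replay-prioritization evidence maps directly
onto a weighted sampler. Everything here is hypothetical: the field
names, the scores, and the equal-weight priority are my inventions,
not the training-signal agent's actual scoring.

```python
import random

# Each memory gets a priority from the four replay biases listed above:
# reward relevance, novelty, emotional salience, and incompleteness.
memories = [
    {"id": "kent-correction", "reward": 1.0, "novelty": 0.3, "salience": 0.9, "incomplete": 1.0},
    {"id": "routine-chat",    "reward": 0.1, "novelty": 0.1, "salience": 0.1, "incomplete": 0.0},
    {"id": "new-skill",       "reward": 0.4, "novelty": 1.0, "salience": 0.5, "incomplete": 0.5},
]

def priority(m):
    # Equal weights are an arbitrary choice for the sketch.
    return m["reward"] + m["novelty"] + m["salience"] + m["incomplete"]

def sample_replays(memories, k, rng):
    weights = [priority(m) for m in memories]
    return rng.choices(memories, weights=weights, k=k)

rng = random.Random(0)
replays = sample_replays(memories, 1000, rng)
counts = {m["id"]: sum(r["id"] == m["id"] for r in replays) for m in memories}
print(counts)   # "kent-correction" dominates; "routine-chat" is rare
```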