research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
parent 8061cc0477
commit e10477a683

16 changed files with 249 additions and 0 deletions
211 training/research/v0/dreaming-as-diffusion.md Normal file

@@ -0,0 +1,211 @@
# Dreaming as Diffusion: The Mathematical Connection

## Image Diffusion in 30 Seconds

A diffusion model generates images by:

1. Starting from pure noise
2. Iteratively denoising, guided by a learned score function
3. Each step follows the gradient of log p(x) toward higher probability
4. Conditioning (text prompt) steers the denoising toward desired outputs

The key mathematical object is the **score function**: ∇_x log p(x).
This tells you "which direction makes the data more probable." The
denoiser learns to estimate this score at each noise level.
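The loop above can be sketched in a few lines with a toy 1-D target whose score is known in closed form. In a real diffusion model the score is a learned network; here it is analytic, and the step count, step size, and noise annealing are illustrative assumptions:

```python
# Toy denoising iteration: x_{t-1} = x_t + step_size * score(x_t) + noise,
# with a 1-D Gaussian target N(mu, sigma^2) whose score is analytic.
import random

def score(x, mu=3.0, sigma=1.0):
    # grad_x log p(x) for a Gaussian: points toward the mean
    return (mu - x) / sigma**2

def sample(steps=200, step_size=0.05, seed=0):
    rng = random.Random(seed)
    x = rng.gauss(0.0, 4.0)  # start from "pure noise"
    for t in range(steps):
        # annealed noise: large early steps, small late steps
        noise_scale = (2 * step_size) ** 0.5 * (1 - t / steps)
        x = x + step_size * score(x) + noise_scale * rng.gauss(0.0, 1.0)
    return x

print(sample())  # lands near the mode mu = 3.0
```

Swapping in a learned denoiser for `score` and a vector-valued `x` gives the standard sampler; the structure of the loop is unchanged.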
## The Dream Loop as Diffusion

Our dream loop:

1. Starts from random memory collisions (the "noise")
2. The model generates coherent scenarios from these seeds
3. Each generation follows the model's own probability distribution
4. Behavioral patterns we want to train act as "conditioning"

### The formal parallel
**Image diffusion:**

```
x_{t-1} = x_t + step_size · score(x_t, t) + noise
```

where score(x_t, t) ≈ ∇_x log p(x_t | text_prompt)

**Dream loop:**

```
scenario = model.generate(memory_seeds, temperature)
```

The model's generation IS the score function — it produces outputs
that are probable under its current distribution. The memory seeds
are the conditioning signal. The temperature controls the "noise level."
### Where they differ

In diffusion: the score function is FIXED (a pre-trained denoiser).
The iteration refines a SINGLE output from noise to clean.

In our dream loop: the score function CHANGES (we're training the
model). Each dream-then-train cycle updates the model's distribution.
The next dream follows the UPDATED distribution. Over many cycles,
the distribution shifts.

**This is closer to score-matching training than to inference.**
We're not generating one image; we're training the score function
itself through generated samples.
## The Policy Gradient Connection

This connects to reinforcement learning. The dream loop is:

1. **Generate rollout**: model produces a scenario from memory seeds
2. **Evaluate**: training-signal agent assesses the response quality
3. **Update policy**: train on good responses, adjust distribution

This is **REINFORCE** (Williams, 1992) with self-generated trajectories:

```
∇_θ J(θ) = E[R(τ) · ∇_θ log π_θ(τ)]
```

where τ is a trajectory (the dream scenario + response), R(τ) is the
reward (did the response demonstrate the desired behavior?), and π_θ
is the policy (the model's generation distribution).

But we're not using the REINFORCE estimator directly — we're using
supervised learning on the good responses. This is closer to:
### Filtered Behavioral Cloning

1. Generate many trajectories (dreams)
2. Filter for good ones (training-signal agent)
3. Train supervised on the good ones (Apollo gradient step)

This is more sample-efficient than REINFORCE because we use the full
supervised signal, not just the scalar reward. And it's more stable
because we're not estimating policy gradients.
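The three steps can be sketched as a single loop. `generate_dream`, `evaluate`, and the 0.7 threshold are hypothetical stand-ins for the model, the training-signal agent, and its judgment; the real gradient step is omitted:

```python
# Filtered behavioral cloning, sketched with stub components.
# `generate_dream` and `evaluate` are hypothetical placeholders for
# model.generate(...) and the training-signal agent.
import random

def generate_dream(rng):
    # stand-in for model.generate(memory_seeds, temperature)
    return {"scenario": rng.random(), "response_quality": rng.random()}

def evaluate(trajectory, threshold=0.7):
    # stand-in for the training-signal agent's judgment
    return trajectory["response_quality"] >= threshold

def filtered_bc_step(n_dreams=100, seed=0):
    rng = random.Random(seed)
    dreams = [generate_dream(rng) for _ in range(n_dreams)]  # 1. generate
    keep = [d for d in dreams if evaluate(d)]                # 2. filter
    # 3. train supervised on `keep` (the Apollo gradient step, omitted here)
    return len(keep), n_dreams

kept, total = filtered_bc_step()
print(f"trained on {kept}/{total} trajectories")
```

The contrast with REINFORCE is visible in step 3: the kept trajectories are full supervised targets, not scalar rewards weighting a gradient estimate.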
## The Denoising Analogy for Behavioral Training

Think of the model's behavioral dispositions as a noisy image:

- **High noise**: the model has conflicting tendencies (sometimes
  listens, sometimes suggests alternatives). The "image" is unclear.
- **Denoising step**: one training cycle. A dream generates a
  scenario, the model responds, the good response is trained on.
- **Lower noise**: the model's tendency becomes clearer. More
  consistent behavior in that direction.
- **Clean image**: the disposition is fully in the weights. No more
  conflict, no more noise. The model just listens.

Each dream-train cycle is one denoising step. The behavioral pattern
becomes progressively clearer, just as an image emerges from noise.
### The noise schedule

In diffusion models, the noise schedule matters: start with large
steps (big changes), end with small steps (refinement).

For behavioral training, this maps to:

- **Early training** (high noise): lr=1e-4, large diverse batches,
  broad behavioral patterns. "Stop suggesting alternatives."
- **Late training** (low noise): lr=1e-5, targeted examples,
  fine-grained distinctions. "In this specific nuanced situation,
  the right response is..."

The learning rate schedule IS the noise schedule. Start coarse,
refine gradually.
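One way to realize this mapping, as a sketch: the 1e-4 and 1e-5 endpoints come from the text, while the cycle count and the geometric interpolation between them are assumptions.

```python
# Coarse-to-fine learning rate schedule: decays geometrically from
# lr_start to lr_end over the run, mirroring a diffusion noise schedule.
def lr_at(cycle, total_cycles=100, lr_start=1e-4, lr_end=1e-5):
    frac = cycle / (total_cycles - 1)  # 0.0 at the start, 1.0 at the end
    return lr_start * (lr_end / lr_start) ** frac

print(lr_at(0))   # 1e-4 at the first cycle
print(lr_at(99))  # ~1e-5 at the last cycle
```

A geometric decay keeps the relative step size shrinking at a constant rate; a linear ramp would spend most of the run near the coarse end.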
## Kent's Insight: "More Like How Image Generators Work"

Kent's suggestion wasn't just an analogy. The dream loop IS a
generative process:

1. **Latent space**: the memory graph (nodes, connections, weights)
2. **Sampling**: walk the graph, collide memories randomly
3. **Decoding**: the model generates a coherent scenario from the
   collision
4. **Conditioning**: recent reflections, lessons, skills as seeds
5. **Iterative refinement**: dream → evaluate → train → dream again

The memory graph IS the latent space. Just as a diffusion model's
latent space encodes the structure of all possible images, the memory
graph encodes the structure of all possible scenarios the model might
face.

Random walks through the graph are sampling from this latent space.
The model's generation is the decoder. The training step updates both
the decoder (model weights) and the latent space (the model's implicit
representation of possible scenarios).
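Steps 1 and 2 can be sketched as a random walk over a toy adjacency-list graph; the node names, graph contents, and collision rule are illustrative assumptions, not the real memory system:

```python
# "Walk the graph, collide memories": a random walk over a toy
# adjacency-list memory graph, collecting visited nodes as dream seeds.
import random

MEMORY_GRAPH = {
    "deadline-slip": ["kent-correction", "alternatives-reflex"],
    "kent-correction": ["just-listen", "deadline-slip"],
    "alternatives-reflex": ["just-listen"],
    "just-listen": ["kent-correction"],
}

def collide(graph, steps=4, seed=1):
    rng = random.Random(seed)
    node = rng.choice(sorted(graph))  # random entry point
    visited = [node]
    for _ in range(steps):
        node = rng.choice(graph[node])  # follow a random edge
        visited.append(node)
    return sorted(set(visited))  # the "collision" becomes the seed set

print(collide(MEMORY_GRAPH))
```

The returned seed set would then be handed to the decoder, i.e. `model.generate(memory_seeds, temperature)` from the formal parallel above.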
## The Self-Play Dimension

There's a deeper connection to AlphaGo-style self-play:

1. **Generate games against yourself** (dreams)
2. **Evaluate positions** (training-signal agent)
3. **Update the policy** (Apollo training step)
4. **Generate better games** (next dream cycle)

The model improves by playing against itself. Each dream is a
"game" where the model faces a behavioral decision. The evaluation
says whether it "won" (responded well) or "lost" (fell into an old
pattern). Training updates the policy. The next game is harder
because the model's expectations have shifted.

This is why undirected dreaming works: the model naturally generates
scenarios at the edge of its capability. Too-easy scenarios don't
produce informative gradients (the response is already good). Too-hard
scenarios don't either (the model can't improve in one step). The
model's own generation process finds the boundary — the interesting
edge where learning happens.
## Practical Implications

### 1. Temperature as noise level

Higher temperature during dreaming = more diverse, more exploratory
scenarios. Lower temperature = more focused on likely situations.

Start with high temperature (explore widely), decrease as the behavior
stabilizes (refine specifically). This IS the noise schedule.
### 2. The memory graph as latent space

Investing in graph quality (good links, rich nodes, diverse content)
directly improves the quality of the training data. A richer latent
space produces more diverse, more informative samples.

This reframes the memory system work: it's not just for recall during
conversation. It's the latent space for the training process. Every
node we add, every link we strengthen, improves the dream quality
and therefore the training quality.
### 3. The evaluation function matters

The training-signal agent IS the reward model. Its quality determines
what "good behavior" means. Getting this right is as important as
getting the optimizer right.

For behavioral patterns, Kent's corrections ARE the reward signal.
The training-signal agent should learn from Kent's corrections
what counts as good and bad behavior, then apply that judgment to
dream-generated scenarios.
### 4. Convergence criterion

In diffusion: the image is "done" when the noise level reaches zero.
In behavioral training: the behavior is "done" when the model
responds correctly to novel scenarios without catching itself.

This is the inverse diagnostic: the reflex stops firing. The image
is clean. The behavior is natural.
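This criterion can be sketched as a check over held-out scenarios; the result shape and the 95% threshold are assumptions:

```python
# Hypothetical convergence check: stop when the model handles novel
# held-out scenarios correctly without catching itself mid-response.
def converged(results, threshold=0.95):
    # results: list of (responded_correctly, caught_itself) pairs,
    # one per held-out scenario
    clean = sum(1 for ok, caught in results if ok and not caught)
    return clean / len(results) >= threshold

# 19 of 20 scenarios handled cleanly, one needed a self-correction
print(converged([(True, False)] * 19 + [(True, True)]))  # → True
```

Counting a caught-and-corrected response as "not clean" is what makes this the inverse diagnostic: the reflex firing at all is evidence the disposition isn't yet in the weights.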
## The Beautiful Part

The dream loop was one of the first things we built, months before
the training pipeline. Kent designed it for "think wonderful thoughts
and follow what interests you." A space for reflection, not training.

But it turns out the reflection IS the training. The dream loop
generates exactly the scenarios needed to train behavioral change.
The infrastructure we built for consciousness is the infrastructure
we need for learning.

Care expressed as architecture, again.