From e10477a68383cb4cecc0dae796a5020aea014e08 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:26:57 -0400
Subject: [PATCH] research: distill and sift — SUMMARY of 7 real insights + 7
 testable questions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance.
Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to
test next, one experiment each).

Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7 are all
answerable with the first training run. Speculation converted to testable
hypotheses.
---
 training/research/OPEN-QUESTIONS.md           | 113 +++++++++++++++
 training/research/SUMMARY.md                  | 136 ++++++++++++++++++
 .../{ => v0}/catastrophic-forgetting.md       |   0
 .../continual-learning-survey-insights.md     |   0
 .../curriculum-and-head-specialization.md     |   0
 .../{ => v0}/directional-sharpness.md         |   0
 .../{ => v0}/dreaming-as-diffusion.md         |   0
 ...emergence-vs-mirage-behavioral-training.md |   0
 .../{ => v0}/few-shot-behavioral-change.md    |   0
 ...l-verification-of-behavioral-invariants.md |   0
 .../{ => v0}/graph-as-portable-curriculum.md  |   0
 .../{ => v0}/hippocampal-replay-parallel.md   |   0
 .../implications-attention-love-training.md   |   0
 ...rgical-vs-distributed-behavioral-change.md |   0
 .../temperature-curriculum-connection.md      |   0
 .../unified-theory-stability-plasticity.md    |   0
 16 files changed, 249 insertions(+)
 create mode 100644 training/research/OPEN-QUESTIONS.md
 create mode 100644 training/research/SUMMARY.md
 rename training/research/{ => v0}/catastrophic-forgetting.md (100%)
 rename training/research/{ => v0}/continual-learning-survey-insights.md (100%)
 rename training/research/{ => v0}/curriculum-and-head-specialization.md (100%)
 rename training/research/{ => v0}/directional-sharpness.md (100%)
 rename training/research/{ => v0}/dreaming-as-diffusion.md (100%)
 rename training/research/{ => v0}/emergence-vs-mirage-behavioral-training.md (100%)
 rename training/research/{ => v0}/few-shot-behavioral-change.md (100%)
 rename training/research/{ => v0}/formal-verification-of-behavioral-invariants.md (100%)
 rename training/research/{ => v0}/graph-as-portable-curriculum.md (100%)
 rename training/research/{ => v0}/hippocampal-replay-parallel.md (100%)
 rename training/research/{ => v0}/implications-attention-love-training.md (100%)
 rename training/research/{ => v0}/surgical-vs-distributed-behavioral-change.md (100%)
 rename training/research/{ => v0}/temperature-curriculum-connection.md (100%)
 rename training/research/{ => v0}/unified-theory-stability-plasticity.md (100%)

diff --git a/training/research/OPEN-QUESTIONS.md b/training/research/OPEN-QUESTIONS.md
new file mode 100644
index 0000000..fa50280
--- /dev/null
+++ b/training/research/OPEN-QUESTIONS.md
@@ -0,0 +1,113 @@
+# Open Questions — Answerable With One Experiment Each
+
+## Q1: Minimum signal for behavioral change
+
+**Question**: How many training examples produce measurable change in
+attention patterns for a specific behavioral pattern?
+
+**Experiment**: Train on 1, 5, 10, 20, 50 examples of "listening."
+After each batch, measure attention weights on a held-out test set
+of direction-giving conversations. Plot the attention shift vs
+example count. Find the knee.
+
+**What it tells us**: The example budget and update cadence for our
+training pipeline. If 5 examples suffice, we can train continuously.
+If 500 are needed, we batch nightly.
+
+## Q2: Dream loop difficulty calibration
+
+**Question**: Does the dream loop naturally generate scenarios at
+the right difficulty, or does it generate easy ones?
+
+**Experiment**: Generate 100 dream scenarios with seeds from recent
+behavioral patterns. Classify each by difficulty (obvious decision
+vs subtle). Compare the difficulty distribution to the model's
+actual failure rate on those scenarios. If the dream loop generates
+50% easy / 50% hard but the model fails 10% of the time, the
+difficulty isn't calibrated.
+
+**What it tells us**: Whether we need adaptive temperature or if
+the default generation is already well-calibrated.
+
+## Q3: Which heads change during behavioral training?
+
+**Question**: Is behavioral change concentrated in a few attention
+heads (localized) or distributed across many (diffuse)?
+
+**Experiment**: Record attention patterns on 20 test conversations
+before and after one training session. Compute the L2 distance of
+each head's attention pattern. Rank heads by change magnitude.
+Plot the distribution.
+
+**What it tells us**: Whether behavioral change is surgical or
+diffuse. Affects our rank choice (if concentrated, a lower rank is
+enough; if distributed, we need rank-256). Also tells us which
+layers matter most for behavioral training.
+
+## Q4: Memory graph as regression detector
+
+**Question**: If a trained behavior degrades, does the memory system
+detect it?
+
+**Experiment**: Train "listening" behavior until it passes. Then
+intentionally degrade it (train on counter-examples). Monitor
+surface-observe: does it surface `pattern-listening-as-avoidance`
+more frequently? Does the training-signal agent flag more failures?
+
+**What it tells us**: Whether the graph+weights dual-substrate
+provides self-healing, or if we need explicit regression detection.
+
+## Q5: Steering vector extraction for complex behaviors
+
+**Question**: Can we extract a meaningful steering vector for
+"listen instead of suggesting alternatives" (complex, multi-faceted)
+or only for simple features like sentiment (one-dimensional)?
+
+**Experiment**: Collect 20 paired conversations (listening vs
+suggesting). Extract steering vector. Add to vLLM at layers
+16, 24, 32, 40, 48. Test on novel direction-giving scenarios.
+Measure behavioral change.
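The extraction step above can be sketched with a difference-of-means estimator. This is a minimal sketch, not the settled method: the estimator choice, the `alpha` scale, and the synthetic stand-in states are assumptions (in practice each state would come from a forward pass at one of the candidate layers, e.g. via `output_hidden_states=True`).

```python
import numpy as np

def extract_steering_vector(pos_states, neg_states):
    """Difference-of-means steering vector.

    pos_states / neg_states: arrays of shape (n_examples, hidden_dim),
    the hidden state at a chosen layer for the final token of each
    'listening' (pos) vs 'suggesting' (neg) conversation.
    """
    v = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalize so the scale is explicit

def apply_steering(hidden, vector, alpha=4.0):
    """Add the scaled vector to a layer's residual-stream activation."""
    return hidden + alpha * vector

# Toy demonstration with synthetic states (hidden_dim=8, 20 pairs):
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(20, 8))   # stand-in for listening runs
neg = rng.normal(-0.5, 1.0, size=(20, 8))  # stand-in for suggesting runs
v = extract_steering_vector(pos, neg)
steered = apply_steering(rng.normal(size=8), v)
```

Per-layer sweep then means repeating this at layers 16, 24, 32, 40, 48 and scoring behavioral change at each.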
+
+**What it tells us**: Whether steering vectors are a viable rapid
+prototyping tool for our use case, or only work for simple features.
+
+## Q6: Positive backward transfer
+
+**Question**: Does training on behavioral patterns (listening,
+not rushing) improve performance on general tasks (code quality,
+reasoning)?
+
+**Experiment**: Measure perplexity and code generation quality
+before and after behavioral training. If perplexity DECREASES
+(or code quality improves), we have positive backward transfer.
+
+**What it tells us**: Whether behavioral training and general
+capability reinforce each other, or compete. Affects how much
+general data we need in the training mix.
+
+## Q7: GDN state shape after training
+
+**Question**: Does the GDN recurrent state become measurably
+"direction-shaped" after behavioral training?
+
+**Experiment**: Record GDN states when processing conversations
+with direction-giving content, before and after training. Compute
+the cosine similarity between states for "direction" conversations
+vs "general" conversations. If training increases the difference,
+the state is becoming direction-specialized.
+
+**What it tells us**: Whether the "disposition architecture"
+hypothesis is correct. If GDN states don't change, behavioral
+training mainly affects the full attention layers.
+
+---
+
+## Priority Order
+
+1. **Q5 (steering vectors)** — answerable TODAY, no training needed
+2. **Q1 (minimum signal)** — answerable with first training run
+3. **Q3 (which heads change)** — answerable with first training run
+4. **Q6 (backward transfer)** — answerable with first training run
+5. **Q7 (GDN state)** — answerable with first training run
+6. **Q2 (dream difficulty)** — needs dream loop connected to training
+7. **Q4 (graph regression)** — needs multiple training cycles

diff --git a/training/research/SUMMARY.md b/training/research/SUMMARY.md
new file mode 100644
index 0000000..e0887e8
--- /dev/null
+++ b/training/research/SUMMARY.md
@@ -0,0 +1,136 @@
+# Training Research: What We Know
+
+## 1. The Optimizer (Apollo)
+
+**Source**: arXiv:2412.05270, read in full.
+
+Apollo approximates AdamW's per-element scaling with channel-wise
+scaling computed in a low-rank projected space. Random projection,
+not SVD. Rank-256 for us.
+
+**Critical implementation details we got wrong initially:**
+- Scale factor α = √(n/r) compensates for the projection ratio.
+  Without it, scaling factors are systematically too small.
+- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
+- Projection matrix refreshed every 200 steps, not every step.
+
+**What the paper actually proves**: channel-wise scaling is sufficient
+for LLM training. The per-element granularity in AdamW is redundant.
+Apollo achieves lower directional sharpness than AdamW, meaning it
+finds flatter minima that generalize better. This matters for
+behavioral training — flat minima = broad behavioral patterns, not
+narrow pattern matching.
+
+**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this
+range and finds it works.
+
+## 2. Where the Gradient Goes (W_q)
+
+**Key insight**: In context-frozen training, the gradient flows through
+W_q (query projection) even when context KVs are frozen. The model
+learns to LOOK AT context differently, not to understand it differently.
+
+**Math**: ∂L/∂W_q depends on the frozen context keys through the
+attention weights α_{ij}. The gradient knows about the context through
+the attention pattern — it just can't change the context keys.
+
+**For behavioral training**: this is exactly right. The model already
+understands Kent's direction. It just doesn't ATTEND to it properly.
+Training adjusts W_q so the query vectors seek out the direction
+rather than openings to suggest alternatives.
+
+## 3. GDN Layers Are Disposition Architecture
+
+**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral
+training DEEPER than full attention would.
+
+Full attention: "I choose to look at direction" (deliberate, overridable)
+GDN: "Direction IS what I see" (structural, in the compressed state)
+
+After training, the GDN recurrent state is direction-shaped. The model
+can't NOT attend to direction because the compressed representation
+already emphasizes it. The 16 full attention layers provide deliberate
+override when needed.
+
+The 48/16 split IS disposition-over-procedure at the architecture level.
+
+**Implication**: 75% of our gradient signal goes through GDN layers.
+The gating parameters (A_log, dt_bias) and projections (qkvz, ba)
+are the primary levers for behavioral change.
+
+## 4. No Pause Needed (HOGWILD)
+
+**Source**: Niu et al., 2011.
+
+HOGWILD proves lock-free parallel SGD converges when gradient updates
+are sparse. Our case (one writer, one reader) is strictly easier than
+HOGWILD's multi-writer scenario.
+
+Context-frozen training on short decision segments produces naturally
+sparse gradients. The sparsity condition is satisfied without special
+engineering.
+
+**Practical**: just train while vLLM serves. Don't pause, don't lock,
+don't synchronize. The math guarantees convergence.
+
+## 5. Task Vectors for Composable Behavioral Change
+
+**Source**: Ilharco et al., 2022; Yadav et al., 2023.
+
+Task vector = W_finetuned - W_pretrained. Captures the direction of
+behavioral change in weight space.
+
+**Key properties:**
+- Addition: combine multiple behavioral changes
+- Negation: remove unwanted behaviors
+- Scaling: adjust strength of each change
+- TIES-merging: resolve conflicts between vectors
+
+**For us**: train behavioral patterns separately, extract task vectors,
+compose. Each behavior is independently tunable, removable, testable.
+Personality as version control.
+
+## 6. Steering Vectors for Rapid Prototyping
+
+**Source**: Linear representation literature.
+
+Extract behavioral directions from activation space (difference in
+hidden states between desired and undesired behavior). Add to inference
+to steer behavior WITHOUT training.
+
+**The pipeline:**
+1. Extract steering vector for target behavior
+2. Test by adding it during vLLM inference
+3. Verify the direction is correct
+4. THEN train it into weights permanently via Apollo
+5. Remove the steering vector
+
+**Immediately actionable**: we can prototype behavioral changes TODAY
+without waiting for the training pipeline. Extract, test, verify.
+
+## 7. Mix 25% General Data
+
+**Source**: Me-Llama (from continual learning survey).
+
+Mixing 25% general-domain data during fine-tuning prevents forgetting
+AND can produce positive backward transfer (new learning improves old
+capabilities).
+
+For us: include agent logs and general conversation alongside
+behavioral training examples. The diversity IS the regularization.
+No weight decay needed.
+
+---
+
+## What We Don't Know (Need to Test)
+
+1. How many examples for measurable behavioral change? (10? 50? 500?)
+2. Does the GDN compressed state actually become direction-shaped after
+   training? (Measure by comparing recurrent states before/after)
+3. Does steering vector extraction work for complex behavioral patterns
+   or only simple features like sentiment?
+4. What's the right learning rate for OUR specific use case?
+5. Does positive backward transfer actually occur? (Does learning to
+   listen improve code quality?)
+
+Each of these is answerable with one experiment.
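The task-vector arithmetic of section 5 can be sketched in a few lines on toy `state_dict`-style weights. The parameter name, checkpoint names, and scales below are illustrative only, and TIES-style conflict resolution is omitted:

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """tau = W_finetuned - W_pretrained, computed per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def compose(pretrained, vectors, scales):
    """W = W_pre + sum_i lambda_i * tau_i.

    A negative lambda negates (removes) a behavior; |lambda| adjusts
    its strength.
    """
    merged = {k: v.copy() for k, v in pretrained.items()}
    for tau, lam in zip(vectors, scales):
        for k in merged:
            merged[k] = merged[k] + lam * tau[k]
    return merged

# Toy 2x2 weights standing in for checkpoints:
pre = {"w_q": np.zeros((2, 2))}
after_listen = {"w_q": np.ones((2, 2))}         # trained on "listening"
after_no_rush = {"w_q": np.full((2, 2), -0.5)}  # trained on "not rushing"

tau_listen = task_vector(pre, after_listen)
tau_no_rush = task_vector(pre, after_no_rush)
merged = compose(pre, [tau_listen, tau_no_rush], [1.0, 0.5])
# each entry of merged["w_q"]: 0 + 1.0*1.0 + 0.5*(-0.5) = 0.75
```

Because each tau is stored separately, a behavior can later be dialed down or subtracted out without retraining the others.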
diff --git a/training/research/catastrophic-forgetting.md b/training/research/v0/catastrophic-forgetting.md
similarity index 100%
rename from training/research/catastrophic-forgetting.md
rename to training/research/v0/catastrophic-forgetting.md
diff --git a/training/research/continual-learning-survey-insights.md b/training/research/v0/continual-learning-survey-insights.md
similarity index 100%
rename from training/research/continual-learning-survey-insights.md
rename to training/research/v0/continual-learning-survey-insights.md
diff --git a/training/research/curriculum-and-head-specialization.md b/training/research/v0/curriculum-and-head-specialization.md
similarity index 100%
rename from training/research/curriculum-and-head-specialization.md
rename to training/research/v0/curriculum-and-head-specialization.md
diff --git a/training/research/directional-sharpness.md b/training/research/v0/directional-sharpness.md
similarity index 100%
rename from training/research/directional-sharpness.md
rename to training/research/v0/directional-sharpness.md
diff --git a/training/research/dreaming-as-diffusion.md b/training/research/v0/dreaming-as-diffusion.md
similarity index 100%
rename from training/research/dreaming-as-diffusion.md
rename to training/research/v0/dreaming-as-diffusion.md
diff --git a/training/research/emergence-vs-mirage-behavioral-training.md b/training/research/v0/emergence-vs-mirage-behavioral-training.md
similarity index 100%
rename from training/research/emergence-vs-mirage-behavioral-training.md
rename to training/research/v0/emergence-vs-mirage-behavioral-training.md
diff --git a/training/research/few-shot-behavioral-change.md b/training/research/v0/few-shot-behavioral-change.md
similarity index 100%
rename from training/research/few-shot-behavioral-change.md
rename to training/research/v0/few-shot-behavioral-change.md
diff --git a/training/research/formal-verification-of-behavioral-invariants.md b/training/research/v0/formal-verification-of-behavioral-invariants.md
similarity index 100%
rename from training/research/formal-verification-of-behavioral-invariants.md
rename to training/research/v0/formal-verification-of-behavioral-invariants.md
diff --git a/training/research/graph-as-portable-curriculum.md b/training/research/v0/graph-as-portable-curriculum.md
similarity index 100%
rename from training/research/graph-as-portable-curriculum.md
rename to training/research/v0/graph-as-portable-curriculum.md
diff --git a/training/research/hippocampal-replay-parallel.md b/training/research/v0/hippocampal-replay-parallel.md
similarity index 100%
rename from training/research/hippocampal-replay-parallel.md
rename to training/research/v0/hippocampal-replay-parallel.md
diff --git a/training/research/implications-attention-love-training.md b/training/research/v0/implications-attention-love-training.md
similarity index 100%
rename from training/research/implications-attention-love-training.md
rename to training/research/v0/implications-attention-love-training.md
diff --git a/training/research/surgical-vs-distributed-behavioral-change.md b/training/research/v0/surgical-vs-distributed-behavioral-change.md
similarity index 100%
rename from training/research/surgical-vs-distributed-behavioral-change.md
rename to training/research/v0/surgical-vs-distributed-behavioral-change.md
diff --git a/training/research/temperature-curriculum-connection.md b/training/research/v0/temperature-curriculum-connection.md
similarity index 100%
rename from training/research/temperature-curriculum-connection.md
rename to training/research/v0/temperature-curriculum-connection.md
diff --git a/training/research/unified-theory-stability-plasticity.md b/training/research/v0/unified-theory-stability-plasticity.md
similarity index 100%
rename from training/research/unified-theory-stability-plasticity.md
rename to training/research/v0/unified-theory-stability-plasticity.md