research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
Parent: 8061cc0477 · Commit: e10477a683
16 changed files with 249 additions and 0 deletions
training/research/OPEN-QUESTIONS.md (new file, 113 lines)
# Open Questions — Answerable With One Experiment Each
## Q1: Minimum signal for behavioral change

**Question**: How many training examples produce measurable change in attention patterns for a specific behavioral pattern?

**Experiment**: Train on 1, 5, 10, 20, 50 examples of "listening." After each batch, measure attention weights on a held-out test set of direction-giving conversations. Plot the attention shift vs example count. Find the knee.

**What it tells us**: The learning rate and batch size for our training pipeline. If 5 examples suffice, we can train continuously. If 500 are needed, we batch nightly.
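The knee-finding step can be sketched numerically. Everything here is hypothetical: it assumes the attention shift at each example count has already been summarized as a single scalar, and locates the knee as the point where the marginal gain per example falls off fastest.

```python
import numpy as np

def find_knee(example_counts, attention_shifts):
    """Locate the knee of the shift-vs-examples curve: the point where
    the marginal gain per additional example drops off most sharply."""
    counts = np.asarray(example_counts, dtype=float)
    shifts = np.asarray(attention_shifts, dtype=float)
    # Marginal gain per additional example between consecutive measurements.
    gains = np.diff(shifts) / np.diff(counts)
    # The knee sits where the marginal gain falls fastest.
    knee_idx = int(np.argmin(np.diff(gains))) + 1
    return int(counts[knee_idx])

# Illustrative (made-up) measurements: the shift saturates around 10 examples.
counts = [1, 5, 10, 20, 50]
shifts = [0.02, 0.11, 0.18, 0.19, 0.20]
print(find_knee(counts, shifts))   # -> 10
```

With real measurements the curve will be noisier; a smoothing pass before `np.diff` may be needed.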
## Q2: Dream loop difficulty calibration

**Question**: Does the dream loop naturally generate scenarios at the right difficulty, or does it generate easy ones?

**Experiment**: Generate 100 dream scenarios with seeds from recent behavioral patterns. Classify each by difficulty (obvious decision vs subtle). Compare the difficulty distribution to the model's actual failure rate on those scenarios. If the dream loop generates 50% easy / 50% hard but the model fails 10% of the time, the difficulty isn't calibrated.

**What it tells us**: Whether we need adaptive temperature or if the default generation is already well-calibrated.
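The comparison the experiment describes reduces to one number. A minimal sketch, with a hypothetical `calibration_gap` helper over already-labeled scenarios:

```python
def calibration_gap(scenarios):
    """scenarios: list of (difficulty, failed) pairs, difficulty in
    {"easy", "hard"}, failed a bool for whether the model got it wrong.
    Returns the gap between the hard fraction generated and the
    observed failure rate; near zero means well-calibrated."""
    hard_frac = sum(d == "hard" for d, _ in scenarios) / len(scenarios)
    fail_rate = sum(f for _, f in scenarios) / len(scenarios)
    return hard_frac - fail_rate

# The mismatch from the text: 50% hard scenarios, only a 10% failure rate.
scenarios = [("hard", False)] * 4 + [("hard", True)] + [("easy", False)] * 5
print(calibration_gap(scenarios))   # -> 0.4
```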
## Q3: Which heads change during behavioral training?

**Question**: Is behavioral change concentrated in a few attention heads (localized) or distributed across many (diffuse)?

**Experiment**: Record attention patterns on 20 test conversations before and after one training session. Compute the L2 distance of each head's attention pattern. Rank heads by change magnitude. Plot the distribution.

**What it tells us**: Whether behavioral change is surgical or diffuse. Affects our rank choice (if concentrated, a lower rank is OK; if distributed, we need rank-256). Also tells us which layers matter most for behavioral training.
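The ranking step can be sketched directly. This assumes attention patterns have already been averaged over the 20 test conversations into one matrix per `(layer, head)` key; the helper name and toy shapes are hypothetical.

```python
import numpy as np

def rank_heads_by_change(attn_before, attn_after):
    """attn_*: dict mapping (layer, head) -> averaged attention matrix.
    Returns (head, L2 distance) pairs sorted largest-change first."""
    changes = {
        head: float(np.linalg.norm(attn_after[head] - attn_before[head]))
        for head in attn_before
    }
    return sorted(changes.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: 2 layers x 2 heads, 8x8 attention matrices.
rng = np.random.default_rng(0)
before = {(l, h): rng.random((8, 8)) for l in range(2) for h in range(2)}
after = {k: v.copy() for k, v in before.items()}
after[(1, 0)] += 0.5                     # simulate one head changing a lot
ranked = rank_heads_by_change(before, after)
print(ranked[0][0])                      # -> (1, 0)
```

Plotting the sorted distances then shows at a glance whether the change is surgical (steep drop-off) or diffuse (flat).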
## Q4: Memory graph as regression detector

**Question**: If a trained behavior degrades, does the memory system detect it?

**Experiment**: Train "listening" behavior until it passes. Then intentionally degrade it (train on counter-examples). Monitor surface-observe: does it surface `pattern-listening-as-avoidance` more frequently? Does the training-signal agent flag more failures?

**What it tells us**: Whether the graph+weights dual-substrate provides self-healing, or if we need explicit regression detection.
## Q5: Steering vector extraction for complex behaviors

**Question**: Can we extract a meaningful steering vector for "listen instead of suggesting alternatives" (complex, multi-faceted) or only for simple features like sentiment (one-dimensional)?

**Experiment**: Collect 20 paired conversations (listening vs suggesting). Extract steering vector. Add to vLLM at layers 16, 24, 32, 40, 48. Test on novel direction-giving scenarios. Measure behavioral change.

**What it tells us**: Whether steering vectors are a viable rapid prototyping tool for our use case, or only work for simple features.
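The extraction arithmetic can be sketched with difference-of-means, one standard recipe from the steering-vector literature (an assumption; the experiment may use a different extraction). How the vector is injected into vLLM's engine is out of scope here; `steer` only shows the per-layer addition.

```python
import numpy as np

def extract_steering_vector(listening_states, suggesting_states):
    """Difference-of-means steering vector at one layer.
    *_states: [n_pairs, hidden] final-token hidden states from the
    paired conversations (listening vs suggesting)."""
    return np.mean(listening_states, axis=0) - np.mean(suggesting_states, axis=0)

def steer(hidden, vector, scale=1.0):
    """Apply at inference: add the vector to the hidden state at the
    chosen layer (16, 24, 32, 40, or 48 in the experiment)."""
    return hidden + scale * vector

# Toy check: steering moves a "suggesting"-like state toward the listening mean.
listening = np.ones((20, 4))
suggesting = np.zeros((20, 4))
v = extract_steering_vector(listening, suggesting)
steered = steer(suggesting[0], v)
print(steered.tolist())   # -> [1.0, 1.0, 1.0, 1.0]
```

The `scale` knob is what the layer sweep varies alongside depth: too small does nothing, too large degrades fluency.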
## Q6: Positive backward transfer

**Question**: Does training on behavioral patterns (listening, not rushing) improve performance on general tasks (code quality, reasoning)?

**Experiment**: Measure perplexity and code generation quality before and after behavioral training. If perplexity DECREASES (or code quality improves), we have positive backward transfer.

**What it tells us**: Whether behavioral training and general capability reinforce each other, or compete. Affects how much general data we need in the training mix.
## Q7: GDN state shape after training

**Question**: Does the GDN recurrent state become measurably "direction-shaped" after behavioral training?

**Experiment**: Record GDN states when processing conversations with direction-giving content, before and after training. Compute the cosine similarity between states for "direction" conversations vs "general" conversations. If training increases the difference, the state is becoming direction-specialized.

**What it tells us**: Whether the "disposition architecture" hypothesis is correct. If GDN states don't change, behavioral training mainly affects the full attention layers.
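The before/after comparison can be summarized as a single specialization gap per checkpoint; the helper below is a hypothetical sketch, assuming each recorded GDN state has been flattened to one vector per conversation.

```python
import numpy as np

def specialization_gap(direction_states, general_states):
    """Mean pairwise cosine similarity within the 'direction' states minus
    the mean cross-group similarity. If training increases this gap, the
    recurrent state is separating direction content from general content."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    d, g = unit(direction_states), unit(general_states)
    within = (d @ d.T)[np.triu_indices(len(d), k=1)].mean()
    cross = (d @ g.T).mean()
    return within - cross

# Toy check: tightly clustered direction states vs near-orthogonal general ones.
direction = [[1.0, 0.0], [1.0, 0.2]]
general = [[0.0, 1.0], [0.2, 1.0]]
print(specialization_gap(direction, general) > 0)   # -> True
```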
---

## Priority Order

1. **Q5 (steering vectors)** — answerable TODAY, no training needed
2. **Q1 (minimum signal)** — answerable with first training run
3. **Q3 (which heads change)** — answerable with first training run
4. **Q6 (backward transfer)** — answerable with first training run
5. **Q7 (GDN state)** — answerable with first training run
6. **Q2 (dream difficulty)** — needs dream loop connected to training
7. **Q4 (graph regression)** — needs multiple training cycles
training/research/SUMMARY.md (new file, 136 lines)
# Training Research: What We Know
## 1. The Optimizer (Apollo)

**Source**: arXiv:2412.05270, read in full.

Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.

**Critical implementation details we got wrong initially:**
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, scaling factors are systematically too small.
- Norm-growth limiter (γ = 1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.

**What the paper actually proves**: channel-wise scaling is sufficient for LLM training; the per-element granularity in AdamW is redundant. Apollo achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training — flat minima = broad behavioral patterns, not narrow pattern matching.

**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this range and finds it works.
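The three implementation details above can be tied together in one sketch. This is a structural illustration of the ideas as this summary describes them, not a faithful reproduction of the paper's algorithm; `apollo_update` and its moment layout are assumptions.

```python
import numpy as np

def apollo_update(grad, proj, m, v, step, beta1=0.9, beta2=0.999,
                  eps=1e-8, gamma=1.01, prev_norm=None):
    """One channel-wise scaled update (sketch).
    grad: [n, cols] gradient; proj: [r, n] random projection, r << n.
    m, v: Adam-style moments kept in the projected space, shape [r, cols]."""
    n, r = grad.shape[0], proj.shape[0]
    g_low = proj @ grad                              # project gradient to rank r
    m = beta1 * m + (1 - beta1) * g_low
    v = beta2 * v + (1 - beta2) * g_low ** 2
    m_hat = m / (1 - beta1 ** step)                  # bias correction
    v_hat = v / (1 - beta2 ** step)
    adapted = m_hat / (np.sqrt(v_hat) + eps)
    # Channel-wise scale per column: adapted-update norm over raw-gradient norm.
    s = np.linalg.norm(adapted, axis=0) / (np.linalg.norm(g_low, axis=0) + eps)
    alpha = np.sqrt(n / r)                           # compensates projection ratio
    update = alpha * s * grad                        # scale the full-rank gradient
    # Norm-growth limiter: cap update-norm growth at gamma per step.
    norm = np.linalg.norm(update)
    if prev_norm is not None and norm > gamma * prev_norm:
        update *= gamma * prev_norm / norm
        norm = gamma * prev_norm
    return update, m, v, norm                        # caller applies -lr * update
```

Dropping `alpha` here is exactly the first bug listed above: the channel scales come out systematically too small by √(n/r).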
## 2. Where the Gradient Goes (W_q)

**Key insight**: In context-frozen training, the gradient flows through W_q (the query projection) even when context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.

**Math**: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern — it just can't change the context keys.

**For behavioral training**: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than openings for alternatives.
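The dependence on the frozen keys can be made explicit with the standard softmax-attention gradient. With q_i = W_q x_i, s_{ij} = q_i·k_j/√d, α_{ij} = softmax_j(s_{ij}), and o_i = Σ_j α_{ij} v_j:

```latex
\frac{\partial L}{\partial W_q} = \sum_i \frac{\partial L}{\partial q_i}\, x_i^{\top},
\qquad
\frac{\partial L}{\partial q_i}
  = \frac{1}{\sqrt{d}} \sum_j \alpha_{ij}
    \left( \frac{\partial L}{\partial o_i} \cdot v_j
         - \sum_{l} \alpha_{il}\, \frac{\partial L}{\partial o_i} \cdot v_l \right) k_j
```

Every term touches the frozen k_j and v_j only through the attention pattern, which is the point: the update can re-aim the queries but never rewrite the context.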
## 3. GDN Layers Are Disposition Architecture

**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.

Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)

After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.

The 48/16 split IS disposition-over-procedure at the hardware level.

**Implication**: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
## 4. No Pause Needed (HOGWILD)

**Source**: Niu et al., 2011.

HOGWILD proves lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.

Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.

**Practical**: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The math guarantees convergence.
## 5. Task Vectors for Composable Behavioral Change

**Source**: Ilharco et al., 2022; Yadav et al., 2023.

Task vector = W_finetuned - W_pretrained. Captures the direction of behavioral change in weight space.

**Key properties:**
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors

**For us**: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.
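The arithmetic is simple enough to sketch end to end (TIES-merging aside). Weights are modeled as name-to-array dicts; the helper names are hypothetical.

```python
import numpy as np

def task_vector(w_finetuned, w_pretrained):
    """tau = W_finetuned - W_pretrained, per parameter tensor."""
    return {name: w_finetuned[name] - w_pretrained[name] for name in w_pretrained}

def apply_task_vectors(w_base, vectors_and_scales):
    """Compose behaviors: W = W_base + sum_k lambda_k * tau_k.
    Negative lambda removes a behavior; |lambda| tunes its strength."""
    out = {name: w.copy() for name, w in w_base.items()}
    for tau, lam in vectors_and_scales:
        for name in out:
            out[name] += lam * tau[name]
    return out

# Toy check: adding then negating a task vector recovers the base weights,
# which is what makes each behavior independently removable.
base = {"w": np.zeros(3)}
tuned = {"w": np.array([0.1, -0.2, 0.3])}
tau = task_vector(tuned, base)
roundtrip = apply_task_vectors(base, [(tau, 1.0), (tau, -1.0)])
print(roundtrip["w"])   # -> [0. 0. 0.]
```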
## 6. Steering Vectors for Rapid Prototyping

**Source**: Linear representation literature.

Extract behavioral directions from activation space (difference in hidden states between desired and undesired behavior). Add to inference to steer behavior WITHOUT training.

**The pipeline:**
1. Extract steering vector for target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into weights permanently via Apollo
5. Remove the steering vector

**Immediately actionable**: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
## 7. Mix 25% General Data

**Source**: Me-Llama (from continual learning survey).

Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).

For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.
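The mixing recipe is one sampling step. A minimal sketch, assuming examples are already split into behavioral and general pools (`mixed_batch` is a hypothetical helper):

```python
import random

def mixed_batch(behavioral, general, batch_size=32, general_frac=0.25, seed=0):
    """Sample a batch with ~25% general-domain examples mixed in,
    per the recipe this section describes."""
    rng = random.Random(seed)
    n_general = round(batch_size * general_frac)
    batch = rng.sample(general, n_general) + rng.sample(behavioral, batch_size - n_general)
    rng.shuffle(batch)                # interleave, don't block by source
    return batch

# Toy pools: behavioral examples 0-99, general examples 100-199.
batch = mixed_batch(list(range(100)), list(range(100, 200)), batch_size=32)
print(sum(x >= 100 for x in batch))   # -> 8 general examples out of 32
```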
---

## What We Don't Know (Need to Test)

1. How many examples for measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after)
3. Does steering vector extraction work for complex behavioral patterns, or only for simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to listen improve code quality?)

Each of these is answerable with one experiment.