research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
This commit is contained in:
parent 8061cc0477
commit e10477a683
16 changed files with 249 additions and 0 deletions
136 training/research/SUMMARY.md Normal file
@@ -0,0 +1,136 @@
# Training Research: What We Know
## 1. The Optimizer (Apollo)

**Source**: arXiv:2412.05270, read in full.

Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.

**Critical implementation details we got wrong initially:**
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, scaling factors are systematically too small.
- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.

**What the paper actually shows**: channel-wise scaling is sufficient for LLM training; the per-element granularity in AdamW is redundant. Apollo also achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training: flat minima mean broad behavioral patterns, not narrow pattern matching.

**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this range and finds it works.
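The mechanics above can be put together in a short sketch. This is an illustrative NumPy rendering of the update rule as we understand it, not the reference implementation: `apollo_step`, the state-dict layout, and resetting the moments when the projection is refreshed are all our own choices, and where exactly α = √(n/r) enters is an assumption.

```python
import numpy as np

def apollo_step(W, G, state, lr=1e-4, rank=256, beta1=0.9, beta2=0.999,
                eps=1e-8, gamma=1.01, refresh_every=200):
    """One channel-wise scaled update in the spirit of Apollo (arXiv:2412.05270).

    Sketch only: Adam moments live on a rank-r random projection of the
    gradient, per-channel scale factors are read off that projected space,
    and alpha = sqrt(n/r) compensates for the projection ratio.
    """
    n, m = G.shape
    r = min(rank, n)
    t = state.setdefault("t", 0)
    # Random projection (not SVD), refreshed every `refresh_every` steps.
    if t % refresh_every == 0 or "P" not in state:
        state["P"] = np.random.randn(r, n) / np.sqrt(r)
        state["m"] = np.zeros((r, m))   # assumption: moments reset on refresh
        state["v"] = np.zeros((r, m))
    state["t"] = t + 1
    R = state["P"] @ G                                    # projected gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * R     # first moment
    state["v"] = beta2 * state["v"] + (1 - beta2) * R**2  # second moment
    adam_dir = state["m"] / (np.sqrt(state["v"]) + eps)
    # Channel-wise scale: how much Adam rescales each column, times alpha.
    alpha = np.sqrt(n / r)
    scale = alpha * np.linalg.norm(adam_dir, axis=0) / (np.linalg.norm(R, axis=0) + eps)
    update = G * scale                                    # scaled raw gradient
    # Norm-growth limiter: cap update-norm growth at gamma per step.
    norm = np.linalg.norm(update)
    prev = state.get("prev_norm")
    if prev is not None and norm > gamma * prev:
        update *= gamma * prev / norm
        norm = gamma * prev
    state["prev_norm"] = norm
    return W - lr * update
```

Note that the memory cost is the point: moments are (r × m) instead of (n × m), while the applied update stays full-rank because only the raw gradient, rescaled per channel, touches the weights.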
## 2. Where the Gradient Goes (W_q)

**Key insight**: In context-frozen training, the gradient flows through W_q (query projection) even when the context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.

**Math**: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern; it just can't change the context keys.

**For behavioral training**: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than the space of alternatives.
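The claim is easy to check on a toy scale (all names here are ours, not from any paper): single-query attention over frozen K and V, with ∂L/∂W_q taken by finite differences. The gradient is nonzero, and changing the frozen keys changes it, which is exactly "the gradient knows about the context through the attention pattern."

```python
import numpy as np

def attn_loss(W_q, x, K, V):
    """Scalar loss through single-query attention; K and V are frozen context."""
    q = x @ W_q                              # only W_q is trainable
    scores = q @ K.T / np.sqrt(K.shape[1])
    a = np.exp(scores - scores.max())
    a /= a.sum()                             # attention weights alpha_ij
    return float((a @ V).sum())

def grad_Wq(W_q, x, K, V, h=1e-5):
    """Finite-difference dL/dW_q. Nonzero even though K, V never receive updates."""
    g = np.zeros_like(W_q)
    for i in range(W_q.shape[0]):
        for j in range(W_q.shape[1]):
            Wp = W_q.copy(); Wp[i, j] += h
            Wm = W_q.copy(); Wm[i, j] -= h
            g[i, j] = (attn_loss(Wp, x, K, V) - attn_loss(Wm, x, K, V)) / (2 * h)
    return g
```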
## 3. GDN Layers Are Disposition Architecture

**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.

Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)

After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.

The 48/16 split IS disposition-over-procedure at the architecture level.

**Implication**: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
## 4. No Pause Needed (HOGWILD)

**Source**: Niu et al., 2011.

HOGWILD proves that lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.

Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.

**Practical**: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The math guarantees convergence as long as the updates stay sparse.
## 5. Task Vectors for Composable Behavioral Change

**Source**: Ilharco et al., 2022; Yadav et al., 2023.

Task vector = W_finetuned - W_pretrained. Captures the direction of behavioral change in weight space.

**Key properties:**
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors

**For us**: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.
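The first three properties are just elementwise arithmetic on state dicts. A minimal sketch with our own helper names (TIES-style conflict resolution is not shown):

```python
import numpy as np

def task_vector(w_finetuned, w_pretrained):
    """tau = W_finetuned - W_pretrained, per parameter tensor."""
    return {k: w_finetuned[k] - w_pretrained[k] for k in w_pretrained}

def compose(w_pretrained, scaled_vectors):
    """Apply sum of lambda_i * tau_i.

    Addition (lambda > 0), negation (lambda < 0), and strength
    adjustment (|lambda|) are all the same operation.
    """
    out = {k: v.copy() for k, v in w_pretrained.items()}
    for tau, lam in scaled_vectors:
        for k in out:
            out[k] += lam * tau[k]
    return out
```

For example, `compose(w_pre, [(tau_listen, 1.0), (tau_hedge, -0.5)])` would add one behavior at full strength while partially removing another (the tau names are hypothetical).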
## 6. Steering Vectors for Rapid Prototyping

**Source**: Linear representation literature.

Extract behavioral directions from activation space (difference in hidden states between desired and undesired behavior). Add them at inference time to steer behavior WITHOUT training.

**The pipeline:**
1. Extract a steering vector for the target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into the weights permanently via Apollo
5. Remove the steering vector

**Immediately actionable**: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
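Steps 1 and 2 of the pipeline above reduce to a difference of means plus an addition to the hidden state. A sketch with assumed shapes (n_examples × d_model activations collected at one chosen layer); the activation hooks and the vLLM integration are deliberately left out:

```python
import numpy as np

def extract_steering_vector(acts_desired, acts_undesired):
    """Difference-of-means direction between the two behaviors' hidden states."""
    return acts_desired.mean(axis=0) - acts_undesired.mean(axis=0)

def steer(hidden, vector, strength=1.0):
    """Add the direction to a hidden state at inference; no weight change."""
    return hidden + strength * vector
```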
## 7. Mix 25% General Data

**Source**: Me-Llama (from continual learning survey).

Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).

For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.
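In practice the 25% mix is one sampling decision per batch slot. A stdlib sketch (function and argument names are ours):

```python
import random

def mixed_batches(behavioral, general, general_frac=0.25, batch_size=8, seed=0):
    """Yield batches where each slot is general-domain with prob general_frac."""
    rng = random.Random(seed)
    while True:
        yield [rng.choice(general) if rng.random() < general_frac
               else rng.choice(behavioral) for _ in range(batch_size)]
```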
---

## What We Don't Know (Need to Test)

1. How many examples for measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after)
3. Does steering vector extraction work for complex behavioral patterns or only simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to listen improve code quality?)

Each of these is answerable with one experiment.