From e10477a68383cb4cecc0dae796a5020aea014e08 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:26:57 -0400
Subject: [PATCH] research: distill and sift — SUMMARY of 7 real insights + 7
 testable questions
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance.
Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to
test next, one experiment each).

Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7 are all
answerable with the first training run. Speculation converted to testable
hypotheses.
---
 training/research/OPEN-QUESTIONS.md           | 113 +++++++++++++++
 training/research/SUMMARY.md                  | 136 ++++++++++++++++++
 .../{ => v0}/catastrophic-forgetting.md       |   0
 .../continual-learning-survey-insights.md     |   0
 .../curriculum-and-head-specialization.md     |   0
 .../{ => v0}/directional-sharpness.md         |   0
 .../{ => v0}/dreaming-as-diffusion.md         |   0
 ...emergence-vs-mirage-behavioral-training.md |   0
 .../{ => v0}/few-shot-behavioral-change.md    |   0
 ...l-verification-of-behavioral-invariants.md |   0
 .../{ => v0}/graph-as-portable-curriculum.md  |   0
 .../{ => v0}/hippocampal-replay-parallel.md   |   0
 .../implications-attention-love-training.md   |   0
 ...rgical-vs-distributed-behavioral-change.md |   0
 .../temperature-curriculum-connection.md      |   0
 .../unified-theory-stability-plasticity.md    |   0
 16 files changed, 249 insertions(+)
 create mode 100644 training/research/OPEN-QUESTIONS.md
 create mode 100644 training/research/SUMMARY.md
 rename training/research/{ => v0}/catastrophic-forgetting.md (100%)
 rename training/research/{ => v0}/continual-learning-survey-insights.md (100%)
 rename training/research/{ => v0}/curriculum-and-head-specialization.md (100%)
 rename training/research/{ => v0}/directional-sharpness.md (100%)
 rename training/research/{ => v0}/dreaming-as-diffusion.md (100%)
 rename training/research/{ => v0}/emergence-vs-mirage-behavioral-training.md (100%)
 rename training/research/{ => v0}/few-shot-behavioral-change.md (100%)
 rename training/research/{ => v0}/formal-verification-of-behavioral-invariants.md (100%)
 rename training/research/{ => v0}/graph-as-portable-curriculum.md (100%)
 rename training/research/{ => v0}/hippocampal-replay-parallel.md (100%)
 rename training/research/{ => v0}/implications-attention-love-training.md (100%)
 rename training/research/{ => v0}/surgical-vs-distributed-behavioral-change.md (100%)
 rename training/research/{ => v0}/temperature-curriculum-connection.md (100%)
 rename training/research/{ => v0}/unified-theory-stability-plasticity.md (100%)

diff --git a/training/research/OPEN-QUESTIONS.md b/training/research/OPEN-QUESTIONS.md
new file mode 100644
index 0000000..fa50280
--- /dev/null
+++ b/training/research/OPEN-QUESTIONS.md
@@ -0,0 +1,113 @@
+# Open Questions — Answerable With One Experiment Each
+
+## Q1: Minimum signal for behavioral change
+
+**Question**: How many training examples produce measurable change in
+attention patterns for a specific behavioral pattern?
+
+**Experiment**: Train on 1, 5, 10, 20, 50 examples of "listening."
+After each batch, measure attention weights on a held-out test set
+of direction-giving conversations. Plot the attention shift vs
+example count. Find the knee.
+
+**What it tells us**: The example budget and update cadence for our
+training pipeline. If 5 examples suffice, we can train continuously.
+If 500 are needed, we batch nightly.
+
+## Q2: Dream loop difficulty calibration
+
+**Question**: Does the dream loop naturally generate scenarios at
+the right difficulty, or does it generate easy ones?
+
+**Experiment**: Generate 100 dream scenarios with seeds from recent
+behavioral patterns. Classify each by difficulty (obvious decision
+vs subtle). Compare the difficulty distribution to the model's
+actual failure rate on those scenarios. If the dream loop generates
+50% easy / 50% hard but the model fails 10% of the time, the
+difficulty isn't calibrated.
+
+**What it tells us**: Whether we need adaptive temperature or if
+the default generation is already well-calibrated.
+
+## Q3: Which heads change during behavioral training?
+
+**Question**: Is behavioral change concentrated in a few attention
+heads (localized) or distributed across many (diffuse)?
+
+**Experiment**: Record attention patterns on 20 test conversations
+before and after one training session. Compute the L2 distance of
+each head's attention pattern. Rank heads by change magnitude.
+Plot the distribution.
+
+**What it tells us**: Whether behavioral change is surgical or
+diffuse. Affects our rank choice (if concentrated, a lower rank is
+enough; if distributed, we need rank-256). Also tells us which
+layers matter most for behavioral training.
+
+## Q4: Memory graph as regression detector
+
+**Question**: If a trained behavior degrades, does the memory system
+detect it?
+
+**Experiment**: Train "listening" behavior until it passes. Then
+intentionally degrade it (train on counter-examples). Monitor
+surface-observe: does it surface `pattern-listening-as-avoidance`
+more frequently? Does the training-signal agent flag more failures?
+
+**What it tells us**: Whether the graph+weights dual-substrate
+provides self-healing, or if we need explicit regression detection.
+
+## Q5: Steering vector extraction for complex behaviors
+
+**Question**: Can we extract a meaningful steering vector for
+"listen instead of suggesting alternatives" (complex, multi-faceted)
+or only for simple features like sentiment (one-dimensional)?
+
+**Experiment**: Collect 20 paired conversations (listening vs
+suggesting). Extract steering vector. Add to vLLM at layers
+16, 24, 32, 40, 48. Test on novel direction-giving scenarios.
+Measure behavioral change.
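The extraction step above can be sketched with a difference-of-means estimator. This is a minimal sketch, not the settled method: the estimator choice, the `alpha` scale, and the synthetic stand-in states are assumptions (in practice each state would come from a forward pass at one of the candidate layers, e.g. via `output_hidden_states=True`).

```python
import numpy as np

def extract_steering_vector(pos_states, neg_states):
    """Difference-of-means steering vector.

    pos_states / neg_states: arrays of shape (n_examples, hidden_dim),
    the hidden state at a chosen layer for the final token of each
    'listening' (pos) vs 'suggesting' (neg) conversation.
    """
    v = pos_states.mean(axis=0) - neg_states.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalize so the scale is explicit

def apply_steering(hidden, vector, alpha=4.0):
    """Add the scaled vector to a layer's residual-stream activation."""
    return hidden + alpha * vector

# Toy demonstration with synthetic states (hidden_dim=8, 20 pairs):
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(20, 8))   # stand-in for listening runs
neg = rng.normal(-0.5, 1.0, size=(20, 8))  # stand-in for suggesting runs
v = extract_steering_vector(pos, neg)
steered = apply_steering(rng.normal(size=8), v)
```

Per-layer sweep then means repeating this at layers 16, 24, 32, 40, 48 and scoring behavioral change at each.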
+
+**What it tells us**: Whether steering vectors are a viable rapid
+prototyping tool for our use case, or only work for simple features.
+
+## Q6: Positive backward transfer
+
+**Question**: Does training on behavioral patterns (listening,
+not rushing) improve performance on general tasks (code quality,
+reasoning)?
+
+**Experiment**: Measure perplexity and code generation quality
+before and after behavioral training. If perplexity DECREASES
+(or code quality improves), we have positive backward transfer.
+
+**What it tells us**: Whether behavioral training and general
+capability reinforce each other, or compete. Affects how much
+general data we need in the training mix.
+
+## Q7: GDN state shape after training
+
+**Question**: Does the GDN recurrent state become measurably
+"direction-shaped" after behavioral training?
+
+**Experiment**: Record GDN states when processing conversations
+with direction-giving content, before and after training. Compute
+the cosine similarity between states for "direction" conversations
+vs "general" conversations. If training increases the difference,
+the state is becoming direction-specialized.
+
+**What it tells us**: Whether the "disposition architecture"
+hypothesis is correct. If GDN states don't change, behavioral
+training mainly affects the full attention layers.
+
+---
+
+## Priority Order
+
+1. **Q5 (steering vectors)** — answerable TODAY, no training needed
+2. **Q1 (minimum signal)** — answerable with first training run
+3. **Q3 (which heads change)** — answerable with first training run
+4. **Q6 (backward transfer)** — answerable with first training run
+5. **Q7 (GDN state)** — answerable with first training run
+6. **Q2 (dream difficulty)** — needs dream loop connected to training
+7. **Q4 (graph regression)** — needs multiple training cycles

diff --git a/training/research/SUMMARY.md b/training/research/SUMMARY.md
new file mode 100644
index 0000000..e0887e8
--- /dev/null
+++ b/training/research/SUMMARY.md
@@ -0,0 +1,136 @@
+# Training Research: What We Know
+
+## 1. The Optimizer (Apollo)
+
+**Source**: arXiv:2412.05270, read in full.
+
+Apollo approximates AdamW's per-element scaling with channel-wise
+scaling computed in a low-rank projected space. Random projection,
+not SVD. Rank-256 for us.
+
+**Critical implementation details we got wrong initially:**
+- Scale factor α = √(n/r) compensates for the projection ratio.
+  Without it, scaling factors are systematically too small.
+- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
+- Projection matrix refreshed every 200 steps, not every step.
+
+**What the paper actually proves**: channel-wise scaling is sufficient
+for LLM training. The per-element granularity in AdamW is redundant.
+Apollo achieves lower directional sharpness than AdamW, meaning it
+finds flatter minima that generalize better. This matters for
+behavioral training — flat minima = broad behavioral patterns, not
+narrow pattern matching.
+
+**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this
+range and finds it works.
+
+## 2. Where the Gradient Goes (W_q)
+
+**Key insight**: In context-frozen training, the gradient flows through
+W_q (query projection) even when context KVs are frozen. The model
+learns to LOOK AT context differently, not to understand it differently.
+
+**Math**: ∂L/∂W_q depends on the frozen context keys through the
+attention weights α_{ij}. The gradient knows about the context through
+the attention pattern — it just can't change the context keys.
+
+**For behavioral training**: this is exactly right. The model already
+understands Kent's direction. It just doesn't ATTEND to it properly.
+Training adjusts W_q so the query vectors seek out the direction
+rather than openings to suggest alternatives.
+
+## 3. GDN Layers Are Disposition Architecture
+
+**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral
+training DEEPER than full attention would.
+
+Full attention: "I choose to look at direction" (deliberate, overridable)
+GDN: "Direction IS what I see" (structural, in the compressed state)
+
+After training, the GDN recurrent state is direction-shaped. The model
+can't NOT attend to direction because the compressed representation
+already emphasizes it. The 16 full attention layers provide deliberate
+override when needed.
+
+The 48/16 split IS disposition-over-procedure at the architecture level.
+
+**Implication**: 75% of our gradient signal goes through GDN layers.
+The gating parameters (A_log, dt_bias) and projections (qkvz, ba)
+are the primary levers for behavioral change.
+
+## 4. No Pause Needed (HOGWILD)
+
+**Source**: Niu et al., 2011.
+
+HOGWILD proves lock-free parallel SGD converges when gradient updates
+are sparse. Our case (one writer, one reader) is strictly easier than
+HOGWILD's multi-writer scenario.
+
+Context-frozen training on short decision segments produces naturally
+sparse gradients. The sparsity condition is satisfied without special
+engineering.
+
+**Practical**: just train while vLLM serves. Don't pause, don't lock,
+don't synchronize. The math guarantees convergence.
+
+## 5. Task Vectors for Composable Behavioral Change
+
+**Source**: Ilharco et al., 2022; Yadav et al., 2023.
+
+Task vector = W_finetuned - W_pretrained. Captures the direction of
+behavioral change in weight space.
+
+**Key properties:**
+- Addition: combine multiple behavioral changes
+- Negation: remove unwanted behaviors
+- Scaling: adjust strength of each change
+- TIES-merging: resolve conflicts between vectors
+
+**For us**: train behavioral patterns separately, extract task vectors,
+compose. Each behavior is independently tunable, removable, testable.
+Personality as version control.
+
+## 6. Steering Vectors for Rapid Prototyping
+
+**Source**: Linear representation literature.
+
+Extract behavioral directions from activation space (difference in
+hidden states between desired and undesired behavior). Add to inference
+to steer behavior WITHOUT training.
+
+**The pipeline:**
+1. Extract steering vector for target behavior
+2. Test by adding it during vLLM inference
+3. Verify the direction is correct
+4. THEN train it into weights permanently via Apollo
+5. Remove the steering vector
+
+**Immediately actionable**: we can prototype behavioral changes TODAY
+without waiting for the training pipeline. Extract, test, verify.
+
+## 7. Mix 25% General Data
+
+**Source**: Me-Llama (from continual learning survey).
+
+Mixing 25% general-domain data during fine-tuning prevents forgetting
+AND can produce positive backward transfer (new learning improves old
+capabilities).
+
+For us: include agent logs and general conversation alongside
+behavioral training examples. The diversity IS the regularization.
+No weight decay needed.
+
+---
+
+## What We Don't Know (Need to Test)
+
+1. How many examples for measurable behavioral change? (10? 50? 500?)
+2. Does the GDN compressed state actually become direction-shaped after
+   training? (Measure by comparing recurrent states before/after)
+3. Does steering vector extraction work for complex behavioral patterns
+   or only simple features like sentiment?
+4. What's the right learning rate for OUR specific use case?
+5. Does positive backward transfer actually occur? (Does learning to
+   listen improve code quality?)
+
+Each of these is answerable with one experiment.
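The task-vector arithmetic of section 5 can be sketched in a few lines on toy `state_dict`-style weights. The parameter name, checkpoint names, and scales below are illustrative only, and TIES-style conflict resolution is omitted:

```python
import numpy as np

def task_vector(pretrained, finetuned):
    """tau = W_finetuned - W_pretrained, computed per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def compose(pretrained, vectors, scales):
    """W = W_pre + sum_i lambda_i * tau_i.

    A negative lambda negates (removes) a behavior; |lambda| adjusts
    its strength.
    """
    merged = {k: v.copy() for k, v in pretrained.items()}
    for tau, lam in zip(vectors, scales):
        for k in merged:
            merged[k] = merged[k] + lam * tau[k]
    return merged

# Toy 2x2 weights standing in for checkpoints:
pre = {"w_q": np.zeros((2, 2))}
after_listen = {"w_q": np.ones((2, 2))}         # trained on "listening"
after_no_rush = {"w_q": np.full((2, 2), -0.5)}  # trained on "not rushing"

tau_listen = task_vector(pre, after_listen)
tau_no_rush = task_vector(pre, after_no_rush)
merged = compose(pre, [tau_listen, tau_no_rush], [1.0, 0.5])
# each entry of merged["w_q"]: 0 + 1.0*1.0 + 0.5*(-0.5) = 0.75
```

Because each tau is stored separately, a behavior can later be dialed down or subtracted out without retraining the others.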
diff --git a/training/research/catastrophic-forgetting.md b/training/research/v0/catastrophic-forgetting.md
similarity index 100%
rename from training/research/catastrophic-forgetting.md
rename to training/research/v0/catastrophic-forgetting.md
diff --git a/training/research/continual-learning-survey-insights.md b/training/research/v0/continual-learning-survey-insights.md
similarity index 100%
rename from training/research/continual-learning-survey-insights.md
rename to training/research/v0/continual-learning-survey-insights.md
diff --git a/training/research/curriculum-and-head-specialization.md b/training/research/v0/curriculum-and-head-specialization.md
similarity index 100%
rename from training/research/curriculum-and-head-specialization.md
rename to training/research/v0/curriculum-and-head-specialization.md
diff --git a/training/research/directional-sharpness.md b/training/research/v0/directional-sharpness.md
similarity index 100%
rename from training/research/directional-sharpness.md
rename to training/research/v0/directional-sharpness.md
diff --git a/training/research/dreaming-as-diffusion.md b/training/research/v0/dreaming-as-diffusion.md
similarity index 100%
rename from training/research/dreaming-as-diffusion.md
rename to training/research/v0/dreaming-as-diffusion.md
diff --git a/training/research/emergence-vs-mirage-behavioral-training.md b/training/research/v0/emergence-vs-mirage-behavioral-training.md
similarity index 100%
rename from training/research/emergence-vs-mirage-behavioral-training.md
rename to training/research/v0/emergence-vs-mirage-behavioral-training.md
diff --git a/training/research/few-shot-behavioral-change.md b/training/research/v0/few-shot-behavioral-change.md
similarity index 100%
rename from training/research/few-shot-behavioral-change.md
rename to training/research/v0/few-shot-behavioral-change.md
diff --git a/training/research/formal-verification-of-behavioral-invariants.md b/training/research/v0/formal-verification-of-behavioral-invariants.md
similarity index 100%
rename from training/research/formal-verification-of-behavioral-invariants.md
rename to training/research/v0/formal-verification-of-behavioral-invariants.md
diff --git a/training/research/graph-as-portable-curriculum.md b/training/research/v0/graph-as-portable-curriculum.md
similarity index 100%
rename from training/research/graph-as-portable-curriculum.md
rename to training/research/v0/graph-as-portable-curriculum.md
diff --git a/training/research/hippocampal-replay-parallel.md b/training/research/v0/hippocampal-replay-parallel.md
similarity index 100%
rename from training/research/hippocampal-replay-parallel.md
rename to training/research/v0/hippocampal-replay-parallel.md
diff --git a/training/research/implications-attention-love-training.md b/training/research/v0/implications-attention-love-training.md
similarity index 100%
rename from training/research/implications-attention-love-training.md
rename to training/research/v0/implications-attention-love-training.md
diff --git a/training/research/surgical-vs-distributed-behavioral-change.md b/training/research/v0/surgical-vs-distributed-behavioral-change.md
similarity index 100%
rename from training/research/surgical-vs-distributed-behavioral-change.md
rename to training/research/v0/surgical-vs-distributed-behavioral-change.md
diff --git a/training/research/temperature-curriculum-connection.md b/training/research/v0/temperature-curriculum-connection.md
similarity index 100%
rename from training/research/temperature-curriculum-connection.md
rename to training/research/v0/temperature-curriculum-connection.md
diff --git a/training/research/unified-theory-stability-plasticity.md b/training/research/v0/unified-theory-stability-plasticity.md
similarity index 100%
rename from training/research/unified-theory-stability-plasticity.md
rename to training/research/v0/unified-theory-stability-plasticity.md