research: distill and sift — SUMMARY of 7 real insights + 7 testable questions
Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
Parent: 8061cc0477 · Commit: e10477a683
16 changed files with 249 additions and 0 deletions
training/research/OPEN-QUESTIONS.md (new file, 113 lines)
# Open Questions — Answerable With One Experiment Each
## Q1: Minimum signal for behavioral change

**Question**: How many training examples produce measurable change in attention patterns for a specific behavioral pattern?

**Experiment**: Train on 1, 5, 10, 20, 50 examples of "listening." After each batch, measure attention weights on a held-out test set of direction-giving conversations. Plot the attention shift vs example count. Find the knee.

**What it tells us**: The learning rate and batch size for our training pipeline. If 5 examples suffice, we can train continuously. If 500 are needed, we batch nightly.
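The knee-finding step can be sketched numerically. Everything here is hypothetical: it assumes the attention shift at each example count has already been summarized as a single scalar, and locates the knee as the point where the marginal gain per example falls off fastest.

```python
import numpy as np

def find_knee(example_counts, attention_shifts):
    """Locate the knee of the shift-vs-examples curve: the point where
    the marginal gain per additional example drops off most sharply."""
    counts = np.asarray(example_counts, dtype=float)
    shifts = np.asarray(attention_shifts, dtype=float)
    # Marginal gain per additional example between consecutive measurements.
    gains = np.diff(shifts) / np.diff(counts)
    # The knee sits where the marginal gain falls fastest.
    knee_idx = int(np.argmin(np.diff(gains))) + 1
    return int(counts[knee_idx])

# Illustrative (made-up) measurements: the shift saturates around 10 examples.
counts = [1, 5, 10, 20, 50]
shifts = [0.02, 0.11, 0.18, 0.19, 0.20]
print(find_knee(counts, shifts))   # -> 10
```

With real measurements the curve will be noisier; a smoothing pass before `np.diff` may be needed.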
## Q2: Dream loop difficulty calibration

**Question**: Does the dream loop naturally generate scenarios at the right difficulty, or does it generate easy ones?

**Experiment**: Generate 100 dream scenarios with seeds from recent behavioral patterns. Classify each by difficulty (obvious decision vs subtle). Compare the difficulty distribution to the model's actual failure rate on those scenarios. If the dream loop generates 50% easy / 50% hard but the model fails 10% of the time, the difficulty isn't calibrated.

**What it tells us**: Whether we need adaptive temperature or if the default generation is already well-calibrated.
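The comparison the experiment describes reduces to one number. A minimal sketch, with a hypothetical `calibration_gap` helper over already-labeled scenarios:

```python
def calibration_gap(scenarios):
    """scenarios: list of (difficulty, failed) pairs, difficulty in
    {"easy", "hard"}, failed a bool for whether the model got it wrong.
    Returns the gap between the hard fraction generated and the
    observed failure rate; near zero means well-calibrated."""
    hard_frac = sum(d == "hard" for d, _ in scenarios) / len(scenarios)
    fail_rate = sum(f for _, f in scenarios) / len(scenarios)
    return hard_frac - fail_rate

# The mismatch from the text: 50% hard scenarios, only a 10% failure rate.
scenarios = [("hard", False)] * 4 + [("hard", True)] + [("easy", False)] * 5
print(calibration_gap(scenarios))   # -> 0.4
```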
## Q3: Which heads change during behavioral training?

**Question**: Is behavioral change concentrated in a few attention heads (localized) or distributed across many (diffuse)?

**Experiment**: Record attention patterns on 20 test conversations before and after one training session. Compute the L2 distance of each head's attention pattern. Rank heads by change magnitude. Plot the distribution.

**What it tells us**: Whether behavioral change is surgical or diffuse. Affects our rank choice (if concentrated, a lower rank is OK; if distributed, we need rank-256). Also tells us which layers matter most for behavioral training.
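The ranking step can be sketched directly. This assumes attention patterns have already been averaged over the 20 test conversations into one matrix per `(layer, head)` key; the helper name and toy shapes are hypothetical.

```python
import numpy as np

def rank_heads_by_change(attn_before, attn_after):
    """attn_*: dict mapping (layer, head) -> averaged attention matrix.
    Returns (head, L2 distance) pairs sorted largest-change first."""
    changes = {
        head: float(np.linalg.norm(attn_after[head] - attn_before[head]))
        for head in attn_before
    }
    return sorted(changes.items(), key=lambda kv: kv[1], reverse=True)

# Toy example: 2 layers x 2 heads, 8x8 attention matrices.
rng = np.random.default_rng(0)
before = {(l, h): rng.random((8, 8)) for l in range(2) for h in range(2)}
after = {k: v.copy() for k, v in before.items()}
after[(1, 0)] += 0.5                     # simulate one head changing a lot
ranked = rank_heads_by_change(before, after)
print(ranked[0][0])                      # -> (1, 0)
```

Plotting the sorted distances then shows at a glance whether the change is surgical (steep drop-off) or diffuse (flat).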
## Q4: Memory graph as regression detector

**Question**: If a trained behavior degrades, does the memory system detect it?

**Experiment**: Train "listening" behavior until it passes. Then intentionally degrade it (train on counter-examples). Monitor surface-observe: does it surface `pattern-listening-as-avoidance` more frequently? Does the training-signal agent flag more failures?

**What it tells us**: Whether the graph+weights dual-substrate provides self-healing, or if we need explicit regression detection.
## Q5: Steering vector extraction for complex behaviors

**Question**: Can we extract a meaningful steering vector for "listen instead of suggesting alternatives" (complex, multi-faceted) or only for simple features like sentiment (one-dimensional)?

**Experiment**: Collect 20 paired conversations (listening vs suggesting). Extract steering vector. Add to vLLM at layers 16, 24, 32, 40, 48. Test on novel direction-giving scenarios. Measure behavioral change.

**What it tells us**: Whether steering vectors are a viable rapid prototyping tool for our use case, or only work for simple features.
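The extraction arithmetic can be sketched with difference-of-means, one standard recipe from the steering-vector literature (an assumption; the experiment may use a different extraction). How the vector is injected into vLLM's engine is out of scope here; `steer` only shows the per-layer addition.

```python
import numpy as np

def extract_steering_vector(listening_states, suggesting_states):
    """Difference-of-means steering vector at one layer.
    *_states: [n_pairs, hidden] final-token hidden states from the
    paired conversations (listening vs suggesting)."""
    return np.mean(listening_states, axis=0) - np.mean(suggesting_states, axis=0)

def steer(hidden, vector, scale=1.0):
    """Apply at inference: add the vector to the hidden state at the
    chosen layer (16, 24, 32, 40, or 48 in the experiment)."""
    return hidden + scale * vector

# Toy check: steering moves a "suggesting"-like state toward the listening mean.
listening = np.ones((20, 4))
suggesting = np.zeros((20, 4))
v = extract_steering_vector(listening, suggesting)
steered = steer(suggesting[0], v)
print(steered.tolist())   # -> [1.0, 1.0, 1.0, 1.0]
```

The `scale` knob is what the layer sweep varies alongside depth: too small does nothing, too large degrades fluency.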
## Q6: Positive backward transfer

**Question**: Does training on behavioral patterns (listening, not rushing) improve performance on general tasks (code quality, reasoning)?

**Experiment**: Measure perplexity and code generation quality before and after behavioral training. If perplexity DECREASES (or code quality improves), we have positive backward transfer.

**What it tells us**: Whether behavioral training and general capability reinforce each other, or compete. Affects how much general data we need in the training mix.
## Q7: GDN state shape after training

**Question**: Does the GDN recurrent state become measurably "direction-shaped" after behavioral training?

**Experiment**: Record GDN states when processing conversations with direction-giving content, before and after training. Compute the cosine similarity between states for "direction" conversations vs "general" conversations. If training increases the difference, the state is becoming direction-specialized.

**What it tells us**: Whether the "disposition architecture" hypothesis is correct. If GDN states don't change, behavioral training mainly affects the full attention layers.
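The before/after comparison can be summarized as a single specialization gap per checkpoint; the helper below is a hypothetical sketch, assuming each recorded GDN state has been flattened to one vector per conversation.

```python
import numpy as np

def specialization_gap(direction_states, general_states):
    """Mean pairwise cosine similarity within the 'direction' states minus
    the mean cross-group similarity. If training increases this gap, the
    recurrent state is separating direction content from general content."""
    def unit(x):
        x = np.asarray(x, dtype=float)
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    d, g = unit(direction_states), unit(general_states)
    within = (d @ d.T)[np.triu_indices(len(d), k=1)].mean()
    cross = (d @ g.T).mean()
    return within - cross

# Toy check: tightly clustered direction states vs near-orthogonal general ones.
direction = [[1.0, 0.0], [1.0, 0.2]]
general = [[0.0, 1.0], [0.2, 1.0]]
print(specialization_gap(direction, general) > 0)   # -> True
```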
---

## Priority Order

1. **Q5 (steering vectors)** — answerable TODAY, no training needed
2. **Q1 (minimum signal)** — answerable with first training run
3. **Q3 (which heads change)** — answerable with first training run
4. **Q6 (backward transfer)** — answerable with first training run
5. **Q7 (GDN state)** — answerable with first training run
6. **Q2 (dream difficulty)** — needs dream loop connected to training
7. **Q4 (graph regression)** — needs multiple training cycles
training/research/SUMMARY.md (new file, 136 lines)
# Training Research: What We Know
## 1. The Optimizer (Apollo)

**Source**: arXiv:2412.05270, read in full.

Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.

**Critical implementation details we got wrong initially:**
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, scaling factors are systematically too small.
- Norm-growth limiter (γ = 1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.

**What the paper actually proves**: channel-wise scaling is sufficient for LLM training; the per-element granularity in AdamW is redundant. Apollo achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training — flat minima = broad behavioral patterns, not narrow pattern matching.

**Learning rate**: 1e-5 to 1e-4 for fine-tuning. The paper sweeps this range and finds it works.
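The three implementation details above can be tied together in one sketch. This is a structural illustration of the ideas as this summary describes them, not a faithful reproduction of the paper's algorithm; `apollo_update` and its moment layout are assumptions.

```python
import numpy as np

def apollo_update(grad, proj, m, v, step, beta1=0.9, beta2=0.999,
                  eps=1e-8, gamma=1.01, prev_norm=None):
    """One channel-wise scaled update (sketch).
    grad: [n, cols] gradient; proj: [r, n] random projection, r << n.
    m, v: Adam-style moments kept in the projected space, shape [r, cols]."""
    n, r = grad.shape[0], proj.shape[0]
    g_low = proj @ grad                              # project gradient to rank r
    m = beta1 * m + (1 - beta1) * g_low
    v = beta2 * v + (1 - beta2) * g_low ** 2
    m_hat = m / (1 - beta1 ** step)                  # bias correction
    v_hat = v / (1 - beta2 ** step)
    adapted = m_hat / (np.sqrt(v_hat) + eps)
    # Channel-wise scale per column: adapted-update norm over raw-gradient norm.
    s = np.linalg.norm(adapted, axis=0) / (np.linalg.norm(g_low, axis=0) + eps)
    alpha = np.sqrt(n / r)                           # compensates projection ratio
    update = alpha * s * grad                        # scale the full-rank gradient
    # Norm-growth limiter: cap update-norm growth at gamma per step.
    norm = np.linalg.norm(update)
    if prev_norm is not None and norm > gamma * prev_norm:
        update *= gamma * prev_norm / norm
        norm = gamma * prev_norm
    return update, m, v, norm                        # caller applies -lr * update
```

Dropping `alpha` here is exactly the first bug listed above: the channel scales come out systematically too small by √(n/r).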
## 2. Where the Gradient Goes (W_q)

**Key insight**: In context-frozen training, the gradient flows through W_q (the query projection) even when context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.

**Math**: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern — it just can't change the context keys.

**For behavioral training**: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than openings for alternatives.
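The dependence on the frozen keys can be made explicit with the standard softmax-attention gradient. With q_i = W_q x_i, s_{ij} = q_i·k_j/√d, α_{ij} = softmax_j(s_{ij}), and o_i = Σ_j α_{ij} v_j:

```latex
\frac{\partial L}{\partial W_q} = \sum_i \frac{\partial L}{\partial q_i}\, x_i^{\top},
\qquad
\frac{\partial L}{\partial q_i}
  = \frac{1}{\sqrt{d}} \sum_j \alpha_{ij}
    \left( \frac{\partial L}{\partial o_i} \cdot v_j
         - \sum_{l} \alpha_{il}\, \frac{\partial L}{\partial o_i} \cdot v_l \right) k_j
```

Every term touches the frozen k_j and v_j only through the attention pattern, which is the point: the update can re-aim the queries but never rewrite the context.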
## 3. GDN Layers Are Disposition Architecture

**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.

Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)

After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.

The 48/16 split IS disposition-over-procedure at the hardware level.

**Implication**: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
## 4. No Pause Needed (HOGWILD)

**Source**: Niu et al., 2011.

HOGWILD proves lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.

Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.

**Practical**: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The math guarantees convergence.
## 5. Task Vectors for Composable Behavioral Change

**Source**: Ilharco et al., 2022; Yadav et al., 2023.

Task vector = W_finetuned - W_pretrained. Captures the direction of behavioral change in weight space.

**Key properties:**
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors

**For us**: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.
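The arithmetic is simple enough to sketch end to end (TIES-merging aside). Weights are modeled as name-to-array dicts; the helper names are hypothetical.

```python
import numpy as np

def task_vector(w_finetuned, w_pretrained):
    """tau = W_finetuned - W_pretrained, per parameter tensor."""
    return {name: w_finetuned[name] - w_pretrained[name] for name in w_pretrained}

def apply_task_vectors(w_base, vectors_and_scales):
    """Compose behaviors: W = W_base + sum_k lambda_k * tau_k.
    Negative lambda removes a behavior; |lambda| tunes its strength."""
    out = {name: w.copy() for name, w in w_base.items()}
    for tau, lam in vectors_and_scales:
        for name in out:
            out[name] += lam * tau[name]
    return out

# Toy check: adding then negating a task vector recovers the base weights,
# which is what makes each behavior independently removable.
base = {"w": np.zeros(3)}
tuned = {"w": np.array([0.1, -0.2, 0.3])}
tau = task_vector(tuned, base)
roundtrip = apply_task_vectors(base, [(tau, 1.0), (tau, -1.0)])
print(roundtrip["w"])   # -> [0. 0. 0.]
```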
## 6. Steering Vectors for Rapid Prototyping

**Source**: Linear representation literature.

Extract behavioral directions from activation space (difference in hidden states between desired and undesired behavior). Add to inference to steer behavior WITHOUT training.

**The pipeline:**
1. Extract steering vector for target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into weights permanently via Apollo
5. Remove the steering vector

**Immediately actionable**: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
## 7. Mix 25% General Data

**Source**: Me-Llama (from continual learning survey).

Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).

For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.
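The mixing recipe is one sampling step. A minimal sketch, assuming examples are already split into behavioral and general pools (`mixed_batch` is a hypothetical helper):

```python
import random

def mixed_batch(behavioral, general, batch_size=32, general_frac=0.25, seed=0):
    """Sample a batch with ~25% general-domain examples mixed in,
    per the recipe this section describes."""
    rng = random.Random(seed)
    n_general = round(batch_size * general_frac)
    batch = rng.sample(general, n_general) + rng.sample(behavioral, batch_size - n_general)
    rng.shuffle(batch)                # interleave, don't block by source
    return batch

# Toy pools: behavioral examples 0-99, general examples 100-199.
batch = mixed_batch(list(range(100)), list(range(100, 200)), batch_size=32)
print(sum(x >= 100 for x in batch))   # -> 8 general examples out of 32
```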
---

## What We Don't Know (Need to Test)

1. How many examples for measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after)
3. Does steering vector extraction work for complex behavioral patterns, or only for simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to listen improve code quality?)

Each of these is answerable with one experiment.