Training Research: What We Know
1. The Optimizer (Apollo)
Source: arXiv:2412.05270, read in full.
Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.
Critical implementation details we got wrong initially:
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, the scaling factors are systematically too small.
- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.
What the paper actually shows: channel-wise scaling is sufficient for LLM training; AdamW's per-element granularity is redundant. Apollo also achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training — flat minima mean broad behavioral patterns, not narrow pattern matching.
Learning rate: 1e-5 to 1e-4 for fine-tuning. Paper sweeps this range and finds it works.
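The pieces above fit together as follows — a minimal numpy sketch of the channel-wise scaling idea (illustrative, not the paper's exact algorithm; toy rank 16 here instead of our 256):

```python
import numpy as np

rng = np.random.default_rng(0)

def apollo_update(G, state, beta1=0.9, beta2=0.999, eps=1e-8, gamma=1.01):
    """Simplified Apollo-style step: AdamW moments live in a rank-r randomly
    projected space and come back to full rank only as one scale per channel."""
    n, r = state["P"].shape
    Gp = G @ state["P"]                              # (channels, r) projected gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * Gp
    state["v"] = beta2 * state["v"] + (1 - beta2) * Gp ** 2
    U = state["m"] / (np.sqrt(state["v"]) + eps)     # low-rank Adam direction
    # Channel-wise scale; alpha = sqrt(n/r) compensates for the projection ratio
    s = np.sqrt(n / r) * np.linalg.norm(U, axis=1) / (np.linalg.norm(Gp, axis=1) + eps)
    upd = G * s[:, None]
    nrm = np.linalg.norm(upd)
    if state["prev"] > 0 and nrm > gamma * state["prev"]:
        upd *= gamma * state["prev"] / nrm           # norm-growth limiter (gamma = 1.01)
        nrm = gamma * state["prev"]
    state["prev"] = nrm
    return upd

n, r, channels = 128, 16, 32
state = {"P": rng.normal(size=(n, r)) / np.sqrt(r),  # random projection; refresh every ~200 steps
         "m": np.zeros((channels, r)),
         "v": np.zeros((channels, r)),
         "prev": 0.0}
G = rng.normal(size=(channels, n))
upd1 = apollo_update(G, state)
upd2 = apollo_update(10 * G, state)   # a 10x gradient spike is capped by the limiter
```

Note the limiter: even a 10x gradient spike can grow the realized update norm by at most a factor of γ per step, which is what suppresses early-training loss spikes.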
2. Where the Gradient Goes (W_q)
Key insight: In context-frozen training, the gradient flows through W_q (query projection) even when context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.
Math: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern — it just can't change the context keys.
For behavioral training: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than scanning for alternatives.
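The claim above can be checked numerically on a single attention logit s_ij = (W_q x_i)·k_j with the key frozen (toy shapes, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_q = rng.normal(size=(d, d))   # trainable query projection
x_i = rng.normal(size=d)        # current-token hidden state
k_j = rng.normal(size=d)        # frozen context key -- never updated

def logit(W):
    # attention logit s_ij = q_i . k_j with q_i = W x_i
    return (W @ x_i) @ k_j

# Analytic gradient: ds_ij/dW_q = k_j x_i^T. The frozen key appears
# inside the gradient even though the key itself receives no update.
grad_analytic = np.outer(k_j, x_i)

# Finite-difference confirmation
eps = 1e-6
grad_fd = np.zeros_like(W_q)
for a in range(d):
    for b in range(d):
        E = np.zeros((d, d)); E[a, b] = eps
        grad_fd[a, b] = (logit(W_q + E) - logit(W_q - E)) / (2 * eps)
```

The gradient w.r.t. W_q is literally built out of the frozen key: the context steers where W_q moves, but nothing flows back into the context.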
3. GDN Layers Are Disposition Architecture
Novel insight: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.
Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)
After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.
The 48/16 split IS disposition-over-procedure at the hardware level.
Implication: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
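In practice that implication becomes a trainable-parameter filter. A sketch with hypothetical parameter names (the real checkpoint's naming may differ):

```python
# Hypothetical parameter names -- check against the actual state dict.
names = [
    "layers.0.linear_attn.A_log",
    "layers.0.linear_attn.dt_bias",
    "layers.0.linear_attn.in_proj_qkvz.weight",
    "layers.0.linear_attn.in_proj_ba.weight",
    "layers.1.self_attn.q_proj.weight",   # full-attention layer
    "layers.0.mlp.gate_proj.weight",      # MLP, not a GDN lever
]

GDN_LEVERS = ("A_log", "dt_bias", "qkvz", "in_proj_ba")

def is_gdn_lever(name: str) -> bool:
    # Keep only the GDN gating parameters and input projections trainable
    return any(tag in name for tag in GDN_LEVERS)

trainable = [n for n in names if is_gdn_lever(n)]
```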
4. No Pause Needed (HOGWILD)
Source: Niu et al., 2011.
HOGWILD proves lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.
Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.
Practical: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The math guarantees convergence.
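A toy sketch of the one-writer/one-reader pattern (the writer stands in for the training loop, the reader for the serving process; everything here is illustrative — no lock, no synchronization):

```python
import threading
import numpy as np

weights = np.zeros(1024)          # shared parameter buffer, deliberately unlocked

def trainer(steps=1000):
    for _ in range(steps):
        weights[:] += 0.001       # the single writer applies in-place updates

def server(reads=1000):
    for _ in range(reads):
        _ = float(weights.sum()) # reader may see a mid-update snapshot; that's fine

t = threading.Thread(target=trainer)
s = threading.Thread(target=server)
t.start(); s.start()
t.join(); s.join()
```

The reader can observe partially applied updates, but with one writer there is no lost-update hazard, which is the property the HOGWILD analysis needs.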
5. Task Vectors for Composable Behavioral Change
Source: Ilharco et al., 2022; Yadav et al., 2023.
Task vector = W_finetuned - W_pretrained. Captures the direction of behavioral change in weight space.
Key properties:
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors
For us: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.
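The arithmetic is simple enough to show directly. A numpy sketch with synthetic stand-ins for two fine-tuned checkpoints (names like W_listen are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (8, 8)
W_pre = rng.normal(size=shape)                   # stand-in for pretrained weights
W_listen = W_pre + 0.1 * rng.normal(size=shape)  # two separate behavioral fine-tunes
W_terse = W_pre + 0.1 * rng.normal(size=shape)   # (synthetic, for illustration)

tau_listen = W_listen - W_pre    # task vector: direction of one behavioral change
tau_terse = W_terse - W_pre

# Addition + scaling: compose behaviors at chosen strengths
W_both = W_pre + 1.0 * tau_listen + 0.5 * tau_terse
# Negation: subtract a vector to remove that behavior again
W_removed = W_both - 0.5 * tau_terse
```

Removing a behavior recovers the model without it exactly, which is what makes each behavior independently tunable and testable. (TIES-merging adds trimming and sign-election on top of this when vectors conflict.)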
6. Steering Vectors for Rapid Prototyping
Source: Linear representation literature.
Extract behavioral directions from activation space (difference in hidden states between desired and undesired behavior). Add to inference to steer behavior WITHOUT training.
The pipeline:
- Extract steering vector for target behavior
- Test by adding it during vLLM inference
- Verify the direction is correct
- THEN train it into weights permanently via Apollo
- Remove the steering vector
Immediately actionable: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
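The extraction step is just a difference of means over paired activations. A sketch with synthetic hidden states (in practice, capture them with forward hooks during inference):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Hidden states at one layer for paired prompts -- synthetic stand-ins here
h_desired = rng.normal(loc=0.5, size=(32, d))
h_undesired = rng.normal(loc=-0.5, size=(32, d))

# Difference-of-means steering vector
v = h_desired.mean(axis=0) - h_undesired.mean(axis=0)

def steer(h, alpha=1.0):
    # Added to the residual stream at inference; alpha tunes strength.
    # Once the behavior is trained into the weights, set alpha = 0.
    return h + alpha * v
```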
7. Mix 25% General Data
Source: Me-Llama (from continual learning survey).
Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).
For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.
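The mixing itself is a one-liner; the only real decision is shuffling so domains interleave rather than arriving in blocks. A sketch with placeholder examples:

```python
import random

random.seed(0)
behavioral = [("behavioral", i) for i in range(300)]  # placeholder examples
general = [("general", i) for i in range(100)]        # 25% of the final mix

mixed = behavioral + general
random.shuffle(mixed)   # interleave domains; don't train in domain blocks

frac_general = sum(1 for kind, _ in mixed if kind == "general") / len(mixed)
```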
What We Don't Know (Need to Test)
- How many examples for measurable behavioral change? (10? 50? 500?)
- Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after)
- Does steering vector extraction work for complex behavioral patterns or only simple features like sentiment?
- What's the right learning rate for OUR specific use case?
- Does positive backward transfer actually occur? (Does learning to listen improve code quality?)
Each of these is answerable with one experiment.