# Training Research: What We Know
## 1. The Optimizer (Apollo)
**Source**: arXiv:2412.05270, read in full.
Apollo approximates AdamW's per-element scaling with channel-wise
scaling computed in a low-rank projected space. Random projection,
not SVD. Rank-256 for us.
**Critical implementation details we got wrong initially:**
- Scale factor α = √(n/r) compensates for the norm shrinkage of a random
projection from n dimensions down to r. Without it, the scaling factors
come out systematically too small.
- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.
**What the paper actually proves**: channel-wise scaling is sufficient
for LLM training. The per-element granularity in AdamW is redundant.
Apollo achieves lower directional sharpness than AdamW, meaning it
finds flatter minima that generalize better. This matters for
behavioral training — flat minima = broad behavioral patterns, not
narrow pattern matching.
**Learning rate**: 1e-5 to 1e-4 for fine-tuning. Paper sweeps this
range and finds it works.
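
A minimal sketch of one Apollo-style step for a single 2-D weight, under
our simplified reading above: AdamW-style moments live in the rank-r
projected space, the per-channel scaling is the adapted-to-raw norm ratio,
and α = √(n/r) rescales it. Names and structure are ours, not the paper's
reference implementation; the 200-step projection refresh is left to the
caller.

```python
import torch

def apollo_step(grad, P, m, v, step, prev_norm,
                beta1=0.9, beta2=0.999, eps=1e-8, gamma=1.01):
    """One channel-wise scaling step for a 2-D weight (simplified sketch).

    grad: (n, k) raw gradient      P: (r, n) random projection (caller
    m, v: AdamW moments, (r, k)       redraws P every ~200 steps)
    """
    n, r = grad.shape[0], P.shape[0]
    g_low = P @ grad                                  # rank-r view of the gradient
    m.mul_(beta1).add_(g_low, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g_low, g_low, value=1 - beta2)
    g_adam = (m / (1 - beta1**step)) / ((v / (1 - beta2**step)).sqrt() + eps)
    # channel-wise scaling: adapted-to-raw norm ratio per channel, with
    # alpha = sqrt(n/r) compensating for the projection's norm shrinkage
    s = g_adam.norm(dim=0) / (g_low.norm(dim=0) + eps) * (n / r) ** 0.5
    update = grad * s.unsqueeze(0)
    # norm-growth limiter: cap this update's norm at gamma x the previous one
    cur = update.norm()
    if prev_norm is not None and cur > gamma * prev_norm:
        update = update * (gamma * prev_norm / cur)
        cur = gamma * prev_norm
    return update, float(cur)
```

The caller keeps `P`, `m`, `v`, and `prev_norm` per weight matrix and feeds
the returned `update` to the learning-rate step.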
## 2. Where the Gradient Goes (W_q)
**Key insight**: In context-frozen training, the gradient flows through
W_q (query projection) even when context KVs are frozen. The model
learns to LOOK AT context differently, not to understand it differently.
**Math**: ∂L/∂W_q depends on the frozen context keys through the
attention weights α_{ij}. The gradient knows about the context through
the attention pattern — it just can't change the context keys.
**For behavioral training**: this is exactly right. The model already
understands Kent's direction. It just doesn't ATTEND to it properly.
Training adjusts W_q so the query vectors seek out the direction
rather than scanning for alternatives.
## 3. GDN Layers Are Disposition Architecture
**Novel insight**: The 48 GDN layers (75% of Qwen3.5) make behavioral
training DEEPER than full attention would.
Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)
After training, the GDN recurrent state is direction-shaped. The model
can't NOT attend to direction because the compressed representation
already emphasizes it. The 16 full attention layers provide deliberate
override when needed.
The 48/16 split IS disposition-over-procedure at the architectural level.
**Implication**: 75% of our gradient signal goes through GDN layers.
The gating parameters (A_log, dt_bias) and projections (qkvz, ba)
are the primary levers for behavioral change.
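
A sketch of wiring that up. The name fragments are from our notes, and the
actual parameter names depend on the Qwen3.5 checkpoint, so verify against
`model.named_parameters()` before trusting the match.

```python
import re

# Assumed name fragments for the GDN levers (hypothetical; check the
# real checkpoint's parameter names first)
GDN_LEVERS = ("A_log", "dt_bias", "qkvz", "ba")

def _has_fragment(name, frag):
    # whole-token match, so "ba" does not accidentally match "bias"
    return re.search(rf"(?<![A-Za-z0-9]){re.escape(frag)}(?![A-Za-z0-9])", name)

def freeze_all_but_gdn_levers(model):
    trainable = []
    for name, param in model.named_parameters():
        if any(_has_fragment(name, f) for f in GDN_LEVERS):
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)
    return trainable
```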
## 4. No Pause Needed (HOGWILD)
**Source**: Niu et al., 2011.
HOGWILD proves lock-free parallel SGD converges when gradient updates
are sparse. Our case (one writer, one reader) is strictly easier than
HOGWILD's multi-writer scenario.
Context-frozen training on short decision segments produces naturally
sparse gradients. The sparsity condition is satisfied without special
engineering.
**Practical**: just train while vLLM serves. Don't pause, don't lock,
don't synchronize. The math guarantees convergence.
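
A toy illustration of the one-writer/one-reader pattern (plain threads on
a shared tensor, not a vLLM integration): the trainer applies sparse
in-place updates while the reader samples the weights mid-flight, and
nothing synchronizes.

```python
import threading, time, torch

W = torch.zeros(4096)                     # weights shared by both threads

def trainer():                            # the one writer
    for step in range(500):
        idx = torch.randint(0, W.numel(), (8,))   # sparse: touches 8 coords
        W[idx] += 0.01 * torch.randn(8)           # in-place, no lock
        time.sleep(0.001)

def server():                             # the one reader
    for _ in range(50):
        _ = W.clone()                     # may catch a half-applied update;
        time.sleep(0.01)                  # HOGWILD says that is fine

t, s = threading.Thread(target=trainer), threading.Thread(target=server)
t.start(); s.start(); t.join(); s.join()
```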
## 5. Task Vectors for Composable Behavioral Change
**Source**: Ilharco et al., 2022; Yadav et al., 2023.
Task vector = W_finetuned - W_pretrained. Captures the direction of
behavioral change in weight space.
**Key properties:**
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors
**For us**: train behavioral patterns separately, extract task vectors,
compose. Each behavior is independently tunable, removable, testable.
Personality as version control.
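
The arithmetic, sketched over `state_dict`s (TIES-style conflict
resolution omitted; the `scales` values are hypothetical per-behavior
weights we choose):

```python
import torch

def task_vector(ft_state, base_state):
    """tau = W_finetuned - W_pretrained, per parameter tensor."""
    return {k: ft_state[k] - base_state[k] for k in base_state}

def compose(base_state, vectors, scales):
    """W = W_base + sum_i lambda_i * tau_i.
    Positive lambda adds a behavior, negative removes it,
    magnitude tunes its strength."""
    out = {k: v.clone() for k, v in base_state.items()}
    for tau, lam in zip(vectors, scales):
        for k in out:
            out[k] += lam * tau[k]
    return out

# e.g. compose(base, [tau_listen, tau_verbose], scales=[1.0, -0.5])
```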
## 6. Steering Vectors for Rapid Prototyping
**Source**: Linear representation literature.
Extract behavioral directions from activation space (the difference in
hidden states between desired and undesired behavior). Add the vector to
the hidden states at inference to steer behavior WITHOUT training.
**The pipeline:**
1. Extract steering vector for target behavior
2. Test by adding it during vLLM inference
3. Verify the direction is correct
4. THEN train it into weights permanently via Apollo
5. Remove the steering vector
**Immediately actionable**: we can prototype behavioral changes TODAY
without waiting for the training pipeline. Extract, test, verify.
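
A difference-of-means sketch for step 1, plus an HF-style forward hook for
testing step 2 (wiring the vector into vLLM is a separate step; `layer`,
the prompt sets, and `alpha` are ours to choose, not fixed by any paper):

```python
import torch

@torch.no_grad()
def extract_steering_vector(model, tok, pos_prompts, neg_prompts, layer):
    """Mean last-token hidden-state difference at one layer."""
    def mean_state(prompts):
        states = []
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            hs = model(ids, output_hidden_states=True).hidden_states[layer]
            states.append(hs[0, -1])
        return torch.stack(states).mean(0)
    return mean_state(pos_prompts) - mean_state(neg_prompts)

def add_steering(layer_module, vec, alpha=4.0):
    """Add alpha * vec to the layer's output; call .remove() on the
    returned handle to undo (pipeline step 5)."""
    def hook(module, args, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + alpha * vec
        return (h,) + output[1:] if isinstance(output, tuple) else h
    return layer_module.register_forward_hook(hook)
```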
## 7. Mix 25% General Data
**Source**: Me-Llama (from continual learning survey).
Mixing 25% general-domain data during fine-tuning prevents forgetting
AND can produce positive backward transfer (new learning improves old
capabilities).
For us: include agent logs and general conversation alongside
behavioral training examples. The diversity IS the regularization.
No weight decay needed.
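
A minimal sketch of the mix over in-memory example lists. The 25% is a
fraction of the final mix, hence the `n_general` formula (one general
example per three behavioral ones).

```python
import random

def mix(behavioral, general, general_frac=0.25, seed=0):
    """Return behavioral examples plus enough general ones to make
    general_frac of the final shuffled mix."""
    rng = random.Random(seed)
    n_general = round(len(behavioral) * general_frac / (1 - general_frac))
    mixed = behavioral + rng.sample(general, min(n_general, len(general)))
    rng.shuffle(mixed)
    return mixed
```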
---
## What We Don't Know (Need to Test)
1. How many examples for measurable behavioral change? (10? 50? 500?)
2. Does the GDN compressed state actually become direction-shaped after
training? (Measure by comparing recurrent states before/after)
3. Does steering vector extraction work for complex behavioral patterns
or only simple features like sentiment?
4. What's the right learning rate for OUR specific use case?
5. Does positive backward transfer actually occur? (Does learning to
listen improve code quality?)
Each of these is answerable with one experiment.