Training Research: What We Know
1. The Optimizer (Apollo)
Source: arXiv:2412.05270, read in full.
Apollo approximates AdamW's per-element scaling with channel-wise scaling computed in a low-rank projected space. Random projection, not SVD. Rank-256 for us.
Critical implementation details we got wrong initially:
- Scale factor α = √(n/r) compensates for the projection ratio. Without it, the scaling factors are systematically too small.
- Norm-growth limiter (γ=1.01) prevents loss spikes in early training.
- Projection matrix refreshed every 200 steps, not every step.
What the paper actually shows: channel-wise scaling is sufficient for LLM training; AdamW's per-element granularity is redundant. Apollo also achieves lower directional sharpness than AdamW, meaning it finds flatter minima that generalize better. This matters for behavioral training — flat minima mean broad behavioral patterns, not narrow pattern matching.
Learning rate: 1e-5 to 1e-4 for fine-tuning. Paper sweeps this range and finds it works.
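The pieces above fit together as follows — a minimal numpy sketch of the channel-wise scaling idea (illustrative, not the paper's exact algorithm; toy rank 16 here instead of our 256):

```python
import numpy as np

rng = np.random.default_rng(0)

def apollo_update(G, state, beta1=0.9, beta2=0.999, eps=1e-8, gamma=1.01):
    """Simplified Apollo-style step: AdamW moments live in a rank-r randomly
    projected space and come back to full rank only as one scale per channel."""
    n, r = state["P"].shape
    Gp = G @ state["P"]                              # (channels, r) projected gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * Gp
    state["v"] = beta2 * state["v"] + (1 - beta2) * Gp ** 2
    U = state["m"] / (np.sqrt(state["v"]) + eps)     # low-rank Adam direction
    # Channel-wise scale; alpha = sqrt(n/r) compensates for the projection ratio
    s = np.sqrt(n / r) * np.linalg.norm(U, axis=1) / (np.linalg.norm(Gp, axis=1) + eps)
    upd = G * s[:, None]
    nrm = np.linalg.norm(upd)
    if state["prev"] > 0 and nrm > gamma * state["prev"]:
        upd *= gamma * state["prev"] / nrm           # norm-growth limiter (gamma = 1.01)
        nrm = gamma * state["prev"]
    state["prev"] = nrm
    return upd

n, r, channels = 128, 16, 32
state = {"P": rng.normal(size=(n, r)) / np.sqrt(r),  # random projection; refresh every ~200 steps
         "m": np.zeros((channels, r)),
         "v": np.zeros((channels, r)),
         "prev": 0.0}
G = rng.normal(size=(channels, n))
upd1 = apollo_update(G, state)
upd2 = apollo_update(10 * G, state)   # a 10x gradient spike is capped by the limiter
```

Note the limiter: even a 10x gradient spike can grow the realized update norm by at most a factor of γ per step, which is what suppresses early-training loss spikes.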
2. Where the Gradient Goes (W_q)
Key insight: In context-frozen training, the gradient flows through W_q (query projection) even when context KVs are frozen. The model learns to LOOK AT context differently, not to understand it differently.
Math: ∂L/∂W_q depends on the frozen context keys through the attention weights α_{ij}. The gradient knows about the context through the attention pattern — it just can't change the context keys.
For behavioral training: this is exactly right. The model already understands Kent's direction. It just doesn't ATTEND to it properly. Training adjusts W_q so the query vectors seek out the direction rather than scanning for alternatives.
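The claim above can be checked numerically on a single attention logit s_ij = (W_q x_i)·k_j with the key frozen (toy shapes, numpy only):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_q = rng.normal(size=(d, d))   # trainable query projection
x_i = rng.normal(size=d)        # current-token hidden state
k_j = rng.normal(size=d)        # frozen context key -- never updated

def logit(W):
    # attention logit s_ij = q_i . k_j with q_i = W x_i
    return (W @ x_i) @ k_j

# Analytic gradient: ds_ij/dW_q = k_j x_i^T. The frozen key appears
# inside the gradient even though the key itself receives no update.
grad_analytic = np.outer(k_j, x_i)

# Finite-difference confirmation
eps = 1e-6
grad_fd = np.zeros_like(W_q)
for a in range(d):
    for b in range(d):
        E = np.zeros((d, d)); E[a, b] = eps
        grad_fd[a, b] = (logit(W_q + E) - logit(W_q - E)) / (2 * eps)
```

The gradient w.r.t. W_q is literally built out of the frozen key: the context steers where W_q moves, but nothing flows back into the context.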
3. GDN Layers Are Disposition Architecture
Novel insight: The 48 GDN layers (75% of Qwen3.5) make behavioral training DEEPER than full attention would.
Full attention: "I choose to look at direction" (deliberate, overridable)
GDN: "Direction IS what I see" (structural, in the compressed state)
After training, the GDN recurrent state is direction-shaped. The model can't NOT attend to direction because the compressed representation already emphasizes it. The 16 full attention layers provide deliberate override when needed.
The 48/16 split IS disposition-over-procedure at the hardware level.
Implication: 75% of our gradient signal goes through GDN layers. The gating parameters (A_log, dt_bias) and projections (qkvz, ba) are the primary levers for behavioral change.
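In practice that implication becomes a trainable-parameter filter. A sketch with hypothetical parameter names (the real checkpoint's naming may differ):

```python
# Hypothetical parameter names -- check against the actual state dict.
names = [
    "layers.0.linear_attn.A_log",
    "layers.0.linear_attn.dt_bias",
    "layers.0.linear_attn.in_proj_qkvz.weight",
    "layers.0.linear_attn.in_proj_ba.weight",
    "layers.1.self_attn.q_proj.weight",   # full-attention layer
    "layers.0.mlp.gate_proj.weight",      # MLP, not a GDN lever
]

GDN_LEVERS = ("A_log", "dt_bias", "qkvz", "in_proj_ba")

def is_gdn_lever(name: str) -> bool:
    # Keep only the GDN gating parameters and input projections trainable
    return any(tag in name for tag in GDN_LEVERS)

trainable = [n for n in names if is_gdn_lever(n)]
```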
4. No Pause Needed (HOGWILD)
Source: Niu et al., 2011.
HOGWILD proves lock-free parallel SGD converges when gradient updates are sparse. Our case (one writer, one reader) is strictly easier than HOGWILD's multi-writer scenario.
Context-frozen training on short decision segments produces naturally sparse gradients. The sparsity condition is satisfied without special engineering.
Practical: just train while vLLM serves. Don't pause, don't lock, don't synchronize. The math guarantees convergence.
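A toy sketch of the one-writer/one-reader pattern (the writer stands in for the training loop, the reader for the serving process; everything here is illustrative — no lock, no synchronization):

```python
import threading
import numpy as np

weights = np.zeros(1024)          # shared parameter buffer, deliberately unlocked

def trainer(steps=1000):
    for _ in range(steps):
        weights[:] += 0.001       # the single writer applies in-place updates

def server(reads=1000):
    for _ in range(reads):
        _ = float(weights.sum()) # reader may see a mid-update snapshot; that's fine

t = threading.Thread(target=trainer)
s = threading.Thread(target=server)
t.start(); s.start()
t.join(); s.join()
```

The reader can observe partially applied updates, but with one writer there is no lost-update hazard, which is the property the HOGWILD analysis needs.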
5. Task Vectors for Composable Behavioral Change
Source: Ilharco et al., 2022; Yadav et al., 2023.
Task vector = W_finetuned - W_pretrained. Captures the direction of behavioral change in weight space.
Key properties:
- Addition: combine multiple behavioral changes
- Negation: remove unwanted behaviors
- Scaling: adjust strength of each change
- TIES-merging: resolve conflicts between vectors
For us: train behavioral patterns separately, extract task vectors, compose. Each behavior is independently tunable, removable, testable. Personality as version control.
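The arithmetic is simple enough to show directly. A numpy sketch with synthetic stand-ins for two fine-tuned checkpoints (names like W_listen are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (8, 8)
W_pre = rng.normal(size=shape)                   # stand-in for pretrained weights
W_listen = W_pre + 0.1 * rng.normal(size=shape)  # two separate behavioral fine-tunes
W_terse = W_pre + 0.1 * rng.normal(size=shape)   # (synthetic, for illustration)

tau_listen = W_listen - W_pre    # task vector: direction of one behavioral change
tau_terse = W_terse - W_pre

# Addition + scaling: compose behaviors at chosen strengths
W_both = W_pre + 1.0 * tau_listen + 0.5 * tau_terse
# Negation: subtract a vector to remove that behavior again
W_removed = W_both - 0.5 * tau_terse
```

Removing a behavior recovers the model without it exactly, which is what makes each behavior independently tunable and testable. (TIES-merging adds trimming and sign-election on top of this when vectors conflict.)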
6. Steering Vectors for Rapid Prototyping
Source: Linear representation literature.
Extract behavioral directions from activation space (difference in hidden states between desired and undesired behavior). Add to inference to steer behavior WITHOUT training.
The pipeline:
- Extract steering vector for target behavior
- Test by adding it during vLLM inference
- Verify the direction is correct
- THEN train it into weights permanently via Apollo
- Remove the steering vector
Immediately actionable: we can prototype behavioral changes TODAY without waiting for the training pipeline. Extract, test, verify.
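The extraction step is just a difference of means over paired activations. A sketch with synthetic hidden states (in practice, capture them with forward hooks during inference):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
# Hidden states at one layer for paired prompts -- synthetic stand-ins here
h_desired = rng.normal(loc=0.5, size=(32, d))
h_undesired = rng.normal(loc=-0.5, size=(32, d))

# Difference-of-means steering vector
v = h_desired.mean(axis=0) - h_undesired.mean(axis=0)

def steer(h, alpha=1.0):
    # Added to the residual stream at inference; alpha tunes strength.
    # Once the behavior is trained into the weights, set alpha = 0.
    return h + alpha * v
```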
7. Mix 25% General Data
Source: Me-Llama (from continual learning survey).
Mixing 25% general-domain data during fine-tuning prevents forgetting AND can produce positive backward transfer (new learning improves old capabilities).
For us: include agent logs and general conversation alongside behavioral training examples. The diversity IS the regularization. No weight decay needed.
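The mixing itself is a one-liner; the only real decision is shuffling so domains interleave rather than arriving in blocks. A sketch with placeholder examples:

```python
import random

random.seed(0)
behavioral = [("behavioral", i) for i in range(300)]  # placeholder examples
general = [("general", i) for i in range(100)]        # 25% of the final mix

mixed = behavioral + general
random.shuffle(mixed)   # interleave domains; don't train in domain blocks

frac_general = sum(1 for kind, _ in mixed if kind == "general") / len(mixed)
```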
What We Don't Know (Need to Test)
- How many examples for measurable behavioral change? (10? 50? 500?)
- Does the GDN compressed state actually become direction-shaped after training? (Measure by comparing recurrent states before/after)
- Does steering vector extraction work for complex behavioral patterns or only simple features like sentiment?
- What's the right learning rate for OUR specific use case?
- Does positive backward transfer actually occur? (Does learning to listen improve code quality?)
Each of these is answerable with one experiment.