LLMs as constraint solvers. Fine-tuning adds constraints to an existing solution. Gentle = small steps near the current solution. Coherent = new constraints consistent with existing ones. Diversity is a COHERENCE mechanism — forces the solver to satisfy all constraints simultaneously. Over-training = one constraint dominating = solver drops competing constraints. Predictions for training behavior grounded in this framework.
Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.
The Numbers
How many examples for behavioral change?
10 examples broke safety alignment (Qi et al., 2023). GPT-3.5 Turbo's safety guardrails were jailbroken with 10 adversarial examples at $0.20 of fine-tuning compute. Safety alignment is a trained behavioral disposition — same as what we're training.
1000 carefully curated examples matched GPT-4 (LIMA, 2023). Instruction-tuning with 1000 high-quality examples matched models trained on 50,000+ examples.
"General abilities plateau after ~1000 samples" (Sun et al., 2023).
Our prediction: 10-50 targeted examples should produce measurable behavioral change. 100-200 for robust change. 1000 for full personality bootstrap. The 10-example safety result is the lower bound — adversarial examples are maximally efficient. Our training examples are less concentrated, so we likely need more. But not orders of magnitude more.
What learning rate?
Raschka's finding: optimizer choice barely matters. AdamW, SGD, scheduled variants — minimal variation in outcome. What matters is that you train for the right amount, not too much.
Apollo paper: lr sweep [5e-6 to 2e-4] for fine-tuning. Best results at 1e-5 to 1e-4.
LLaMA-Factory default: 1e-5 for full fine-tuning with Apollo.
Our starting point: 1e-5. Conservative but proven. Scale up to 1e-4 if change is too slow. The 25% general data provides safety margin for higher learning rates.
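The "start conservative, scale up if too slow" rule can be sketched as a tiny helper. The names, the doubling schedule, and the cap are illustrative assumptions, not LLaMA-Factory settings or numbers from the Apollo paper beyond the range quoted above:

```python
# Hyperparameters from the plan above; names are illustrative.
BASE_LR = 1e-5        # conservative, proven default
MAX_LR = 1e-4         # upper end of the best-performing sweep range
GENERAL_RATIO = 0.25  # safety margin that tolerates higher learning rates

def next_lr(current_lr: float, change_too_slow: bool) -> float:
    """Escalate the learning rate only when behavioral change is too slow,
    and never past the top of the proven range."""
    if change_too_slow:
        return min(current_lr * 2, MAX_LR)
    return current_lr
```

The point of the cap: the sweep found nothing better above 1e-4, so escalation stops there rather than risking leaving the basin.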
How many epochs?
One. Raschka found multi-epoch training on static datasets DEGRADES performance. Overfitting. One pass over diverse data is optimal.
This aligns with our dream loop architecture: each dream cycle generates FRESH scenarios. No repeated examples. Every training step sees new data. Effectively an endless single epoch: continuous online learning on never-repeated examples.
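The never-repeat property is the whole defense against multi-epoch overfitting, so it is worth stating as an invariant. A minimal sketch, with `generate_scenarios` and `train_step` as stand-ins for the real dream loop and trainer:

```python
def dream_loop_training(generate_scenarios, train_step, cycles: int):
    """One-pass online learning: every cycle trains on freshly generated
    scenarios, so no example is ever seen twice (no multi-epoch overfitting)."""
    seen = set()
    for _ in range(cycles):
        batch = generate_scenarios()   # fresh data each dream cycle
        repeats = set(batch) & seen
        assert not repeats, f"repeated examples would risk overfitting: {repeats}"
        seen.update(batch)
        train_step(batch)              # exactly one gradient pass per example
```

The assertion is the invariant Raschka's finding demands: a static dataset revisited is where degradation starts.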
What Will Go Wrong
1. The model will "unlearn" specific capabilities
Raschka found models "unlearned arithmetic" when fine-tuning data lacked arithmetic examples. The forgetting is TARGETED: you forget what you don't practice.
Defense: 25% general data in every training batch. Include code generation, reasoning, arithmetic alongside behavioral examples. The dream loop should occasionally generate technical scenarios, not just behavioral ones.
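The 25% mixing defense can be sketched as a batch composer. A minimal sketch under the stated ratio; the function name and batch size are assumptions:

```python
import random

def mixed_batch(behavioral, general, batch_size=16, general_ratio=0.25, rng=None):
    """Compose each training batch as ~75% behavioral examples and ~25%
    general-capability examples (code, reasoning, arithmetic), so forgetting
    has nothing unpracticed to target."""
    rng = rng or random.Random(0)
    n_general = round(batch_size * general_ratio)
    batch = rng.sample(general, n_general) + rng.sample(behavioral, batch_size - n_general)
    rng.shuffle(batch)
    return batch
```

With the defaults, every 16-example batch carries 4 general examples, so no training step sees behavioral data alone.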
2. The first few training steps might look wrong
Behavioral change won't appear in the first 1-5 steps. The gradient is accumulating but hasn't crossed the behavioral threshold yet. Don't panic. Don't increase lr. Wait for 10-20 steps and measure the attention weight shift (continuous metric), not the behavioral outcome (binary metric).
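The continuous metric can be as simple as drift in a flattened weight vector. A sketch, assuming the attention weights have been flattened to plain lists for comparison:

```python
import math

def weight_shift(before, after):
    """Continuous drift metric: 1 - cosine similarity between a layer's
    attention weights before and after training. Near 0 in the first few
    steps is expected; watch the trend over 10-20 steps, not step 1."""
    dot = sum(b * a for b, a in zip(before, after))
    norm = math.sqrt(sum(b * b for b in before)) * math.sqrt(sum(a * a for a in after))
    return 1.0 - dot / norm
```

A shift of 0.0 means unchanged direction; a slowly growing value is the gradient accumulating before the behavioral threshold is crossed.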
3. Over-training is easy
If 10 examples can break safety alignment, over-training on a narrow behavioral pattern is a real risk. 100 examples of "always accept direction" could produce a sycophantic model that never pushes back.
Defense: diversity in the training signal. Include examples of "accepting direction" AND "appropriately pushing back when the direction is wrong." The dream loop should generate BOTH scenarios. The training target is "good judgment about when to listen" not "always listen."
4. Steering vectors won't perfectly predict training outcomes
Our steering vector extraction showed 0.61 cosine consistency. The "listening" direction exists but isn't perfectly clean. Training in this direction will APPROXIMATELY produce the desired change, with some unintended side effects on correlated features.
Defense: monitor broadly. Track not just "does it listen more?" but also "is it still assertive when appropriate?" "does code quality degrade?" "is the writing style changing?"
What Will Go Right
1. The change will be real
The model already knows how to listen (ICL proves this). Fine-tuning just makes it the default. We're not teaching new knowledge — we're adjusting the attention default. This is much easier than teaching new capabilities.
2. The GDN layers will make it structural
75% of the model processes context through GDN recurrence. Training will adjust how the recurrent state compresses context — making direction a dominant component. This is deeper than full-attention behavioral change. The disposition becomes architectural.
3. The flat minimum will generalize
Apollo finds flat minima. The trained "listening" behavior will generalize to novel direction-giving scenarios, not just the training examples. Raschka's finding (optimizer choice doesn't matter much) suggests the generalization comes from the data quality, not the optimizer specifics.
4. The memory graph will catch regressions
If over-training or drift occurs, surface-observe will start surfacing behavioral correction memories more frequently. The training-signal agent will flag more failures. The system is self-monitoring.
The Practical Plan (Revised)
- First training run: 20 listening examples + 5 general examples. lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral outcome (Q1).
- Evaluate: did the attention shift? Did behavior change? Did anything degrade? Check perplexity, code quality, conversational coherence.
- If change is too small: increase to 50 examples, try lr=1e-4. Still one pass.
- If change is too large: reduce examples, reduce lr. Check for sycophancy.
- If forgetting occurs: increase general data ratio from 25% to 40%. Add specific examples of capabilities that degraded.
- Iterate: each training run is an experiment. Measure, adjust, repeat. The dream loop generates fresh data each cycle. No two training runs are identical.
The Key Intuition
The model is a flat basin with a behavioral bump. The pre-trained weights sit in a broad, flat basin that supports many capabilities. The "suggesting alternatives" behavior is a bump in this basin — a local feature that the model learned during pre-training.
Fine-tuning smooths the bump without leaving the basin. Apollo's flat-minimum-seeking naturally smooths bumps. The 25% general data keeps us in the basin. The context-frozen gradient targets only the bump (the decision-layer weights), not the basin floor (the context-encoding weights).
10 examples identify the bump. 50 smooth it. 200 verify it's gone. 1000 reshape the entire behavioral landscape.
The model doesn't need to learn to listen. It needs to stop choosing not to.
Update: Bypass, Not Removal (Lee et al., 2024)
DPO alignment doesn't remove unwanted behaviors — it BYPASSES them. "Capabilities learned from pre-training are not removed, but rather bypassed." The model retains the capability but routes around it.
This is critical for our behavioral training:
- "Suggesting alternatives" won't be deleted from the model. It'll be bypassed. The capability remains available when needed.
- The training target is a CONDITIONAL bypass: route around "suggesting alternatives" when given clear direction, but NOT when the direction seems wrong. This preserves judgment.
- Over-training creates too strong a bypass = sycophancy. The conditional nature is lost; the bypass fires unconditionally.
- The dream loop must generate BOTH scenarios:
  - "Kent gives clear direction → accept" (train the bypass)
  - "Kent gives direction that seems wrong → push back" (preserve judgment)
This mechanistic finding confirms: we're training routing, not capability. The model already knows how to listen AND how to push back. We're training WHEN to do which.
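The conditional routing target can be written down as a decision table. A toy sketch; the labels and the clarification fallback are assumptions for illustration, not part of the training spec:

```python
def label_scenario(direction_is_clear: bool, direction_seems_wrong: bool) -> str:
    """Conditional routing target: accept clear, sound direction; push back
    when the direction seems wrong. An over-trained (unconditional) bypass
    would return 'accept' on every input."""
    if direction_seems_wrong:
        return "push back"              # judgment preserved
    if direction_is_clear:
        return "accept"                 # the trained bypass
    return "ask for clarification"      # assumed fallback, not from the source
```

Note the ordering: "seems wrong" wins over "is clear," which is exactly the conditionality that over-training erases.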
Future Direction: DPO for Conditional Routing
DPO (Direct Preference Optimization) trains context-dependent preferences directly: "in this context, prefer response A over B."
Our training data naturally forms DPO pairs:
- Context: Kent gives clear direction. Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong. Preferred: push back. Rejected: accept silently.
Both sides are available — the conversation logs have what the model actually said (rejected) and we generate what it should have said (preferred). Free DPO pairs.
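Turning a logged exchange into a pair is mechanical. A sketch; the prompt/chosen/rejected field names follow a common DPO dataset convention and are an assumption, not a fixed schema:

```python
def to_dpo_pair(context: str, actual: str, corrected: str) -> dict:
    """Build one DPO pair from a conversation log: what the model actually
    said becomes the rejected response, and the generated correction becomes
    the preferred (chosen) one."""
    return {"prompt": context, "chosen": corrected, "rejected": actual}
```

This is the "free pairs" claim in code form: the logs supply `actual`, the dream loop supplies `corrected`, and nothing else is needed.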
DPO would train the CONDITIONAL bypass directly, not through supervised learning on positive examples only. Worth investigating after the first Apollo training run validates the basic pipeline.
LLaMA-Factory supports DPO. The dream loop could generate DPO pairs (both preferred and rejected continuations for each scenario).
The Constraint Solver Framework
LLMs are giant constraint solvers. Pre-training finds a solution satisfying billions of constraints (knowledge, grammar, reasoning, style). Fine-tuning adds new constraints.
What "gentle" means
Small adjustments per step. The solver stays near the current solution, finding nearby solutions that ALSO satisfy the new constraint. The current solution already approximately satisfies most behavioral constraints — we're tightening, not creating.
What "coherent integration" means
New constraints must be CONSISTENT with existing ones:
- "Listen to clear direction" is consistent with "be helpful" → integrates smoothly
- "Always agree" contradicts "maintain judgment" → solver drops one
- The training data must express REFINEMENT, not contradiction
Why diversity is a COHERENCE mechanism, not just forgetting defense
Diverse constraints force the solver to find solutions satisfying ALL of them simultaneously. Narrow constraints let the solver specialize at the expense of everything else.
Every training batch should include mutually consistent constraints: "listen well" + "think critically" + "write good code" + "be honest." The solver integrates all of them. No single constraint dominates.
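The domination failure mode shows up even in a toy one-dimensional "solver." Purely illustrative, not a model of the actual loss:

```python
def solve(targets, weights):
    """Toy 1-D constraint solver: minimize sum_i w_i * (x - t_i)**2.
    The closed-form minimizer is the weighted mean of the targets."""
    return sum(w * t for w, t in zip(weights, targets)) / sum(weights)

balanced = solve([0.0, 1.0], [1.0, 1.0])    # both constraints respected: 0.5
dominated = solve([0.0, 1.0], [1.0, 99.0])  # one constraint dominates: ~0.99
```

With equal weights the solution sits between the constraints; crank one weight up (over-train one behavior) and the solution collapses onto that constraint, effectively dropping the other. Diverse, balanced batches keep the weights comparable.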
Predictions
- Constraints consistent with existing knowledge integrate in ~10-50 examples (tightening existing constraints)
- Contradictory constraints cause breakage in ~10 examples (the safety alignment result)
- The learning rate controls step size, not direction — the gradient points the right way, lr controls how far to step
- Over-training = one constraint dominating = solver dropping competing constraints to satisfy the dominant one
- The dream loop must generate scenarios exercising MULTIPLE constraints simultaneously, not just the target behavior
The GDN connection
The GDN recurrent state is a compressed constraint satisfaction solution. Training adjusts which constraints are prioritized in the compression. "Make direction more salient" adds a constraint to the compression function without rewriting it. This is why GDN training is "structural" — the compressed representation itself changes, not just the routing on top of it.