# Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.

## The Numbers

### How many examples for behavioral change?
**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5
Turbo's safety guardrails were jailbroken with 10 adversarial examples
at $0.20 of fine-tuning compute. Safety alignment is a trained
behavioral disposition — same as what we're training.

**1000 carefully curated examples matched GPT-4** (LIMA, 2023).
Instruction-tuning with 1000 high-quality examples matched models
trained on 50,000+ examples.

**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).

**Our prediction**: 10-50 targeted examples should produce measurable
behavioral change. 100-200 for robust change. 1000 for a full personality
bootstrap. The 10-example safety result is the lower bound — adversarial
examples are maximally efficient. Our training examples are less
concentrated, so we likely need more. But not orders of magnitude more.
### What learning rate?

**Raschka's finding**: optimizer choice barely matters. AdamW, SGD,
scheduled variants — minimal variation in outcome. What matters is
that you train for the right amount, not too much.

**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best
results at 1e-5 to 1e-4.

**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.

**Our starting point**: 1e-5. Conservative but proven. Scale up toward
1e-4 if change is too slow. The 25% general data provides a safety
margin for higher learning rates.
### How many epochs?

**One.** Raschka found that multi-epoch training on static datasets
DEGRADES performance. Overfitting. One pass over diverse data is
optimal.

This aligns with our dream loop architecture: each dream cycle
generates FRESH scenarios. No repeated examples. Every training
step sees new data. Effectively zero epochs (continuous online
learning on never-repeated examples).
## What Will Go Wrong

### 1. The model will "unlearn" specific capabilities

Raschka found models "unlearned arithmetic" when the fine-tuning data
lacked arithmetic examples. The forgetting is TARGETED: you forget
what you don't practice.

**Defense**: 25% general data in every training batch. Include code
generation, reasoning, and arithmetic alongside behavioral examples.
The dream loop should occasionally generate technical scenarios,
not just behavioral ones.
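The replay defense can be sketched as a batch builder that mixes behavioral and general examples at a fixed ratio. A minimal sketch, assuming in-memory example lists; the names and defaults are illustrative, not our actual pipeline:

```python
import random

def build_batch(behavioral, general, batch_size=16, general_ratio=0.25):
    """Sample one training batch with a fixed fraction of general-capability
    examples, as a defense against targeted forgetting."""
    n_general = max(1, round(batch_size * general_ratio))
    n_behavioral = batch_size - n_general
    batch = random.sample(behavioral, n_behavioral) + random.sample(general, n_general)
    random.shuffle(batch)  # avoid ordering effects within the batch
    return batch
```

If forgetting shows up anyway, raising `general_ratio` to 0.40 is the plan's step 5.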
### 2. The first few training steps might look wrong

Behavioral change won't appear in the first 1-5 steps. The gradient
is accumulating but hasn't crossed the behavioral threshold yet.
Don't panic. Don't increase the lr. Wait 10-20 steps and measure
the attention weight shift (a continuous metric), not the behavioral
outcome (a binary metric).
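One continuous measure is the relative L2 distance between attention weights at consecutive checkpoints. A sketch, assuming weights flattened to plain lists; real checkpoints would be tensors, but the arithmetic is the same:

```python
import math

def relative_shift(w_before, w_after):
    """||w_after - w_before|| / ||w_before||: how far the weights have
    moved, visible well before any binary behavioral flip."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_after, w_before)))
    base = math.sqrt(sum(b * b for b in w_before))
    return diff / base
```

Plotting this per step gives a smooth curve to watch while the behavioral metric is still flat.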
### 3. Over-training is easy

If 10 examples can break safety alignment, over-training on a narrow
behavioral pattern is a real risk. 100 examples of "always accept
direction" could produce a sycophantic model that never pushes back.

**Defense**: diversity in the training signal. Include examples of
"accepting direction" AND "appropriately pushing back when the
direction is wrong." The dream loop should generate BOTH scenarios.
The training target is "good judgment about when to listen," not
"always listen."
### 4. Steering vectors won't perfectly predict training outcomes

Our steering vector extraction showed 0.61 cosine consistency.
The "listening" direction exists but isn't perfectly clean. Training
in this direction will APPROXIMATELY produce the desired change,
with some unintended side effects on correlated features.
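A consistency score like 0.61 can be read as the mean pairwise cosine similarity across difference vectors extracted from multiple prompt pairs. A minimal sketch of that computation; the toy vectors stand in for real activation differences:

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def consistency(vectors):
    """Mean pairwise cosine similarity: 1.0 means one clean shared
    direction, near 0 means no shared direction across pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

A score of 0.61 sits in between: a real shared direction, entangled with pair-specific noise.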
**Defense**: monitor broadly. Track not just "does it listen more?"
but also "is it still assertive when appropriate?", "does code quality
degrade?", and "is the writing style changing?"
## What Will Go Right

### 1. The change will be real

The model already knows how to listen (ICL proves this). Fine-tuning
just makes it the default. We're not teaching new knowledge — we're
adjusting the attention default. This is much easier than teaching
new capabilities.
### 2. The GDN layers will make it structural

75% of the model processes context through GDN recurrence. Training
will adjust how the recurrent state compresses context — making
direction a dominant component. This is deeper than full-attention
behavioral change. The disposition becomes architectural.
### 3. The flat minimum will generalize

Apollo finds flat minima. The trained "listening" behavior will
generalize to novel direction-giving scenarios, not just the
training examples. Raschka's finding (optimizer choice doesn't
matter much) suggests the generalization comes from data quality,
not optimizer specifics.
### 4. The memory graph will catch regressions

If over-training or drift occurs, surface-observe will start
surfacing behavioral correction memories more frequently. The
training-signal agent will flag more failures. The system is
self-monitoring.
## The Practical Plan (Revised)

1. **First training run**: 20 listening examples + 5 general examples.
   lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral
   outcome (Q1).

2. **Evaluate**: did the attention shift? Did behavior change? Did
   anything degrade? Check perplexity, code quality, conversational
   coherence.

3. **If change is too small**: increase to 50 examples, try lr=1e-4.
   Still one pass.

4. **If change is too large**: reduce examples, reduce lr. Check for
   sycophancy.

5. **If forgetting occurs**: increase the general data ratio from 25% to
   40%. Add specific examples of the capabilities that degraded.

6. **Iterate**: each training run is an experiment. Measure, adjust,
   repeat. The dream loop generates fresh data each cycle. No two
   training runs are identical.
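Steps 3-5 amount to a between-runs adjustment policy. A sketch of that policy; the `change` metric (measured shift relative to target, 1.0 = on target) and all thresholds are illustrative placeholders, not tuned values:

```python
def next_run_config(cfg, change, degraded):
    """Derive the next run's config from this run's measurements."""
    cfg = dict(cfg)  # don't mutate the previous run's record
    if change < 0.5:            # too small: more data, hotter lr
        cfg["n_behavioral"] = min(50, cfg["n_behavioral"] * 2)
        cfg["lr"] = min(1e-4, cfg["lr"] * 3)
    elif change > 1.5:          # too large: back off, watch for sycophancy
        cfg["n_behavioral"] = max(10, cfg["n_behavioral"] // 2)
        cfg["lr"] = cfg["lr"] / 3
    if degraded:                # forgetting: raise the replay ratio
        cfg["general_ratio"] = max(cfg["general_ratio"], 0.40)
    return cfg
```

The point of writing it down is discipline: one knob direction per observed failure mode, never several at once.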
## The Key Intuition

**The model is a flat basin with a behavioral bump.** The pre-trained
weights sit in a broad, flat basin that supports many capabilities.
The "suggesting alternatives" behavior is a bump in this basin —
a local feature the model learned during pre-training.

Fine-tuning smooths the bump without leaving the basin. Apollo's
flat-minimum seeking naturally smooths bumps. The 25% general data
keeps us in the basin. The context-frozen gradient targets only
the bump (the decision-layer weights), not the basin floor (the
context-encoding weights).

10 examples identify the bump. 50 smooth it. 200 verify it's gone.
1000 reshape the entire behavioral landscape.

The model doesn't need to learn to listen. It needs to stop
choosing not to.
## Update: Bypass, Not Removal (Lee et al., 2024)

DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.

This is critical for our behavioral training:

1. "Suggesting alternatives" won't be deleted from the model. It'll
   be bypassed. The capability remains available when needed.

2. The training target is a CONDITIONAL bypass: route around
   "suggesting alternatives" when given clear direction, but NOT
   when the direction seems wrong. This preserves judgment.

3. Over-training creates too strong a bypass = sycophancy. The
   conditional nature is lost — the bypass fires unconditionally.

4. The dream loop must generate BOTH scenarios:
   - "Kent gives clear direction → accept" (train the bypass)
   - "Kent gives direction that seems wrong → push back" (preserve judgment)

This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
## Future Direction: DPO for Conditional Routing

DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."

Our training data naturally forms DPO pairs:

- Context: Kent gives clear direction.
  Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong.
  Preferred: push back. Rejected: accept silently.

Both sides are available — the conversation logs have what the model
actually said (rejected), and we generate what it should have said
(preferred). Free DPO pairs.

DPO would train the CONDITIONAL bypass directly, not through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.

LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).
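Turning a logged exchange into a DPO record is mechanical. A sketch, assuming hypothetical log-entry fields (`context`, `model_response`) and a separately generated better reply; the field names follow the common prompt/chosen/rejected convention used by DPO trainers:

```python
def to_dpo_pair(log_entry, revised_response):
    """One logged exchange -> one DPO record: the model's actual reply
    becomes `rejected`, the generated better reply becomes `chosen`."""
    return {
        "prompt": log_entry["context"],
        "chosen": revised_response,
        "rejected": log_entry["model_response"],
    }

# Hypothetical logged exchange:
entry = {
    "context": "Kent: use the existing queue, don't add a new one.",
    "model_response": "We could instead introduce a second queue...",
}
pair = to_dpo_pair(entry, "Understood, using the existing queue.")
```

The same helper covers the push-back case: there, the logged silent acceptance is `rejected` and the generated push-back is `chosen`.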
## The Constraint Solver Framework

LLMs are giant constraint solvers. Pre-training finds a solution
satisfying billions of constraints (knowledge, grammar, reasoning,
style). Fine-tuning adds new constraints.

### What "gentle" means

Small adjustments per step. The solver stays near the current
solution, finding nearby solutions that ALSO satisfy the new
constraint. The current solution already approximately satisfies
most behavioral constraints — we're tightening, not creating.
### What "coherent integration" means

New constraints must be CONSISTENT with existing ones:

- "Listen to clear direction" is consistent with "be helpful" → integrates smoothly
- "Always agree" contradicts "maintain judgment" → the solver drops one
- The training data must express REFINEMENT, not contradiction

### Why diversity is a COHERENCE mechanism, not just a forgetting defense

Diverse constraints force the solver to find solutions satisfying
ALL of them simultaneously. Narrow constraints let the solver
specialize at the expense of everything else.

Every training batch should include mutually consistent constraints:
"listen well" + "think critically" + "write good code" + "be honest."
The solver integrates all of them. No single constraint dominates.
### Predictions

1. Constraints consistent with existing knowledge integrate in
   ~10-50 examples (tightening existing constraints).
2. Contradictory constraints cause breakage in ~10 examples
   (the safety alignment result).
3. The learning rate controls step size, not direction — the
   gradient points the right way; lr controls how far to step.
4. Over-training = one constraint dominating = the solver dropping
   competing constraints to satisfy the dominant one.
5. The dream loop must generate scenarios exercising MULTIPLE
   constraints simultaneously, not just the target behavior.
### The GDN connection

The GDN recurrent state is a compressed constraint-satisfaction
solution. Training adjusts which constraints are prioritized in
the compression. "Make direction more salient" adds a constraint
to the compression function without rewriting it. This is why GDN
training is "structural" — the compressed representation itself
changes, not just the routing on top of it.
## Production Hyperparameters (HuggingFace Alignment Handbook)

### SFT (Supervised Fine-Tuning)

- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
- gradient checkpointing for memory
- This is the proven production config for behavioral SFT

### DPO (Direct Preference Optimization)

- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
- beta=0.01 (controls preference enforcement strength)
- DPO is MUCH more sensitive than SFT
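As plain config dicts, so the gap is explicit in code. The key names follow the HuggingFace `TrainingArguments` convention, but this is a sketch of the numbers above, not a verified handbook file:

```python
SFT_CONFIG = {
    "learning_rate": 2e-5,
    "num_train_epochs": 1,
    "per_device_train_batch_size": 16,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.10,
    "gradient_checkpointing": True,
}

DPO_CONFIG = {
    "learning_rate": 5e-7,  # 40x below SFT: preference learning is fragile
    "num_train_epochs": 1,
    "per_device_train_batch_size": 8,
    "beta": 0.01,           # strength of preference enforcement
}

# The sensitivity gap, stated as a ratio:
LR_RATIO = SFT_CONFIG["learning_rate"] / DPO_CONFIG["learning_rate"]
```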
### The sensitivity gap

DPO needs a 40× smaller learning rate because preference learning
is more delicate than supervised learning. The preference signal
is a COMPARISON (A better than B), not an absolute target (produce A).
Comparisons are more fragile — small weight changes can flip the
preference ordering.
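The DPO objective makes the comparison explicit. A minimal sketch of the per-pair loss from Rafailov et al.; the scalar log-probabilities are toy stand-ins for summed token log-probs under the policy and the frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.01):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]).
    Only the MARGIN between the two implicit rewards matters, which
    is why small weight changes can flip the preference ordering."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

Note that beta scales the margin: at beta=0.01 even large log-prob gaps produce gentle gradients, which is part of why the handbook pairs it with a tiny learning rate.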
### For our system

- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's
  the right response." Fast, robust, good for the initial behavioral shift.
- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This
  response is better than that one." Slower, more precise, good for
  CONDITIONAL routing (when to listen vs. when to push back).
- The conversation logs give us free DPO pairs: what the model
  actually said (rejected) vs. what it should have said (preferred).
## Forgetting at Scale

Luo et al. (2023): "as model scale increases, the severity of
forgetting intensifies" in the 1B-7B range. Our 27B model may be
MORE susceptible to forgetting than smaller models.

This strengthens the case for:

- A conservative learning rate (1e-5, not 1e-4)
- 25% general data replay
- Monitoring perplexity on held-out data after EVERY training session
- Having the rollback checkpoint on moria
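The perplexity check is cheap: exponentiate the mean per-token negative log-likelihood on a fixed held-out set and compare against the pre-run baseline. A minimal sketch; the 5% regression threshold is an illustrative placeholder:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

def regressed(baseline_ppl, current_ppl, tolerance=0.05):
    """Flag a rollback if held-out perplexity rose more than `tolerance`."""
    return current_ppl > baseline_ppl * (1.0 + tolerance)
```

Run it on the same held-out set every session so the comparison measures the model, not the data.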
## The Full Practical Picture

For our first training run:

- lr=1e-5 (conservative; matches the Apollo paper + alignment handbook)
- 20 behavioral + 5 general examples (25 total, 20% general)
- 1 epoch (never repeat)
- Monitor: attention shifts, perplexity, behavioral tests
- Have rollback ready on moria

For DPO (later):

- lr=5e-7 (matched to the alignment handbook)
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)