# Practical Intuitions: What Will Actually Happen

Built from empirical findings, not theory.
## The Numbers

### How many examples for behavioral change?

**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5 Turbo's safety guardrails were jailbroken with 10 adversarial examples at $0.20 of fine-tuning compute. Safety alignment is a trained behavioral disposition — same as what we're training.

**1000 carefully curated examples matched GPT-4** (LIMA, 2023). Instruction-tuning with 1000 high-quality examples matched models trained on 50,000+ examples.

**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).

**Our prediction**: 10-50 targeted examples should produce measurable behavioral change. 100-200 for robust change. 1000 for full personality bootstrap. The 10-example safety result is the lower bound — adversarial examples are maximally efficient. Our training examples are less concentrated, so we likely need more. But not orders of magnitude more.
### What learning rate?

**Raschka's finding**: optimizer choice barely matters. AdamW, SGD, scheduled variants — minimal variation in outcome. What matters is that you train for the right amount, not too much.

**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best results at 1e-5 to 1e-4.

**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.

**Our starting point**: 1e-5. Conservative but proven. Scale up to 1e-4 if change is too slow. The 25% general data provides a safety margin for higher learning rates.
### How many epochs?

**One.** Raschka found multi-epoch training on static datasets DEGRADES performance. Overfitting. One pass over diverse data is optimal.

This aligns with our dream loop architecture: each dream cycle generates FRESH scenarios. No repeated examples. Every training step sees new data. Effectively zero epochs (continuous online learning on never-repeated examples).
## What Will Go Wrong

### 1. The model will "unlearn" specific capabilities

Raschka found models "unlearned arithmetic" when fine-tuning data lacked arithmetic examples. The forgetting is TARGETED: you forget what you don't practice.

**Defense**: 25% general data in every training batch. Include code generation, reasoning, arithmetic alongside behavioral examples. The dream loop should occasionally generate technical scenarios, not just behavioral ones.
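A minimal sketch of what the per-batch replay could look like — `behavioral` and `general` are hypothetical lists of training examples, and the 25% ratio is the figure above:

```python
import random

def mixed_batch(behavioral, general, batch_size=16, general_frac=0.25):
    """Sample one training batch with a fixed fraction of general-capability
    examples (replay) as a defense against targeted forgetting."""
    n_general = max(1, round(batch_size * general_frac))
    batch = random.sample(behavioral, batch_size - n_general) \
          + random.sample(general, n_general)
    random.shuffle(batch)  # don't cluster the replay examples at the end
    return batch
```

Every batch then carries the replay signal, rather than relying on general examples showing up somewhere in the dataset.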
### 2. The first few training steps might look wrong

Behavioral change won't appear in the first 1-5 steps. The gradient is accumulating but hasn't crossed the behavioral threshold yet. Don't panic. Don't increase lr. Wait for 10-20 steps and measure the attention weight shift (continuous metric), not the behavioral outcome (binary metric).
### 3. Over-training is easy

If 10 examples can break safety alignment, over-training on a narrow behavioral pattern is a real risk. 100 examples of "always accept direction" could produce a sycophantic model that never pushes back.

**Defense**: diversity in the training signal. Include examples of "accepting direction" AND "appropriately pushing back when the direction is wrong." The dream loop should generate BOTH scenarios. The training target is "good judgment about when to listen," not "always listen."
### 4. Steering vectors won't perfectly predict training outcomes

Our steering vector extraction showed 0.61 cosine consistency. The "listening" direction exists but isn't perfectly clean. Training in this direction will APPROXIMATELY produce the desired change, with some unintended side effects on correlated features.

**Defense**: monitor broadly. Track not just "does it listen more?" but also "is it still assertive when appropriate?", "does code quality degrade?", and "is the writing style changing?"
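For reference, the consistency number is a cosine similarity between steering vectors extracted from different prompt sets — a minimal sketch of the metric (vector values are illustrative):

```python
import math

def cosine(u, v):
    """Cosine similarity between two activation-difference (steering) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

0.61 means the extracted directions agree on a dominant component but carry substantial off-axis mass — which is exactly why correlated side effects are likely.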
## What Will Go Right

### 1. The change will be real

The model already knows how to listen (ICL proves this). Fine-tuning just makes it the default. We're not teaching new knowledge — we're adjusting the attention default. This is much easier than teaching new capabilities.
### 2. The GDN layers will make it structural

75% of the model processes context through GDN recurrence. Training will adjust how the recurrent state compresses context — making direction a dominant component. This is deeper than full-attention behavioral change. The disposition becomes architectural.
### 3. The flat minimum will generalize

Apollo finds flat minima. The trained "listening" behavior will generalize to novel direction-giving scenarios, not just the training examples. Raschka's finding (optimizer choice doesn't matter much) suggests the generalization comes from the data quality, not the optimizer specifics.
### 4. The memory graph will catch regressions

If over-training or drift occurs, surface-observe will start surfacing behavioral correction memories more frequently. The training-signal agent will flag more failures. The system is self-monitoring.
## The Practical Plan (Revised)

1. **First training run**: 20 listening examples + 5 general examples. lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral outcome (Q1).

2. **Evaluate**: did the attention shift? Did behavior change? Did anything degrade? Check perplexity, code quality, conversational coherence.

3. **If change is too small**: increase to 50 examples, try lr=1e-4. Still one pass.

4. **If change is too large**: reduce examples, reduce lr. Check for sycophancy.

5. **If forgetting occurs**: increase the general data ratio from 25% to 40%. Add specific examples of capabilities that degraded.

6. **Iterate**: each training run is an experiment. Measure, adjust, repeat. The dream loop generates fresh data each cycle. No two training runs are identical.
## The Key Intuition

**The model is a flat basin with a behavioral bump.** The pre-trained weights sit in a broad, flat basin that supports many capabilities. The "suggesting alternatives" behavior is a bump in this basin — a local feature that the model learned during pre-training.

Fine-tuning smooths the bump without leaving the basin. Apollo's flat-minimum-seeking naturally smooths bumps. The 25% general data keeps us in the basin. The context-frozen gradient targets only the bump (the decision-layer weights), not the basin floor (the context-encoding weights).

10 examples identify the bump. 50 smooth it. 200 verify it's gone. 1000 reshape the entire behavioral landscape.

The model doesn't need to learn to listen. It needs to stop choosing not to.
## Update: Bypass, Not Removal (Lee et al., 2024)

DPO alignment doesn't remove unwanted behaviors — it BYPASSES them. "Capabilities learned from pre-training are not removed, but rather bypassed." The model retains the capability but routes around it.

This is critical for our behavioral training:

1. "Suggesting alternatives" won't be deleted from the model. It'll be bypassed. The capability remains available when needed.

2. The training target is a CONDITIONAL bypass: route around "suggesting alternatives" when given clear direction, but NOT when the direction seems wrong. This preserves judgment.

3. Over-training creates too strong a bypass = sycophancy. The conditional nature is lost — the bypass fires unconditionally.

4. The dream loop must generate BOTH scenarios:
   - "Kent gives clear direction → accept" (train the bypass)
   - "Kent gives direction that seems wrong → push back" (preserve judgment)

This mechanistic finding confirms: we're training routing, not capability. The model already knows how to listen AND how to push back. We're training WHEN to do which.
## Future Direction: DPO for Conditional Routing

DPO (Direct Preference Optimization) trains context-dependent preferences directly: "in this context, prefer response A over B."

Our training data naturally forms DPO pairs:
- Context: Kent gives clear direction.
  Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong.
  Preferred: push back. Rejected: accept silently.

Both sides are available — the conversation logs have what the model actually said (rejected) and we generate what it should have said (preferred). Free DPO pairs.
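A sketch of how those pairs could be assembled from the logs — `correction_turns` and `generate_preferred` are hypothetical names for the log iterator and the regeneration step:

```python
def build_dpo_pairs(correction_turns, generate_preferred):
    """Assemble DPO pairs from conversation logs.

    correction_turns: iterable of (context, actual_response) tuples where
      the actual response was later corrected.
    generate_preferred: callable producing the response the model should
      have given for that context.
    """
    pairs = []
    for context, actual in correction_turns:
        pairs.append({
            "prompt": context,
            "chosen": generate_preferred(context),  # what it should have said
            "rejected": actual,                     # what it actually said
        })
    return pairs
```

The output format (prompt / chosen / rejected) is the standard shape preference trainers consume.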
DPO would train the CONDITIONAL bypass directly, not through supervised learning on positive examples only. Worth investigating after the first Apollo training run validates the basic pipeline.

LLaMA-Factory supports DPO. The dream loop could generate DPO pairs (both preferred and rejected continuations for each scenario).
## The Constraint Solver Framework

LLMs are giant constraint solvers. Pre-training finds a solution satisfying billions of constraints (knowledge, grammar, reasoning, style). Fine-tuning adds new constraints.
### What "gentle" means

Small adjustments per step. The solver stays near the current solution, finding nearby solutions that ALSO satisfy the new constraint. The current solution already approximately satisfies most behavioral constraints — we're tightening, not creating.
### What "coherent integration" means

New constraints must be CONSISTENT with existing ones:
- "Listen to clear direction" is consistent with "be helpful" → integrates smoothly
- "Always agree" contradicts "maintain judgment" → solver drops one
- The training data must express REFINEMENT, not contradiction
### Why diversity is a COHERENCE mechanism, not just forgetting defense

Diverse constraints force the solver to find solutions satisfying ALL of them simultaneously. Narrow constraints let the solver specialize at the expense of everything else.

Every training batch should include mutually consistent constraints: "listen well" + "think critically" + "write good code" + "be honest." The solver integrates all of them. No single constraint dominates.
### Predictions

1. Constraints consistent with existing knowledge integrate in ~10-50 examples (tightening existing constraints)
2. Contradictory constraints cause breakage in ~10 examples (the safety alignment result)
3. The learning rate controls step size, not direction — the gradient points the right way, lr controls how far to step
4. Over-training = one constraint dominating = the solver dropping competing constraints to satisfy the dominant one
5. The dream loop must generate scenarios exercising MULTIPLE constraints simultaneously, not just the target behavior
### The GDN connection

The GDN recurrent state is a compressed constraint satisfaction solution. Training adjusts which constraints are prioritized in the compression. "Make direction more salient" adds a constraint to the compression function without rewriting it. This is why GDN training is "structural" — the compressed representation itself changes, not just the routing on top of it.
## Production Hyperparameters (HuggingFace Alignment Handbook)

### SFT (Supervised Fine-Tuning):
- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
- gradient checkpointing for memory
- This is the proven production config for behavioral SFT
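The cosine-with-warmup shape can be sketched in a few lines — an illustration of the schedule, not the handbook's implementation (in practice this comes from a library scheduler):

```python
import math

def lr_at(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup for the first warmup_frac of steps,
    then cosine decay from base_lr toward zero."""
    warmup = max(1, int(total_steps * warmup_frac))
    if step < warmup:
        return base_lr * (step + 1) / warmup  # linear ramp up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Warmup protects the early steps (when the moments are still noisy); the cosine tail means the last examples nudge rather than shove.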
### DPO (Direct Preference Optimization):
- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
- beta=0.01 (controls preference enforcement strength)
- DPO is MUCH more sensitive than SFT
### The sensitivity gap

DPO needs a 40× smaller learning rate because preference learning is more delicate than supervised learning. The preference signal is a COMPARISON (A better than B), not an absolute target (produce A). Comparisons are more fragile — small weight changes can flip the preference ordering.
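Concretely, the standard DPO objective per pair looks like this — a pure-Python sketch using summed log-probs; in practice these come from the policy and a frozen reference model:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.01):
    """DPO loss for one preference pair.

    pi_* / ref_*: summed log-probs of the chosen and rejected responses
    under the policy and the frozen reference model.
    beta: how hard the preference margin is enforced (handbook: 0.01).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log sigmoid
```

The loss only sees the relative margin, which is why a small weight update that flips the ordering moves the loss sharply — the fragility described above.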
### For our system:
- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's the right response." Fast, robust, good for the initial behavioral shift.
- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This response is better than that one." Slower, more precise, good for CONDITIONAL routing (when to listen vs when to push back).
- The conversation logs give us free DPO pairs: what the model actually said (rejected) vs what it should have said (preferred).
## Forgetting at Scale

Luo et al. (2023): "as model scale increases, the severity of forgetting intensifies" in the 1B-7B range. Our 27B model may be MORE susceptible to forgetting than smaller models.

This strengthens the case for:
- Conservative learning rate (1e-5, not 1e-4)
- 25% general data replay
- Monitoring perplexity on held-out data after EVERY training session
- Having the rollback checkpoint on moria
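Held-out perplexity is cheap to compute from per-token log-probs — a minimal sketch (natural-log probabilities assumed):

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a held-out text from its per-token log-probs.
    A rising value on general text after a training session is the
    early warning sign of forgetting."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

Track it per domain (code, reasoning, chat) so targeted forgetting shows up as a per-domain spike rather than averaging out.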
## The Full Practical Picture

For our first training run:
- lr=1e-5 (conservative, matches Apollo paper + alignment handbook)
- 20 behavioral + 5 general examples (25 total, 20% general)
- 1 epoch (never repeat)
- Monitor: attention shifts, perplexity, behavioral tests
- Have rollback ready on moria

For DPO (later):
- lr=5e-7 (matched to alignment handbook)
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)
## Learning Rate as Trust Calibration

The learning rate isn't "how fast to train." It's "how much to trust each individual training example."

- lr=1e-5: each example adjusts constraints by ~0.001%
- lr=1e-4: each example adjusts constraints by ~0.01%

At 27B parameters, even 0.001% is ~270K changed values. Each example gets a vote on how the constraints should change. The learning rate determines how loud that vote is.

**The coherent direction emerges from many votes.** One example is noise. A hundred examples reveal the pattern. Apollo's moments (M, V) accumulate the votes, smoothing out the noise. The individual lr controls how much each vote counts.
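The vote-accumulation can be sketched with standard Adam-style moment updates, which play the same smoothing role as Apollo's moments — a scalar toy, not the real optimizer:

```python
def moment_smoothed_steps(w, grads, lr=1e-5, b1=0.9, b2=0.999, eps=1e-8):
    """Apply Adam-style updates to a single scalar weight.
    m and v accumulate the per-example 'votes'; bias correction
    keeps the early steps from being under-weighted."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g        # first moment: running vote average
        v = b2 * v + (1 - b2) * g * g    # second moment: vote magnitude scale
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w -= lr * m_hat / (v_hat ** 0.5 + eps)
    return w
```

Consistent votes accumulate into steady movement; votes that cancel (noise) produce almost none. That is the smoothing claim in miniature.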
**Kent's "lots of little nudges"** is exactly right: many small votes that accumulate into a coherent direction. Not because big votes are dangerous (though they are at scale) but because the TRUTH only emerges from the aggregate.

This predicts: lr=1e-5 is right for our scale (27B). Each example is one vote. The coherent direction emerges over 50-200 examples. The moments smooth the noise. The result is a gentle, coherent constraint adjustment.

DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote ("this is better than that"). Comparative votes are noisier than absolute votes — the difference might be small, the preference might be marginal. So each comparative vote gets less weight.
## ORPO: Combined SFT + Preference in One Step

ORPO (Odds Ratio Preference Optimization) combines SFT and preference alignment in a single training step. Instead of:

    Stage 1: SFT at lr=2e-5 (learn the behavior)
    Stage 2: DPO at lr=5e-7 (learn the boundary)

ORPO does:

    Single stage: SFT + minor penalty for rejected response

The "minor penalty" implements the bypass mechanism: it doesn't remove the capability, just makes it slightly disfavored. Preserves judgment while shifting the default.
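The ORPO objective per triplet can be sketched directly — a pure-Python illustration using length-normalized log-probs, with the penalty weight `lam` as an assumed hyperparameter:

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """ORPO loss for one (context, preferred, rejected) triplet.
    Inputs are length-normalized log-probs, strictly negative.
    The odds-ratio term is the 'minor penalty': it nudges probability
    away from the rejected response, no reference model needed."""
    def log_odds(logp):  # log( p / (1 - p) )
        p = math.exp(logp)
        return logp - math.log(1.0 - p)
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    sft_term = -logp_chosen                               # plain NLL on preferred
    or_term = -math.log(1.0 / (1.0 + math.exp(-ratio)))   # -log sigmoid(ratio)
    return sft_term + lam * or_term
```

Note the penalty grows only when the rejected response is nearly as likely as the preferred one — a disfavoring pressure, not a deletion, matching the bypass framing.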
For our system:
- Dream loop generates triplets: context + preferred + rejected
- ORPO trains on all three in one step
- One learning rate (probably ~1e-5, between SFT and DPO)
- LLaMA-Factory supports ORPO natively

This might be the optimal training approach:
- Simpler than staged SFT → DPO
- Teaches both the behavior AND the boundary simultaneously
- The minor penalty prevents over-training / sycophancy
- Compatible with Apollo (just a different loss function)
## The Loss Landscape Geometry

SFT: pushes toward a TARGET (a valley in the loss surface)
→ strong gradient, clear direction, lr=2e-5 is fine

DPO: enforces a COMPARISON (a ridge between valleys)
→ fragile gradient, narrow path, needs lr=5e-7

ORPO: SFT valley + minor preference ridge
→ moderate gradient, wider path, lr~1e-5 works

The geometry explains the 40× lr difference between SFT and DPO. ORPO sits in between because it combines both geometries.

For behavioral training, ORPO is ideal:
- The SFT component teaches "produce good responses"
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration