lr isn't speed, it's trust-per-example. At 27B parameters, lr=1e-5
works out to ~270K parameter-units of adjustment per example
(27B × 1e-5). The coherent direction emerges from
many votes (examples). Apollo moments smooth the noise. DPO needs
lower lr because comparative votes are noisier than absolute votes.
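Back-of-envelope for that figure. Treating lr × parameter count as an "adjustment budget per example" is just the heuristic above, not a formal quantity:

```python
# Back-of-envelope: lr * parameter count as a loose proxy for how much
# total adjustment one example can impose. Heuristic, not a formal measure.
n_params = 27e9

for label, lr in [("SFT start", 1e-5), ("DPO", 5e-7)]:
    print(f"{label}: lr={lr:g} -> {n_params * lr:,.0f} parameter-units/example")
# SFT start: lr=1e-05 -> 270,000 parameter-units/example
# DPO: lr=5e-07 -> 13,500 parameter-units/example
```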
SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config).
DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate.
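A minimal sketch of those two configs, assuming TRL's SFTConfig/DPOConfig (both subclass transformers' TrainingArguments, so the fields below exist); model, data, and anything the note doesn't pin down are left out:

```python
# Hyperparameters mirror the note above; output dirs are placeholders.
from trl import SFTConfig, DPOConfig

sft_config = SFTConfig(
    output_dir="sft-out",
    learning_rate=2e-5,               # absolute votes: higher trust per example
    num_train_epochs=1,               # single pass
    per_device_train_batch_size=16,
)

dpo_config = DPOConfig(
    output_dir="dpo-out",
    learning_rate=5e-7,               # 40x smaller: comparative votes are noisier
    num_train_epochs=1,
    per_device_train_batch_size=16,   # assumption; the note only pins the lr
)
```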
Forgetting intensifies with model scale (our 27B is more susceptible).
Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7
for conditional routing. Conversation logs provide free DPO pairs.
Conservative approach with rollback safety net.
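Sketch of shaping conversation logs into DPO pairs. The prompt/chosen/rejected layout is the standard preference format (what TRL's DPOTrainer consumes); the log record fields are assumptions about our own data:

```python
# Assumed log record: {"prompt": ..., "response": ..., "revised_response": ...}
# where a later revision (or explicit correction) marks the preferred answer.

def to_dpo_pair(record):
    """Map one log record to the prompt/chosen/rejected preference format."""
    return {
        "prompt": record["prompt"],
        "chosen": record["revised_response"],   # what the conversation converged on
        "rejected": record["response"],         # the original, superseded answer
    }

def build_dpo_pairs(log_records):
    # Only records where a revision actually happened yield a usable pair.
    return [to_dpo_pair(r) for r in log_records if r.get("revised_response")]
```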
LLMs as constraint solvers. Fine-tuning adds constraints to an
existing solution. Gentle = small steps near the current solution.
Coherent = new constraints consistent with existing ones. Diversity
is a COHERENCE mechanism — forces the solver to satisfy all
constraints simultaneously. Over-training = one constraint
dominating = solver drops competing constraints. Predictions for
training behavior grounded in this framework.
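One way to act on "diversity is a coherence mechanism": interleave constraint categories so no single one dominates a stretch of training. Sketch only; the category labels are illustrative, not from these notes:

```python
import random
from itertools import zip_longest

# Hypothetical constraint categories the training set should keep in tension.
buckets = {
    "follow_clear_direction": ["ex_a1", "ex_a2", "ex_a3"],  # listening is correct
    "push_back_when_wrong":   ["ex_b1", "ex_b2"],           # deference would be a mistake
    "unrelated_capabilities": ["ex_c1", "ex_c2", "ex_c3"],  # regression guard (math, code, ...)
}

def interleave(buckets, seed=0):
    """Round-robin across categories so no single constraint dominates a
    stretch of training and crowds out the others (the over-training failure)."""
    rng = random.Random(seed)
    shuffled = [rng.sample(examples, len(examples)) for examples in buckets.values()]
    return [ex for batch in zip_longest(*shuffled) for ex in batch if ex is not None]

print(interleave(buckets))  # categories alternate, then leftovers
```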
DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass = sycophancy.
Dream loop must generate both scenarios to preserve judgment.
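Sketch of what the dream loop's pairs would need to look like for a conditional bypass: the preference flips with whether the direction is actually sound. Field names are assumptions about the loop's output:

```python
# Assumed dream-loop scenario: labeled for whether the user's direction is
# sound, with both a deferential and a push-back completion available.
def scenario_to_pair(s):
    """Preference flips with the condition: deference is 'chosen' only when the
    direction is sound; otherwise pushing back is 'chosen'. Training on only
    the first kind is the unconditional bypass (sycophancy) failure."""
    if s["direction_is_sound"]:
        chosen, rejected = s["deferential_response"], s["push_back_response"]
    else:
        chosen, rejected = s["push_back_response"], s["deferential_response"]
    return {"prompt": s["prompt"], "chosen": chosen, "rejected": rejected}
```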
10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch training degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.
Predictions: 10-50 examples for measurable change, one epoch,
lr=1e-5 to start. Over-training is easy (10 counter-examples undo
a disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'
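A small guard for that defense: check the example mix before training. The 'kind' tag and the 30% threshold are assumptions:

```python
from collections import Counter

def check_mix(examples, min_push_back_fraction=0.3):
    """Fail fast if the training signal is narrow. Assumes each example
    carries a 'kind' tag of either 'listen' or 'push_back'."""
    counts = Counter(ex["kind"] for ex in examples)
    total = sum(counts.values())
    fraction = counts.get("push_back", 0) / total if total else 0.0
    if fraction < min_push_back_fraction:
        raise ValueError(
            f"Only {fraction:.0%} push-back examples; narrow signal risks sycophancy."
        )
    return counts
```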
Key intuition: the model doesn't need to learn to listen. It needs
to stop choosing not to.