SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config).
DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate.
Forgetting intensifies with model scale (our 27B is more susceptible).
Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7
for conditional routing. Conversation logs provide free DPO pairs.
Conservative approach with rollback safety net.
LLMs as constraint solvers. Fine-tuning adds constraints to an
existing solution. Gentle = small steps near the current solution.
Coherent = new constraints consistent with existing ones. Diversity
is a COHERENCE mechanism — forces the solver to satisfy all
constraints simultaneously. Over-training = one constraint
dominating = solver drops competing constraints. Predictions for
training behavior grounded in this framework.
DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass = sycophancy.
Dream loop must generate both scenarios to preserve judgment.
10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.
Predictions: 10-50 examples for measurable change, one epoch,
lr=1e-5 to start. Over-training is easy (10 counter-examples undo
a disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'
Key intuition: the model doesn't need to learn to listen. It needs
to stop choosing not to.