SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config). DPO: lr=5e-7, a learning rate 40x smaller. Preference learning is far more delicate. Forgetting intensifies with model scale (our 27B is more susceptible). Practical plan refined: start SFT at lr=1e-5, then move to DPO at 5e-7 for conditional routing. Conversation logs provide free DPO pairs. Conservative approach with a rollback safety net.
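A minimal sketch of the two-stage plan, assuming HuggingFace TRL's `SFTConfig`/`DPOConfig` (these config classes and their fields are real TRL arguments; the output paths, the conversation-log schema, and the `pairs_from_logs` helper are illustrative assumptions, not an existing pipeline):

```python
from trl import SFTConfig, DPOConfig

# Stage 1: SFT at a conservative lr (1e-5, below the 2e-5 production default).
sft_config = SFTConfig(
    output_dir="sft-checkpoint",          # hypothetical path
    learning_rate=1e-5,
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

# Stage 2: DPO at 5e-7, roughly 40x smaller, since preference learning
# perturbs the policy far more per unit of learning rate.
dpo_config = DPOConfig(
    output_dir="dpo-checkpoint",          # hypothetical path
    learning_rate=5e-7,
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

def pairs_from_logs(logs):
    """Turn conversation logs into DPO (prompt, chosen, rejected) triples.

    Hypothetical log schema: each entry holds a prompt plus an accepted
    reply and a rejected one (e.g. the response the user regenerated away
    from). Entries missing either side are skipped.
    """
    for entry in logs:
        if entry.get("accepted") and entry.get("rejected"):
            yield {
                "prompt": entry["prompt"],
                "chosen": entry["accepted"],
                "rejected": entry["rejected"],
            }
```

TRL's `DPOTrainer` consumes exactly this `prompt`/`chosen`/`rejected` column layout, so accepted-vs-regenerated responses from logs slot in directly; the only missing piece is capturing that feedback in the serving stack.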
Notes in this repo:

- v0/
- apollo-paper-analysis.md
- context-frozen-training.md
- gdn-gradient-flow.md
- gradient-flow-frozen-context.md
- hogwild-convergence.md
- OPEN-QUESTIONS.md
- practical-intuitions.md
- steering-vectors-bridge.md
- SUMMARY.md
- task-vectors-model-merging.md