diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 54d42e2..f1bfa91 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -353,3 +353,49 @@
 DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote ("this is
 better than that"). Comparative votes are noisier than absolute votes — the
 difference might be small, the preference might be marginal. So each comparative vote gets less weight.
+
+## ORPO: Combined SFT + Preference in One Step
+
+ORPO (Odds Ratio Preference Optimization) combines SFT and
+preference alignment in a single training step. Instead of:
+  Stage 1: SFT at lr=2e-5 (learn the behavior)
+  Stage 2: DPO at lr=5e-7 (learn the boundary)
+
+ORPO does:
+  Single stage: SFT + a minor penalty on the rejected response
+
+The "minor penalty" implements the bypass mechanism: it doesn't
+remove the capability, it just makes it slightly disfavored. That
+preserves judgment while shifting the default.
+
+For our system:
+- Dream loop generates triplets: context + preferred + rejected
+- ORPO trains on all three in one step
+- One learning rate (probably ~1e-5, between SFT and DPO)
+- LLaMA-Factory supports ORPO natively
+
+This might be the optimal training approach:
+- Simpler than staged SFT → DPO
+- Teaches both the behavior AND the boundary simultaneously
+- The minor penalty prevents over-training / sycophancy
+- Compatible with Apollo (just a different loss function)
+
+## The Loss Landscape Geometry
+
+SFT pushes toward a TARGET (a valley in the loss surface):
+  → strong gradient, clear direction, lr=2e-5 is fine
+
+DPO enforces a COMPARISON (a ridge between valleys):
+  → fragile gradient, narrow path, needs lr=5e-7
+
+ORPO is an SFT valley plus a minor preference ridge:
+  → moderate gradient, wider path, lr~1e-5 works
+
+The geometry explains the 40× lr difference between SFT and DPO.
+ORPO sits in between because it combines both geometries.
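The single-step objective sketched in the ORPO section above can be written down at the sequence level. This is a minimal illustration, not our training code: `avg_logp_chosen` and `avg_logp_rejected` stand for the model's average per-token log-probabilities of each response, and `lam` is an assumed weight on the odds-ratio term (λ in the ORPO paper, commonly around 0.1):

```python
import math

def log_odds(avg_logp: float) -> float:
    # log odds(y|x) = log p - log(1 - p), with p = exp(average token log-prob)
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    # SFT term: plain NLL on the chosen (preferred) response
    nll = -avg_logp_chosen
    # Odds-ratio term: -log sigmoid(log_odds(chosen) - log_odds(rejected)),
    # written in the numerically stable softplus form log(1 + e^-x)
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    penalty = math.log1p(math.exp(-ratio))
    return nll + lam * penalty

# Example: chosen response clearly preferred over rejected
loss = orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
```

With `lam=0` this reduces to plain SFT, and widening the chosen/rejected gap shrinks the penalty term, which is why it stays "minor" once the preference is learned rather than grinding the rejected behavior to zero.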
+
+For behavioral training, ORPO is ideal:
+- The SFT component teaches "produce good responses"
+- The preference component teaches "this response is better"
+- The minor penalty prevents over-optimization
+- Single pass, coherent integration
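Since the notes say LLaMA-Factory supports ORPO natively, a config sketch may help as a starting point. This assumes a recent LLaMA-Factory release, where ORPO is selected as a preference loss under the DPO stage (`pref_loss: orpo`); the model, dataset, and path values are placeholders, and key names should be checked against the installed version:

```yaml
# Sketch only; verify keys against your LLaMA-Factory version.
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder
stage: dpo               # ORPO is exposed through the preference-training stage
pref_loss: orpo          # odds-ratio loss instead of the default DPO sigmoid
pref_beta: 0.1           # weight on the odds-ratio ("minor penalty") term
do_train: true
finetuning_type: lora
lora_target: all
dataset: dream_triplets  # placeholder: context + preferred + rejected
template: llama3
learning_rate: 1.0e-5    # the in-between rate argued for above
num_train_epochs: 1.0
output_dir: saves/orpo-sketch  # placeholder
```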