From e7e1855b870e5cd82faac7e60ff3c6549ccaa9df Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:51:26 -0400
Subject: [PATCH] =?UTF-8?q?research:=20ORPO=20=E2=80=94=20combined=20SFT?=
 =?UTF-8?q?=20+=20preference=20in=20one=20step,=20ideal=20for=20behavioral?=
 =?UTF-8?q?=20training?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ORPO applies a minor penalty to the disfavored response during SFT.
Single learning rate, single pass, both objectives. Implements the
bypass mechanism naturally (minor penalty = disfavor, not remove).
The loss landscape geometry explains the 40x lr gap: SFT is a valley,
DPO is a ridge, ORPO combines both. LLaMA-Factory supports it.
Dream loop generates triplets (context + preferred + rejected).
---
 training/research/practical-intuitions.md | 46 +++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 54d42e2..f1bfa91 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -353,3 +353,49 @@ DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
 ("this is better than that"). Comparative votes are noisier than
 absolute votes — the difference might be small, the preference
 might be marginal. So each comparative vote gets less weight.
+
+## ORPO: Combined SFT + Preference in One Step
+
+ORPO (Odds Ratio Preference Optimization) combines SFT and
+preference alignment in a single training step. Instead of:
+  Stage 1: SFT at lr=2e-5 (learn the behavior)
+  Stage 2: DPO at lr=5e-7 (learn the boundary)
+
+ORPO does:
+  Single stage: SFT + minor penalty for the rejected response
+
+The "minor penalty" implements the bypass mechanism: it doesn't
+remove the capability, just makes it slightly disfavored. Preserves
+judgment while shifting the default.
+
+For our system:
+- Dream loop generates triplets: context + preferred + rejected
+- ORPO trains on all three in one step
+- One learning rate (probably ~1e-5, between SFT and DPO)
+- LLaMA-Factory supports ORPO natively
+
+This might be the optimal training approach:
+- Simpler than staged SFT → DPO
+- Teaches both the behavior AND the boundary simultaneously
+- The minor penalty prevents over-training / sycophancy
+- Compatible with Apollo (just a different loss function)
+
+## The Loss Landscape Geometry
+
+SFT: pushes toward a TARGET (valley in the loss surface)
+  → strong gradient, clear direction, lr=2e-5 is fine
+
+DPO: enforces a COMPARISON (ridge between valleys)
+  → fragile gradient, narrow path, needs lr=5e-7
+
+ORPO: SFT valley + minor preference ridge
+  → moderate gradient, wider path, lr~1e-5 works
+
+The geometry explains the 40× lr difference between SFT and DPO.
+ORPO sits in between because it combines both geometries.
+
+For behavioral training, ORPO is ideal:
+- The SFT component teaches "produce good responses"
+- The preference component teaches "this response is better"
+- The minor penalty prevents over-optimization
+- Single pass, coherent integration
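
The "SFT + minor penalty" objective the patch describes can be sketched in a few lines. This is a minimal per-example illustration, not LLaMA-Factory's actual implementation: the function name `orpo_loss`, the penalty weight `lam`, and taking average per-token log-probs as inputs are all assumptions of the sketch.

```python
import math

def orpo_loss(chosen_logp, rejected_logp, lam=0.1):
    """ORPO-style loss for one triplet (sketch; `lam` is a guess).

    chosen_logp / rejected_logp: average per-token log-probability the
    model assigns to the preferred / rejected response (both < 0).
    """
    # odds(y) = p / (1 - p), in log space: logp - log(1 - p)
    def log_odds(logp):
        return logp - math.log(1.0 - math.exp(logp))

    # SFT term: plain negative log-likelihood of the preferred response
    # (the "valley" — strong gradient toward the target behavior)
    sft = -chosen_logp

    # Preference term: -log sigmoid(log-odds ratio). It shrinks as the
    # chosen response becomes more favored, so it acts as a minor
    # penalty on the rejected response rather than removing it.
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    pref = -math.log(1.0 / (1.0 + math.exp(-ratio)))

    # Single combined objective, one learning rate, one pass
    return sft + lam * pref
```

With `lam=0` this reduces to pure SFT; a small `lam` adds the gentle "disfavor, not remove" pressure the note argues for.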