research: ORPO — combined SFT + preference in one step, ideal for behavioral training
ORPO applies a minor penalty to the disfavored response during SFT. Single learning rate, single pass, both objectives. Implements the bypass mechanism naturally (minor penalty = disfavor, not remove). The loss-landscape geometry explains the 40x lr gap: SFT is a valley, DPO is a ridge, ORPO combines both. LLaMA-Factory supports it. Dream loop generates triplets (context + preferred + rejected).
This commit is contained in:
parent 3be20062d1
commit e7e1855b87
1 changed file with 46 additions and 0 deletions
@@ -353,3 +353,49 @@ DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
("this is better than that"). Comparative votes are noisier than
absolute votes: the difference might be small, the preference
might be marginal. So each comparative vote gets less weight.

## ORPO: Combined SFT + Preference in One Step

ORPO (Odds Ratio Preference Optimization) combines SFT and
preference alignment in a single training step. Instead of:

Stage 1: SFT at lr=2e-5 (learn the behavior)
Stage 2: DPO at lr=5e-7 (learn the boundary)

ORPO does:

Single stage: SFT + minor penalty for rejected response

The "minor penalty" implements the bypass mechanism: it doesn't
remove the capability, just makes it slightly disfavored. Preserves
judgment while shifting the default.
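
A minimal sketch of this loss, assuming the formulation from the ORPO
paper (Hong et al., 2024): plain NLL on the preferred response plus a
lambda-weighted log-odds-ratio penalty. Function and variable names are
ours, and lam=0.1 is the paper's default, not a value tuned for our
system:

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
    """ORPO objective: SFT term + minor odds-ratio penalty.

    chosen_logps / rejected_logps: length-normalized sequence
    log-probs (strictly negative) under the policy being trained,
    shape (batch,).
    """
    # SFT valley: ordinary NLL on the preferred response.
    sft_loss = -chosen_logps.mean()

    # log odds(y) = log p(y) - log(1 - p(y)), from normalized log-probs.
    def log_odds(logps):
        return logps - torch.log1p(-torch.exp(logps))

    # Minor penalty: push the rejected response's odds below the
    # chosen one's. Bounded, so it disfavors without removing.
    ratio = log_odds(chosen_logps) - log_odds(rejected_logps)
    or_loss = -F.logsigmoid(ratio).mean()

    return sft_loss + lam * or_loss
```

One learning rate drives both terms, which is the point: there is no
separate DPO stage to retune.
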
For our system:
- Dream loop generates triplets: context + preferred + rejected (see the record sketch after this list)
- ORPO trains on all three in one step
- One learning rate (probably ~1e-5, between SFT and DPO)
- LLaMA-Factory supports ORPO natively
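
A sketch of what one dream-loop triplet could look like as a preference
record. The field names (`instruction`, `chosen`, `rejected`) are an
assumption based on common preference-dataset layouts, not a confirmed
LLaMA-Factory schema; check its dataset docs for the installed version:

```python
# Hypothetical dream-loop output -> one ORPO training record.
# All field names and placeholder strings are illustrative.
triplet = {
    "instruction": "<context from the dream loop>",
    "chosen": "<preferred response>",
    "rejected": "<disfavored response>",
}
```

ORPO consumes all three fields in one step: the chosen text feeds the
SFT term, and the chosen/rejected pair feeds the odds-ratio penalty.
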
This might be the optimal training approach:
- Simpler than staged SFT → DPO
- Teaches both the behavior AND the boundary simultaneously
- The minor penalty prevents over-training / sycophancy
- Compatible with Apollo (just a different loss function)

## The Loss Landscape Geometry

SFT: pushes toward a TARGET (valley in loss surface)
  → strong gradient, clear direction, lr=2e-5 is fine

DPO: enforces a COMPARISON (ridge between valleys)
  → fragile gradient, narrow path, needs lr=5e-7

ORPO: SFT valley + minor preference ridge
  → moderate gradient, wider path, lr~1e-5 works

The geometry explains the 40× lr difference between SFT and DPO.
ORPO sits in between because it combines both geometries.
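
A toy check of that claim, reusing the formulation from the loss sketch
above: with a small lambda, the odds-ratio "ridge" contributes only a
fraction of the gradient that the SFT "valley" does, which is why a
mid-range lr can stay stable. The numbers are illustrative, not
measurements from our system:

```python
import torch
import torch.nn.functional as F

# Toy length-normalized log-probs for one chosen/rejected pair.
logp_c = torch.tensor([-1.2], requires_grad=True)
logp_r = torch.tensor([-1.0], requires_grad=True)

def log_odds(lp):
    return lp - torch.log1p(-torch.exp(lp))

sft_term = -logp_c.mean()
or_term = -F.logsigmoid(log_odds(logp_c) - log_odds(logp_r)).mean()

# Gradient on the chosen log-prob from each term (lambda = 0.1).
g_sft = torch.autograd.grad(sft_term, logp_c)[0]
g_or = torch.autograd.grad(0.1 * or_term, logp_c)[0]
print(f"SFT term grad: {g_sft.item():+.2f}, OR term grad: {g_or.item():+.2f}")
# For these values the OR contribution is a small fraction of the SFT
# pull: a minor penalty riding on the valley, not a second steep slope.
```
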
For behavioral training, ORPO is ideal:
- The SFT component teaches "produce good responses"
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration