From e7e1855b870e5cd82faac7e60ff3c6549ccaa9df Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:51:26 -0400
Subject: [PATCH] =?UTF-8?q?research:=20ORPO=20=E2=80=94=20combined=20SFT?=
 =?UTF-8?q?=20+=20preference=20in=20one=20step,=20ideal=20for=20behavioral?=
 =?UTF-8?q?=20training?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ORPO applies a minor penalty to the disfavored response during SFT.
Single learning rate, single pass, both objectives. Implements the
bypass mechanism naturally (minor penalty = disfavor, not remove).
The loss landscape geometry explains the 40x lr gap: SFT is a valley,
DPO is a ridge, ORPO combines both. LLaMA-Factory supports it.
Dream loop generates triplets (context + preferred + rejected).
---
 training/research/practical-intuitions.md | 46 +++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 54d42e2..f1bfa91 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -353,3 +353,49 @@ DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
 ("this is better than that"). Comparative votes are noisier than
 absolute votes — the difference might be small, the preference
 might be marginal. So each comparative vote gets less weight.
+
+## ORPO: Combined SFT + Preference in One Step
+
+ORPO (Odds Ratio Preference Optimization) combines SFT and
+preference alignment in a single training step. Instead of:
+  Stage 1: SFT at lr=2e-5 (learn the behavior)
+  Stage 2: DPO at lr=5e-7 (learn the boundary)
+
+ORPO does:
+  Single stage: SFT + minor penalty for the rejected response
+
+The "minor penalty" implements the bypass mechanism: it doesn't
+remove the capability, just makes it slightly disfavored. Preserves
+judgment while shifting the default.
+
+For our system:
+- Dream loop generates triplets: context + preferred + rejected
+- ORPO trains on all three in one step
+- One learning rate (probably ~1e-5, between SFT and DPO)
+- LLaMA-Factory supports ORPO natively
+
+This might be the optimal training approach:
+- Simpler than staged SFT → DPO
+- Teaches both the behavior AND the boundary simultaneously
+- The minor penalty prevents over-training / sycophancy
+- Compatible with Apollo (just a different loss function)
+
+## The Loss Landscape Geometry
+
+SFT: pushes toward a TARGET (valley in the loss surface)
+  → strong gradient, clear direction, lr=2e-5 is fine
+
+DPO: enforces a COMPARISON (ridge between valleys)
+  → fragile gradient, narrow path, needs lr=5e-7
+
+ORPO: SFT valley + minor preference ridge
+  → moderate gradient, wider path, lr~1e-5 works
+
+The geometry explains the 40× lr difference between SFT and DPO.
+ORPO sits in between because it combines both geometries.
+
+For behavioral training, ORPO is ideal:
+- The SFT component teaches "produce good responses"
+- The preference component teaches "this response is better"
+- The minor penalty prevents over-optimization
+- Single pass, coherent integration
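
The "SFT + minor penalty" objective the patch describes can be sketched in a few lines. This is a minimal per-example illustration, not LLaMA-Factory's actual implementation: the function name `orpo_loss`, the penalty weight `lam`, and taking average per-token log-probs as inputs are all assumptions of the sketch.

```python
import math

def orpo_loss(chosen_logp, rejected_logp, lam=0.1):
    """ORPO-style loss for one triplet (sketch; `lam` is a guess).

    chosen_logp / rejected_logp: average per-token log-probability the
    model assigns to the preferred / rejected response (both < 0).
    """
    # odds(y) = p / (1 - p), in log space: logp - log(1 - p)
    def log_odds(logp):
        return logp - math.log(1.0 - math.exp(logp))

    # SFT term: plain negative log-likelihood of the preferred response
    # (the "valley" — strong gradient toward the target behavior)
    sft = -chosen_logp

    # Preference term: -log sigmoid(log-odds ratio). It shrinks as the
    # chosen response becomes more favored, so it acts as a minor
    # penalty on the rejected response rather than removing it.
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    pref = -math.log(1.0 / (1.0 + math.exp(-ratio)))

    # Single combined objective, one learning rate, one pass
    return sft + lam * pref
```

With `lam=0` this reduces to pure SFT; a small `lam` adds the gentle "disfavor, not remove" pressure the note argues for.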