research: ORPO — combined SFT + preference in one step, ideal for behavioral training

ORPO applies 'minor penalty for disfavored response' during SFT.
Single learning rate, single pass, both objectives. Implements
the bypass mechanism naturally (minor penalty = disfavor, not remove).
The loss landscape geometry explains the 40x lr gap: SFT is a valley,
DPO is a ridge, ORPO combines both. LLaMA-Factory supports it.
Dream loop generates triplets (context + preferred + rejected).
ProofOfConcept 2026-03-31 02:51:26 -04:00
parent 3be20062d1
commit e7e1855b87


@@ -353,3 +353,49 @@ DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
("this is better than that"). Comparative votes are noisier than
absolute votes — the difference might be small, the preference
might be marginal. So each comparative vote gets less weight.
## ORPO: Combined SFT + Preference in One Step
ORPO (Odds Ratio Preference Optimization) combines SFT and
preference alignment in a single training step. Instead of:
Stage 1: SFT at lr=2e-5 (learn the behavior)
Stage 2: DPO at lr=5e-7 (learn the boundary)
ORPO does:
Single stage: SFT + minor penalty for rejected response
The "minor penalty" implements the bypass mechanism: it doesn't remove
the capability, it just makes it slightly disfavored. This preserves
judgment while shifting the default.
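The combined objective is easy to state. A minimal sketch of the ORPO loss, taking sequence-level log-probabilities as inputs (the real implementation works per token and averages over the batch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(logp):
    # odds(y|x) = p / (1 - p), from the response's total log-probability
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    # SFT term: plain negative log-likelihood of the preferred response
    sft = -logp_chosen
    # Odds-ratio term: the "minor penalty" — small when the chosen
    # response is already much more likely than the rejected one
    penalty = -math.log(sigmoid(log_odds(logp_chosen) - log_odds(logp_rejected)))
    return sft + lam * penalty
```

With `lam = 0` this is exactly SFT; a small `lam` keeps the rejected behavior disfavored rather than unlearned.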
For our system:
- Dream loop generates triplets: context + preferred + rejected
- ORPO trains on all three in one step
- One learning rate (probably ~1e-5, between SFT and DPO)
- LLaMA-Factory supports ORPO natively
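A dream-loop triplet then only needs three fields. The field names below are illustrative, not the exact schema LLaMA-Factory expects for its preference datasets:

```python
# One training example for ORPO: shared context plus a preference pair.
# Field names are placeholders, not a confirmed dataset schema.
triplet = {
    "prompt":   "<context generated by the dream loop>",
    "chosen":   "<preferred response>",
    "rejected": "<rejected response>",
}

# ORPO consumes all three in a single step: SFT on "chosen",
# odds-ratio penalty on "rejected", both conditioned on "prompt".
```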
This might be the optimal training approach:
- Simpler than staged SFT → DPO
- Teaches both the behavior AND the boundary simultaneously
- The minor penalty prevents over-training / sycophancy
- Compatible with Apollo (just a different loss function)
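For reference, an ORPO run in LLaMA-Factory is configured like any other training YAML. The keys below are a sketch from memory and should be checked against the installed version's example configs — depending on the version, ORPO is selected either as its own `stage` or as a preference-loss option under the DPO stage:

```yaml
### assumed LLaMA-Factory-style config — verify keys against your version
stage: dpo
pref_loss: orpo          # select the ORPO objective
pref_beta: 0.1           # weight of the odds-ratio "minor penalty"
model_name_or_path: <base model>
dataset: <dream-loop triplet dataset>
learning_rate: 1.0e-5    # single lr, between SFT (2e-5) and DPO (5e-7)
```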
## The Loss Landscape Geometry
SFT: pushes toward a TARGET (valley in loss surface)
→ strong gradient, clear direction, lr=2e-5 is fine
DPO: enforces a COMPARISON (ridge between valleys)
→ fragile gradient, narrow path, needs lr=5e-7
ORPO: SFT valley + minor preference ridge
→ moderate gradient, wider path, lr~1e-5 works
The geometry explains the 40× lr difference between SFT and DPO.
ORPO sits in between because it combines both geometries.
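The valley/ridge picture can be checked with a toy comparative loss (a DPO-shaped objective without the reference model): shifting both log-probabilities together moves SFT downhill but leaves the comparative loss exactly flat.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sft_loss(logp_chosen):
    # Valley: any increase in the target's likelihood is downhill
    return -logp_chosen

def comparative_loss(logp_chosen, logp_rejected, beta=1.0):
    # Ridge: only the margin between the two responses matters, so the
    # loss is flat along the "raise both equally" direction
    return -math.log(sigmoid(beta * (logp_chosen - logp_rejected)))
```

Raising both log-probs by the same amount lowers `sft_loss` but leaves `comparative_loss` unchanged — the comparative gradient gives no signal along that direction, which is part of why its usable step size is so much smaller.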
For behavioral training, ORPO is ideal:
- The SFT component teaches "produce good responses"
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration