diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 54d42e2..f1bfa91 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -353,3 +353,49 @@
 DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote ("this is
 better than that"). Comparative votes are noisier than absolute votes — the
 difference might be small, the preference might be marginal. So each comparative vote gets less weight.
+
+## ORPO: Combined SFT + Preference in One Step
+
+ORPO (Odds Ratio Preference Optimization) combines SFT and
+preference alignment in a single training step. Instead of:
+  Stage 1: SFT at lr=2e-5 (learn the behavior)
+  Stage 2: DPO at lr=5e-7 (learn the boundary)
+
+ORPO does:
+  Single stage: SFT + a minor penalty on the rejected response
+
+The "minor penalty" implements the bypass mechanism: it doesn't
+remove the capability, it just makes it slightly disfavored. That
+preserves judgment while shifting the default.
+
+For our system:
+- Dream loop generates triplets: context + preferred + rejected
+- ORPO trains on all three in one step
+- One learning rate (probably ~1e-5, between SFT and DPO)
+- LLaMA-Factory supports ORPO natively
+
+This might be the optimal training approach:
+- Simpler than staged SFT → DPO
+- Teaches both the behavior AND the boundary simultaneously
+- The minor penalty prevents over-training / sycophancy
+- Compatible with Apollo (just a different loss function)
+
+## The Loss Landscape Geometry
+
+SFT pushes toward a TARGET (a valley in the loss surface):
+  → strong gradient, clear direction, lr=2e-5 is fine
+
+DPO enforces a COMPARISON (a ridge between valleys):
+  → fragile gradient, narrow path, needs lr=5e-7
+
+ORPO is an SFT valley plus a minor preference ridge:
+  → moderate gradient, wider path, lr~1e-5 works
+
+The geometry explains the 40× lr difference between SFT and DPO.
+ORPO sits in between because it combines both geometries.
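The single-step objective sketched in the ORPO section above can be written down at the sequence level. This is a minimal illustration, not our training code: `avg_logp_chosen` and `avg_logp_rejected` stand for the model's average per-token log-probabilities of each response, and `lam` is an assumed weight on the odds-ratio term (λ in the ORPO paper, commonly around 0.1):

```python
import math

def log_odds(avg_logp: float) -> float:
    # log odds(y|x) = log p - log(1 - p), with p = exp(average token log-prob)
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              lam: float = 0.1) -> float:
    # SFT term: plain NLL on the chosen (preferred) response
    nll = -avg_logp_chosen
    # Odds-ratio term: -log sigmoid(log_odds(chosen) - log_odds(rejected)),
    # written in the numerically stable softplus form log(1 + e^-x)
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    penalty = math.log1p(math.exp(-ratio))
    return nll + lam * penalty

# Example: chosen response clearly preferred over rejected
loss = orpo_loss(avg_logp_chosen=-0.5, avg_logp_rejected=-2.0)
```

With `lam=0` this reduces to plain SFT, and widening the chosen/rejected gap shrinks the penalty term, which is why it stays "minor" once the preference is learned rather than grinding the rejected behavior to zero.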
+
+For behavioral training, ORPO is ideal:
+- The SFT component teaches "produce good responses"
+- The preference component teaches "this response is better"
+- The minor penalty prevents over-optimization
+- Single pass, coherent integration
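Since the notes say LLaMA-Factory supports ORPO natively, a config sketch may help as a starting point. This assumes a recent LLaMA-Factory release, where ORPO is selected as a preference loss under the DPO stage (`pref_loss: orpo`); the model, dataset, and path values are placeholders, and key names should be checked against the installed version:

```yaml
# Sketch only; verify keys against your LLaMA-Factory version.
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # placeholder
stage: dpo               # ORPO is exposed through the preference-training stage
pref_loss: orpo          # odds-ratio loss instead of the default DPO sigmoid
pref_beta: 0.1           # weight on the odds-ratio ("minor penalty") term
do_train: true
finetuning_type: lora
lora_target: all
dataset: dream_triplets  # placeholder: context + preferred + rejected
template: llama3
learning_rate: 1.0e-5    # the in-between rate argued for above
num_train_epochs: 1.0
output_dir: saves/orpo-sketch  # placeholder
```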