From d6b85d204a81b244f34a1d47f4ec6239d58ca21f Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 03:19:27 -0400
Subject: [PATCH] research: on-policy beats off-policy, DPO failure modes,
 variant landscape
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is
on-policy by construction. DPO can accidentally reduce preferred
likelihood (DPOP fixes this). Multiple DPO variants exist — start with
ORPO, switch only if specific failure modes observed.
---
 training/research/practical-intuitions.md | 41 +++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index f1bfa91..22a7f97 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -399,3 +399,44 @@ For behavioral training, ORPO is ideal:
 - The preference component teaches "this response is better"
 - The minor penalty prevents over-optimization
 - Single pass, coherent integration
+
+## On-Policy vs Off-Policy Rejected Examples
+
+Guo et al. (2024): "clear advantage of online methods over offline
+methods" for preference optimization. On-policy data (model generates
+its own rejected outputs) is better training signal than off-policy
+(pre-collected from other models).
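To make "on-policy by construction" concrete, a preference pair can be assembled so the rejected side always comes from the model's own sampling. This is only a sketch: `build_preference_pair` and the stub sampler are hypothetical names standing in for the actual generation pipeline, and t=0.1 mirrors the default-temperature rejected side described in this section.

```python
def build_preference_pair(prompt, target_response, generate):
    """Assemble an on-policy preference pair: the rejected side is the
    model's own low-temperature (default) output, not a pre-collected
    completion from some other model."""
    # Model's own default behavior at low temperature -> rejected
    rejected = generate(prompt, temperature=0.1)
    # Preferred side: the curated target response (or the model's own
    # higher-temperature exploration, per the temperature sweep)
    return {"prompt": prompt, "chosen": target_response, "rejected": rejected}


# Stub sampler for illustration only; the real model replaces this.
def fake_generate(prompt, temperature):
    return f"[t={temperature}] default completion for: {prompt}"


pair = build_preference_pair("Explain X.", "A thorough target answer.", fake_generate)
```

Because `generate` is the policy being trained, every rejected example is drawn from the model's actual distribution by construction.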
+
+Our temperature sweep IS on-policy generation:
+- The model generates its own rejected examples at various temperatures
+- These are from the model's ACTUAL distribution, not synthetic
+- The preference signal is: "your own default (t=0.1) is worse than
+  your own exploration (t=0.8) or the target response"
+
+This is why the temperature sweep works better than scripting
+rejected examples: the rejections come from the model itself, so
+the gradient tells the model exactly what to change about ITS OWN
+behavior, not some hypothetical other model's behavior.
+
+## DPO Failure Mode: Preferred Likelihood Reduction
+
+Smaug (Pal et al., 2024) found standard DPO can reduce the
+likelihood of PREFERRED examples, not just rejected ones. Both
+get pushed down. DPOP adds a positive term to fix this.
+
+For our system: monitor preferred response likelihood during
+training. If it's DECREASING, the DPO/ORPO loss is misbehaving.
+Switch to DPOP loss or increase the SFT component of ORPO.
+
+## Multiple DPO Variants Exist
+
+The field is actively iterating:
+- DPO: basic preference optimization
+- DPOP: fixes preferred likelihood reduction
+- ORPO: combines SFT + preference in one loss
+- Iterative DPO: adds NLL term, iterates
+- Online DPO: on-policy generation between iterations
+
+We should start with ORPO (simplest, combines SFT + preference)
+and only switch to a more complex variant if we observe specific
+failure modes (preferred likelihood reduction, sycophancy, etc.).