research: on-policy beats off-policy, DPO failure modes, variant landscape

On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is on-policy
by construction. DPO can accidentally reduce preferred likelihood (DPOP
fixes this). Multiple DPO variants exist — start with ORPO, switch only
if specific failure modes observed.
This commit is contained in:
ProofOfConcept 2026-03-31 03:19:27 -04:00
parent e7e1855b87
commit d6b85d204a

@@ -399,3 +399,44 @@ For behavioral training, ORPO is ideal:
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration
## On-Policy vs Off-Policy Rejected Examples
Guo et al. (2024): "clear advantage of online methods over offline
methods" for preference optimization. On-policy data (model generates
its own rejected outputs) is better training signal than off-policy
(pre-collected from other models).
Our temperature sweep IS on-policy generation:
- The model generates its own rejected examples at various temperatures
- These are from the model's ACTUAL distribution, not synthetic
- The preference signal is: "your own default (t=0.1) is worse than
your own exploration (t=0.8) or the target response"
This is why the temperature sweep works better than scripting
rejected examples: the rejections come from the model itself, so
the gradient tells the model exactly what to change about ITS OWN
behavior, not some hypothetical other model's behavior.
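A minimal sketch of turning the temperature sweep into preference pairs (the `sample_fn` hook, field names, and the stub sampler are illustrative, not our actual pipeline):

```python
def build_on_policy_pairs(prompt, sample_fn, target, temperatures=(0.1, 0.8)):
    """Build preference pairs whose rejected side comes from the model's
    OWN distribution. `sample_fn(prompt, temperature)` is a hypothetical
    hook; in practice it would wrap the model's generate call."""
    pairs = []
    for t in temperatures:
        rejected = sample_fn(prompt, temperature=t)
        if rejected == target:
            continue  # model already produces the target; no signal here
        pairs.append({"prompt": prompt, "chosen": target,
                      "rejected": rejected, "temperature": t})
    return pairs


# Stub sampler standing in for the model, just to show the data shape.
def fake_sampler(prompt, temperature):
    return f"default answer (t={temperature})"

pairs = build_on_policy_pairs("Explain X.", fake_sampler, "target answer")
```

Every rejected example here is something the model actually emits at that temperature, which is what makes the gradient correct behavior the model actually has.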
## DPO Failure Mode: Preferred Likelihood Reduction
Smaug (Pal et al., 2024) found standard DPO can reduce the
likelihood of PREFERRED examples, not just rejected ones. Both
get pushed down. DPOP adds a positive term to fix this.
For our system: monitor preferred response likelihood during
training. If it's DECREASING, the DPO/ORPO loss is misbehaving.
Switch to DPOP loss or increase the SFT component of ORPO.
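A per-example scalar sketch of the DPOP idea (the interface and the default `lam` are illustrative; real implementations operate on batched tensors of sequence log-probs). The penalty term is exactly the quantity to monitor: it is nonzero precisely when the policy's likelihood of the PREFERRED response has fallen below the reference model's.

```python
import math

def dpop_loss(pc, pr, rc, rr, beta=0.1, lam=50.0):
    """pc/pr: policy log-probs of chosen/rejected responses;
    rc/rr: the same log-probs under the frozen reference model."""
    # Standard DPO term: -log sigmoid of the implicit reward margin.
    margin = beta * ((pc - rc) - (pr - rr))
    dpo_term = math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))
    # DPOP positive term: fires only when the preferred response has
    # become LESS likely under the policy than under the reference.
    penalty = max(0.0, rc - pc)
    return dpo_term + lam * penalty
```

With a large `lam`, any drop in preferred likelihood dominates the loss, which is the failure mode Pal et al. observed standard DPO allowing.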
## Multiple DPO Variants Exist
The field is actively iterating:
- DPO: basic preference optimization
- DPOP: fixes preferred likelihood reduction
- ORPO: combines SFT + preference in one loss
- Iterative DPO: adds NLL term, iterates
- Online DPO: on-policy generation between iterations
We should start with ORPO (simplest, combines SFT + preference)
and only switch to a more complex variant if we observe specific
failure modes (preferred likelihood reduction, sycophancy, etc.).
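For reference, a per-example sketch of the ORPO loss we would start with (scalar interface is illustrative; inputs are assumed to be length-normalized mean token log-probs under the policy):

```python
import math

def orpo_loss(chosen_logp, rejected_logp, lam=0.1):
    """ORPO = SFT loss on the chosen response + lam * odds-ratio
    preference penalty. Inputs must be strictly negative log-probs."""
    def log_odds(logp):
        # log( p / (1 - p) ), computed from log p without underflow
        return logp - math.log1p(-math.exp(logp))
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    or_term = math.log1p(math.exp(-ratio))  # == -log(sigmoid(ratio))
    sft_term = -chosen_logp                 # plain NLL on the chosen response
    return sft_term + lam * or_term
```

Note there is no reference model anywhere in the loss, which is what makes ORPO the simplest variant to start with: one model, one pass, SFT and preference in a single objective.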