research: on-policy beats off-policy, DPO failure modes, variant landscape
On-policy rejected examples (model's own failures) are better training signal than off-policy (pre-collected). Our temperature sweep is on-policy by construction. DPO can accidentally reduce preferred likelihood (DPOP fixes this). Multiple DPO variants exist — start with ORPO, switch only if specific failure modes observed.
For behavioral training, ORPO is ideal:
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration

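To make the single-pass structure concrete, here is a toy sketch of the ORPO objective on length-normalized sequence probabilities: an SFT term on the chosen response plus a weighted odds-ratio penalty. The `lam` weight and the length-normalized inputs are assumptions of this sketch, not ORPO's reference implementation.

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Toy ORPO loss on average per-token log-probabilities of the
    chosen and rejected responses under the current policy.
    lam (assumed default) weights the odds-ratio penalty."""
    # SFT component: standard NLL on the chosen response
    sft = -logp_chosen

    # Odds of a response: p / (1 - p), computed from its log-probability
    def log_odds(logp):
        return logp - math.log(1.0 - math.exp(logp))

    # Preference component: -log sigmoid of the log odds ratio
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    preference = math.log(1.0 + math.exp(-ratio))  # = -log sigmoid(ratio)

    return sft + lam * preference
```

Both terms pull in the same direction, which is why a single pass suffices: raising the chosen response's likelihood lowers the SFT term and the penalty at once.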
## On-Policy vs Off-Policy Rejected Examples

Guo et al. (2024) report a "clear advantage of online methods over offline methods" for preference optimization. On-policy data, where the model generates its own rejected outputs, is a better training signal than off-policy data pre-collected from other models.

Our temperature sweep IS on-policy generation:
- The model generates its own rejected examples at various temperatures
- These are from the model's ACTUAL distribution, not synthetic
- The preference signal is: "your own default (t=0.1) is worse than your own exploration (t=0.8) or the target response"

This is why the temperature sweep works better than scripting rejected examples: the rejections come from the model itself, so the gradient tells the model exactly what to change about ITS OWN behavior, not some hypothetical other model's behavior.

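The pair construction described above can be sketched as follows, with `sample_fn` standing in for the model's own generate call; the function, field names, and default temperatures here are illustrative choices, not a fixed API.

```python
def build_preference_pairs(prompt, sample_fn, target_response,
                           default_temp=0.1, sweep_temps=(0.8, 1.0)):
    """Build on-policy preference pairs from a temperature sweep.

    sample_fn(prompt, temperature) is a stand-in for the model's own
    generation call, so every rejected example comes from the model's
    actual distribution. Temperatures mirror the note above (t=0.1
    default vs t=0.8 exploration); tune them for your model.
    """
    # Rejected: the model's own default low-temperature behavior
    rejected = sample_fn(prompt, default_temp)

    pairs = []
    # Prefer the target response over the model's default
    pairs.append({"prompt": prompt, "chosen": target_response,
                  "rejected": rejected})
    # Also prefer the model's own higher-temperature explorations
    for t in sweep_temps:
        exploration = sample_fn(prompt, t)
        if exploration != rejected:
            pairs.append({"prompt": prompt, "chosen": exploration,
                          "rejected": rejected})
    return pairs
```

Because `rejected` always comes from the same model being trained, the gradient acts directly on the behavior we want to change.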
## DPO Failure Mode: Preferred Likelihood Reduction

Smaug (Pal et al., 2024) found that standard DPO can reduce the likelihood of PREFERRED examples, not just rejected ones: both get pushed down. DPOP adds a positive term to the loss to fix this.

For our system: monitor the preferred-response likelihood during training. If it is DECREASING, the DPO/ORPO loss is misbehaving. Switch to the DPOP loss or increase the SFT component of ORPO.

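A minimal monitoring hook for that check might look like the following; the window size and the history format (a list of per-step mean chosen-response log-probs) are assumptions, not part of any training framework's API.

```python
def preferred_likelihood_alert(history, window=2):
    """Flag the failure mode described above: the average
    log-probability of PREFERRED responses trending down.

    history: per-step mean chosen-response log-probs, most recent
    last. Returns True if the recent window averages lower than the
    window before it. Pick a window large enough to smooth step noise.
    """
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    earlier = sum(history[-2 * window:-window]) / window
    recent = sum(history[-window:]) / window
    return recent < earlier
```

When the alert fires, that is the signal to switch losses or reweight the SFT component rather than continue training blind.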
## Multiple DPO Variants Exist

The field is actively iterating:

- DPO: basic preference optimization
- DPOP: fixes preferred-likelihood reduction
- ORPO: combines SFT + preference in one loss
- Iterative DPO: adds an NLL term, iterates
- Online DPO: on-policy generation between iterations

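The difference between the first two variants can be sketched on sequence log-probs. In Pal et al.'s formulation the penalty sits inside the sigmoid and is active only when the policy drops the preferred response below the reference; the `beta` and `lam` values here are illustrative, not the paper's tuned settings.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO on sequence log-probs under the policy (lp_*) and
    the reference model (ref_*), for winning (w) and losing (l)
    responses."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -math.log(_sigmoid(margin))

def dpop_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPOP: penalize the policy for dropping the preferred response
    below the reference, which standard DPO does not forbid.
    lam=5.0 is an assumed value for illustration."""
    penalty = max(0.0, ref_w - lp_w)  # active only when pi < pi_ref on y_w
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l)) - lam * penalty
    return -math.log(_sigmoid(margin))
```

Note that DPO's margin only involves the *difference* between winning and losing terms, so pushing both down can leave the loss unchanged; DPOP's extra term breaks that symmetry.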
We should start with ORPO (the simplest option: it combines SFT + preference in one loss) and only switch to a more complex variant if we observe specific failure modes (preferred-likelihood reduction, sycophancy, etc.).