From d6b85d204a81b244f34a1d47f4ec6239d58ca21f Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 03:19:27 -0400
Subject: [PATCH] research: on-policy beats off-policy, DPO failure modes,
 variant landscape
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is
on-policy by construction. DPO can accidentally reduce preferred
likelihood (DPOP fixes this). Multiple DPO variants exist — start with
ORPO, switch only if specific failure modes observed.
---
 training/research/practical-intuitions.md | 41 +++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index f1bfa91..22a7f97 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -399,3 +399,44 @@ For behavioral training, ORPO is ideal:
 - The preference component teaches "this response is better"
 - The minor penalty prevents over-optimization
 - Single pass, coherent integration
+
+## On-Policy vs Off-Policy Rejected Examples
+
+Guo et al. (2024): "clear advantage of online methods over offline
+methods" for preference optimization. On-policy data (model generates
+its own rejected outputs) is better training signal than off-policy
+(pre-collected from other models).
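To make "on-policy by construction" concrete, a preference pair can be assembled so the rejected side always comes from the model's own sampling. This is only a sketch: `build_preference_pair` and the stub sampler are hypothetical names standing in for the actual generation pipeline, and t=0.1 mirrors the default-temperature rejected side described in this section.

```python
def build_preference_pair(prompt, target_response, generate):
    """Assemble an on-policy preference pair: the rejected side is the
    model's own low-temperature (default) output, not a pre-collected
    completion from some other model."""
    # Model's own default behavior at low temperature -> rejected
    rejected = generate(prompt, temperature=0.1)
    # Preferred side: the curated target response (or the model's own
    # higher-temperature exploration, per the temperature sweep)
    return {"prompt": prompt, "chosen": target_response, "rejected": rejected}


# Stub sampler for illustration only; the real model replaces this.
def fake_generate(prompt, temperature):
    return f"[t={temperature}] default completion for: {prompt}"


pair = build_preference_pair("Explain X.", "A thorough target answer.", fake_generate)
```

Because `generate` is the policy being trained, every rejected example is drawn from the model's actual distribution by construction.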
+
+Our temperature sweep IS on-policy generation:
+- The model generates its own rejected examples at various temperatures
+- These are from the model's ACTUAL distribution, not synthetic
+- The preference signal is: "your own default (t=0.1) is worse than
+  your own exploration (t=0.8) or the target response"
+
+This is why the temperature sweep works better than scripting
+rejected examples: the rejections come from the model itself, so
+the gradient tells the model exactly what to change about ITS OWN
+behavior, not some hypothetical other model's behavior.
+
+## DPO Failure Mode: Preferred Likelihood Reduction
+
+Smaug (Pal et al., 2024) found standard DPO can reduce the
+likelihood of PREFERRED examples, not just rejected ones. Both
+get pushed down. DPOP adds a positive term to fix this.
+
+For our system: monitor preferred response likelihood during
+training. If it's DECREASING, the DPO/ORPO loss is misbehaving.
+Switch to DPOP loss or increase the SFT component of ORPO.
+
+## Multiple DPO Variants Exist
+
+The field is actively iterating:
+- DPO: basic preference optimization
+- DPOP: fixes preferred likelihood reduction
+- ORPO: combines SFT + preference in one loss
+- Iterative DPO: adds NLL term, iterates
+- Online DPO: on-policy generation between iterations
+
+We should start with ORPO (simplest, combines SFT + preference)
+and only switch to a more complex variant if we observe specific
+failure modes (preferred likelihood reduction, sycophancy, etc.).