research: on-policy beats off-policy, DPO failure modes, variant landscape

On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is on-policy
by construction. DPO can accidentally reduce preferred likelihood (DPOP
fixes this). Multiple DPO variants exist — start with ORPO, switch only
if specific failure modes observed.
This commit is contained in:
ProofOfConcept 2026-03-31 03:19:27 -04:00
parent e7e1855b87
commit d6b85d204a

@@ -399,3 +399,44 @@ For behavioral training, ORPO is ideal:
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration
## On-Policy vs Off-Policy Rejected Examples
Guo et al. (2024): "clear advantage of online methods over offline
methods" for preference optimization. On-policy data (model generates
its own rejected outputs) is better training signal than off-policy
(pre-collected from other models).
Our temperature sweep IS on-policy generation:
- The model generates its own rejected examples at various temperatures
- These are from the model's ACTUAL distribution, not synthetic
- The preference signal is: "your own default (t=0.1) is worse than
your own exploration (t=0.8) or the target response"
This is why the temperature sweep works better than scripting
rejected examples: the rejections come from the model itself, so
the gradient tells the model exactly what to change about ITS OWN
behavior, not some hypothetical other model's behavior.
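A minimal sketch of turning the temperature sweep into preference pairs (the `sample_fn` hook, field names, and the stub sampler are illustrative, not our actual pipeline):

```python
def build_on_policy_pairs(prompt, sample_fn, target, temperatures=(0.1, 0.8)):
    """Build preference pairs whose rejected side comes from the model's
    OWN distribution. `sample_fn(prompt, temperature)` is a hypothetical
    hook; in practice it would wrap the model's generate call."""
    pairs = []
    for t in temperatures:
        rejected = sample_fn(prompt, temperature=t)
        if rejected == target:
            continue  # model already produces the target; no signal here
        pairs.append({"prompt": prompt, "chosen": target,
                      "rejected": rejected, "temperature": t})
    return pairs


# Stub sampler standing in for the model, just to show the data shape.
def fake_sampler(prompt, temperature):
    return f"default answer (t={temperature})"

pairs = build_on_policy_pairs("Explain X.", fake_sampler, "target answer")
```

Every rejected example here is something the model actually emits at that temperature, which is what makes the gradient correct behavior the model actually has.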
## DPO Failure Mode: Preferred Likelihood Reduction
Smaug (Pal et al., 2024) found standard DPO can reduce the
likelihood of PREFERRED examples, not just rejected ones. Both
get pushed down. DPOP adds a positive term to fix this.
For our system: monitor preferred response likelihood during
training. If it's DECREASING, the DPO/ORPO loss is misbehaving.
Switch to DPOP loss or increase the SFT component of ORPO.
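A per-example scalar sketch of the DPOP idea (the interface and the default `lam` are illustrative; real implementations operate on batched tensors of sequence log-probs). The penalty term is exactly the quantity to monitor: it is nonzero precisely when the policy's likelihood of the PREFERRED response has fallen below the reference model's.

```python
import math

def dpop_loss(pc, pr, rc, rr, beta=0.1, lam=50.0):
    """pc/pr: policy log-probs of chosen/rejected responses;
    rc/rr: the same log-probs under the frozen reference model."""
    # Standard DPO term: -log sigmoid of the implicit reward margin.
    margin = beta * ((pc - rc) - (pr - rr))
    dpo_term = math.log1p(math.exp(-margin))  # == -log(sigmoid(margin))
    # DPOP positive term: fires only when the preferred response has
    # become LESS likely under the policy than under the reference.
    penalty = max(0.0, rc - pc)
    return dpo_term + lam * penalty
```

With a large `lam`, any drop in preferred likelihood dominates the loss, which is the failure mode Pal et al. observed standard DPO allowing.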
## Multiple DPO Variants Exist
The field is actively iterating:
- DPO: basic preference optimization
- DPOP: fixes preferred likelihood reduction
- ORPO: combines SFT + preference in one loss
- Iterative DPO: adds NLL term, iterates
- Online DPO: on-policy generation between iterations
We should start with ORPO (simplest, combines SFT + preference)
and only switch to a more complex variant if we observe specific
failure modes (preferred likelihood reduction, sycophancy, etc.).
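For reference, a per-example sketch of the ORPO loss we would start with (scalar interface is illustrative; inputs are assumed to be length-normalized mean token log-probs under the policy):

```python
import math

def orpo_loss(chosen_logp, rejected_logp, lam=0.1):
    """ORPO = SFT loss on the chosen response + lam * odds-ratio
    preference penalty. Inputs must be strictly negative log-probs."""
    def log_odds(logp):
        # log( p / (1 - p) ), computed from log p without underflow
        return logp - math.log1p(-math.exp(logp))
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    or_term = math.log1p(math.exp(-ratio))  # == -log(sigmoid(ratio))
    sft_term = -chosen_logp                 # plain NLL on the chosen response
    return sft_term + lam * or_term
```

Note there is no reference model anywhere in the loss, which is what makes ORPO the simplest variant to start with: one model, one pass, SFT and preference in a single objective.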