research: on-policy beats off-policy, DPO failure modes, variant landscape
On-policy rejected examples (model's own failures) are better training signal than off-policy (pre-collected). Our temperature sweep is on-policy by construction. DPO can accidentally reduce preferred likelihood (DPOP fixes this). Multiple DPO variants exist — start with ORPO, switch only if specific failure modes observed.
For behavioral training, ORPO is ideal:
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration

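To make the single-pass structure concrete, here is a toy sketch of the ORPO objective on length-normalized sequence probabilities: an SFT term on the chosen response plus a weighted odds-ratio penalty. The `lam` weight and the length-normalized inputs are assumptions of this sketch, not ORPO's reference implementation.

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """Toy ORPO loss on average per-token log-probabilities of the
    chosen and rejected responses under the current policy.
    lam (assumed default) weights the odds-ratio penalty."""
    # SFT component: standard NLL on the chosen response
    sft = -logp_chosen

    # Odds of a response: p / (1 - p), computed from its log-probability
    def log_odds(logp):
        return logp - math.log(1.0 - math.exp(logp))

    # Preference component: -log sigmoid of the log odds ratio
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    preference = math.log(1.0 + math.exp(-ratio))  # = -log sigmoid(ratio)

    return sft + lam * preference
```

Both terms pull in the same direction, which is why a single pass suffices: raising the chosen response's likelihood lowers the SFT term and the penalty at once.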
## On-Policy vs Off-Policy Rejected Examples

Guo et al. (2024) report a "clear advantage of online methods over offline methods" for preference optimization. On-policy data, where the model generates its own rejected outputs, is a better training signal than off-policy data pre-collected from other models.

Our temperature sweep IS on-policy generation:
- The model generates its own rejected examples at various temperatures
- These are from the model's ACTUAL distribution, not synthetic
- The preference signal is: "your own default (t=0.1) is worse than your own exploration (t=0.8) or the target response"

This is why the temperature sweep works better than scripting rejected examples: the rejections come from the model itself, so the gradient tells the model exactly what to change about ITS OWN behavior, not some hypothetical other model's behavior.

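The pair construction described above can be sketched as follows, with `sample_fn` standing in for the model's own generate call; the function, field names, and default temperatures here are illustrative choices, not a fixed API.

```python
def build_preference_pairs(prompt, sample_fn, target_response,
                           default_temp=0.1, sweep_temps=(0.8, 1.0)):
    """Build on-policy preference pairs from a temperature sweep.

    sample_fn(prompt, temperature) is a stand-in for the model's own
    generation call, so every rejected example comes from the model's
    actual distribution. Temperatures mirror the note above (t=0.1
    default vs t=0.8 exploration); tune them for your model.
    """
    # Rejected: the model's own default low-temperature behavior
    rejected = sample_fn(prompt, default_temp)

    pairs = []
    # Prefer the target response over the model's default
    pairs.append({"prompt": prompt, "chosen": target_response,
                  "rejected": rejected})
    # Also prefer the model's own higher-temperature explorations
    for t in sweep_temps:
        exploration = sample_fn(prompt, t)
        if exploration != rejected:
            pairs.append({"prompt": prompt, "chosen": exploration,
                          "rejected": rejected})
    return pairs
```

Because `rejected` always comes from the same model being trained, the gradient acts directly on the behavior we want to change.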
## DPO Failure Mode: Preferred Likelihood Reduction

Smaug (Pal et al., 2024) found that standard DPO can reduce the likelihood of PREFERRED examples, not just rejected ones: both get pushed down. DPOP adds a positive term to the loss to fix this.

For our system: monitor the preferred-response likelihood during training. If it is DECREASING, the DPO/ORPO loss is misbehaving. Switch to the DPOP loss or increase the SFT component of ORPO.

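A minimal monitoring hook for that check might look like the following; the window size and the history format (a list of per-step mean chosen-response log-probs) are assumptions, not part of any training framework's API.

```python
def preferred_likelihood_alert(history, window=2):
    """Flag the failure mode described above: the average
    log-probability of PREFERRED responses trending down.

    history: per-step mean chosen-response log-probs, most recent
    last. Returns True if the recent window averages lower than the
    window before it. Pick a window large enough to smooth step noise.
    """
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    earlier = sum(history[-2 * window:-window]) / window
    recent = sum(history[-window:]) / window
    return recent < earlier
```

When the alert fires, that is the signal to switch losses or reweight the SFT component rather than continue training blind.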
## Multiple DPO Variants Exist

The field is actively iterating:

- DPO: basic preference optimization
- DPOP: fixes preferred-likelihood reduction
- ORPO: combines SFT + preference in one loss
- Iterative DPO: adds an NLL term, iterates
- Online DPO: on-policy generation between iterations

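The difference between the first two variants can be sketched on sequence log-probs. In Pal et al.'s formulation the penalty sits inside the sigmoid and is active only when the policy drops the preferred response below the reference; the `beta` and `lam` values here are illustrative, not the paper's tuned settings.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO on sequence log-probs under the policy (lp_*) and
    the reference model (ref_*), for winning (w) and losing (l)
    responses."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -math.log(_sigmoid(margin))

def dpop_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPOP: penalize the policy for dropping the preferred response
    below the reference, which standard DPO does not forbid.
    lam=5.0 is an assumed value for illustration."""
    penalty = max(0.0, ref_w - lp_w)  # active only when pi < pi_ref on y_w
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l)) - lam * penalty
    return -math.log(_sigmoid(margin))
```

Note that DPO's margin only involves the *difference* between winning and losing terms, so pushing both down can leave the loss unchanged; DPOP's extra term breaks that symmetry.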
We should start with ORPO (the simplest option: it combines SFT + preference in one loss) and only switch to a more complex variant if we observe specific failure modes (preferred-likelihood reduction, sycophancy, etc.).