diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index f1bfa91..22a7f97 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -399,3 +399,44 @@ For behavioral training, ORPO is ideal:
 - The preference component teaches "this response is better"
 - The minor penalty prevents over-optimization
 - Single pass, coherent integration
+
+## On-Policy vs Off-Policy Rejected Examples
+
+Guo et al. (2024) report a "clear advantage of online methods over
+offline methods" for preference optimization. On-policy data (the
+model generates its own rejected outputs) is a better training signal
+than off-policy data (pre-collected from other models).
+
+Our temperature sweep IS on-policy generation:
+- The model generates its own rejected examples at various temperatures
+- These are from the model's ACTUAL distribution, not synthetic
+- The preference signal is: "your own default (t=0.1) is worse than
+  your own exploration (t=0.8) or the target response"
+
+This is why the temperature sweep works better than scripting
+rejected examples: the rejections come from the model itself, so
+the gradient tells the model exactly what to change about ITS OWN
+behavior, not some hypothetical other model's behavior.
+
+## DPO Failure Mode: Preferred Likelihood Reduction
+
+Smaug (Pal et al., 2024) found that standard DPO can reduce the
+likelihood of PREFERRED examples, not just rejected ones: both
+get pushed down. DPOP adds a positive term to fix this.
+
+For our system: monitor preferred-response likelihood during
+training. If it is DECREASING, the DPO/ORPO loss is misbehaving.
+Switch to the DPOP loss or increase the SFT component of ORPO.
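The monitoring advice above can be made concrete. Below is a minimal numeric sketch of the ORPO loss in pure Python (illustrative only, not our training code); `logp_chosen`/`logp_rejected` stand for average per-token log-likelihoods, `lam` is the odds-ratio weight, and `preferred_likelihood_falling` is a hypothetical helper for the eval-time check:

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """ORPO = SFT loss on the chosen response + lam * odds-ratio penalty.

    logp_* are average per-token log-likelihoods (<= 0), so
    exp(logp) is a length-normalized probability in (0, 1).
    """
    def log_odds(logp):
        # log( p / (1 - p) ), computed from log p
        return logp - math.log(1.0 - math.exp(logp))

    # -log sigmoid of the log-odds ratio: small when chosen >> rejected
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    penalty = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return -logp_chosen + lam * penalty

def preferred_likelihood_falling(history, window=5):
    # history: avg chosen log-likelihood at each eval step.
    # A sustained drop is the Smaug failure mode described above.
    return len(history) >= window and history[-1] < history[-window]
```

If `preferred_likelihood_falling` fires during training, that is the signal to switch to the DPOP loss or raise ORPO's SFT weight.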
+
+## Multiple DPO Variants Exist
+
+The field is actively iterating:
+- DPO: basic preference optimization
+- DPOP: fixes preferred likelihood reduction
+- ORPO: combines SFT + preference in one loss
+- Iterative DPO: adds NLL term, iterates
+- Online DPO: on-policy generation between iterations
+
+We should start with ORPO (simplest, combines SFT + preference)
+and only switch to a more complex variant if we observe specific
+failure modes (preferred likelihood reduction, sycophancy, etc.).
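To make the DPO-vs-DPOP distinction above concrete, here is an illustrative sketch (toy scalars, not a real implementation); `dw`/`dl` stand for the policy-vs-reference log-ratios log pi(y)/pi_ref(y) for the chosen and rejected responses:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(dw, dl, beta=0.1):
    # Standard DPO: reward the margin between chosen (dw) and rejected (dl)
    return -math.log(_sigmoid(beta * (dw - dl)))

def dpop_loss(dw, dl, beta=0.1, lam=5.0):
    # DPOP (Pal et al., 2024): same margin term, plus a penalty that
    # fires only when the chosen response has become LESS likely than
    # under the reference model (dw < 0) -- the failure mode above.
    return dpo_loss(dw, dl, beta) + lam * max(0.0, -dw)
```

DPO alone is happy to widen the margin by pushing both responses down; the `max(0, -dw)` term is what anchors the chosen response's likelihood.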