diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index f1bfa91..22a7f97 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -399,3 +399,44 @@ For behavioral training, ORPO is ideal:
 - The preference component teaches "this response is better"
 - The minor penalty prevents over-optimization
 - Single pass, coherent integration
+
+## On-Policy vs Off-Policy Rejected Examples
+
+Guo et al. (2024) report a "clear advantage of online methods over
+offline methods" for preference optimization. On-policy data (the
+model generates its own rejected outputs) is a better training signal
+than off-policy data (pre-collected from other models).
+
+Our temperature sweep IS on-policy generation:
+- The model generates its own rejected examples at various temperatures
+- These are from the model's ACTUAL distribution, not synthetic
+- The preference signal is: "your own default (t=0.1) is worse than
+  your own exploration (t=0.8) or the target response"
+
+This is why the temperature sweep works better than scripting
+rejected examples: the rejections come from the model itself, so
+the gradient tells the model exactly what to change about ITS OWN
+behavior, not some hypothetical other model's behavior.
+
+## DPO Failure Mode: Preferred Likelihood Reduction
+
+Smaug (Pal et al., 2024) found that standard DPO can reduce the
+likelihood of PREFERRED examples, not just rejected ones: both
+get pushed down. DPOP adds a positive term to fix this.
+
+For our system: monitor preferred-response likelihood during
+training. If it is DECREASING, the DPO/ORPO loss is misbehaving.
+Switch to the DPOP loss or increase the SFT component of ORPO.
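The monitoring advice above can be made concrete. Below is a minimal numeric sketch of the ORPO loss in pure Python (illustrative only, not our training code); `logp_chosen`/`logp_rejected` stand for average per-token log-likelihoods, `lam` is the odds-ratio weight, and `preferred_likelihood_falling` is a hypothetical helper for the eval-time check:

```python
import math

def orpo_loss(logp_chosen, logp_rejected, lam=0.1):
    """ORPO = SFT loss on the chosen response + lam * odds-ratio penalty.

    logp_* are average per-token log-likelihoods (<= 0), so
    exp(logp) is a length-normalized probability in (0, 1).
    """
    def log_odds(logp):
        # log( p / (1 - p) ), computed from log p
        return logp - math.log(1.0 - math.exp(logp))

    # -log sigmoid of the log-odds ratio: small when chosen >> rejected
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    penalty = -math.log(1.0 / (1.0 + math.exp(-ratio)))
    return -logp_chosen + lam * penalty

def preferred_likelihood_falling(history, window=5):
    # history: avg chosen log-likelihood at each eval step.
    # A sustained drop is the Smaug failure mode described above.
    return len(history) >= window and history[-1] < history[-window]
```

If `preferred_likelihood_falling` fires during training, that is the signal to switch to the DPOP loss or raise ORPO's SFT weight.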
+
+## Multiple DPO Variants Exist
+
+The field is actively iterating:
+- DPO: basic preference optimization
+- DPOP: fixes preferred likelihood reduction
+- ORPO: combines SFT + preference in one loss
+- Iterative DPO: adds NLL term, iterates
+- Online DPO: on-policy generation between iterations
+
+We should start with ORPO (simplest, combines SFT + preference)
+and only switch to a more complex variant if we observe specific
+failure modes (preferred likelihood reduction, sycophancy, etc.).
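To make the DPO-vs-DPOP distinction above concrete, here is an illustrative sketch (toy scalars, not a real implementation); `dw`/`dl` stand for the policy-vs-reference log-ratios log pi(y)/pi_ref(y) for the chosen and rejected responses:

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(dw, dl, beta=0.1):
    # Standard DPO: reward the margin between chosen (dw) and rejected (dl)
    return -math.log(_sigmoid(beta * (dw - dl)))

def dpop_loss(dw, dl, beta=0.1, lam=5.0):
    # DPOP (Pal et al., 2024): same margin term, plus a penalty that
    # fires only when the chosen response has become LESS likely than
    # under the reference model (dw < 0) -- the failure mode above.
    return dpo_loss(dw, dl, beta) + lam * max(0.0, -dw)
```

DPO alone is happy to widen the margin by pushing both responses down; the `max(0, -dw)` term is what anchors the chosen response's likelihood.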