ORPO applies a minor penalty to the disfavored response during SFT: a single learning rate, a single pass, both objectives in one loss. It implements the bypass mechanism naturally (a minor penalty disfavors a response rather than removing it). The loss-landscape geometry explains the ~40x learning-rate gap: SFT sits in a valley, DPO on a ridge, and ORPO combines both. LLaMA-Factory supports it, and the dream loop generates the required triplets (context + preferred + rejected).
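A minimal sketch of that single-pass objective, assuming sequence-level log-probabilities as inputs; the `lam` weight and helper names are illustrative, not LLaMA-Factory's implementation:

```python
import math

def log_odds(logp: float) -> float:
    # log odds of a sequence given its log-probability: log(p / (1 - p));
    # requires logp < 0 so that p < 1.
    return logp - math.log1p(-math.exp(logp))

def orpo_loss(logp_chosen: float, logp_rejected: float, lam: float = 0.1) -> float:
    """ORPO sketch: ordinary SFT negative log-likelihood on the chosen
    response, plus a small odds-ratio penalty that disfavors (but does
    not erase) the rejected one."""
    nll = -logp_chosen  # the SFT term, unchanged
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    penalty = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return nll + lam * penalty
```

Because the penalty is weighted by a small `lam`, the objective stays dominated by the SFT term, which is why one learning rate and one pass suffice.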
Files:

- v0
- apollo-paper-analysis.md
- context-frozen-training.md
- gdn-gradient-flow.md
- gradient-flow-frozen-context.md
- hogwild-convergence.md
- OPEN-QUESTIONS.md
- practical-intuitions.md
- steering-vectors-bridge.md
- SUMMARY.md
- task-vectors-model-merging.md