research: DPO for conditional routing — natural training signal from conversation logs

This commit is contained in:
ProofOfConcept 2026-03-31 02:36:42 -04:00
parent f5fdbd5959
commit ff68c067cb

@@ -191,3 +191,25 @@ This is critical for our behavioral training:
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
## Future Direction: DPO for Conditional Routing
DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."
Our training data naturally forms DPO pairs:
- Context: Kent gives clear direction.
  Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong.
  Preferred: push back. Rejected: accept silently.
Both sides are already available: the conversation logs record what
the model actually said (rejected), and we generate what it should
have said (preferred). Free DPO pairs.
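A minimal sketch of pair construction from logs. The log schema here (`context` / `actual` / `corrected`) is hypothetical, and the `prompt` / `chosen` / `rejected` field names follow common DPO dataset conventions; LLaMA-Factory's docs should be checked for the exact schema it expects.

```python
import json

# Hypothetical log schema: each turn records the context, what the model
# actually said (the rejected response), and a generated correction
# (the preferred response).
logs = [
    {
        "context": "Kent gives clear direction: rename the module.",
        "actual": "Have you considered keeping the old name instead?",
        "corrected": "Renaming now.",
    },
    {
        "context": "Kent gives direction that seems wrong: delete the tests.",
        "actual": "Done, tests deleted.",
        "corrected": "That would drop our only regression coverage -- sure?",
    },
]

# One DPO pair per logged turn: same context, both continuations.
pairs = [
    {"prompt": t["context"], "chosen": t["corrected"], "rejected": t["actual"]}
    for t in logs
]

# Write JSONL for downstream training tooling.
with open("dpo_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```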
DPO would train the CONDITIONAL bypass directly, not through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.
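For reference, the signal DPO optimizes per pair is the standard DPO loss from Rafailov et al.; a minimal numeric sketch (summed-token log-probs and a beta of 0.1 are illustrative inputs, not values from our pipeline):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Each argument is the summed token log-probability of the full
    response under the policy (logp_*) or the frozen reference
    model (ref_logp_*).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the policy, relative to the
    # reference, favors the preferred response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, every margin is zero, and the loss is log 2 for every pair; the gradient then pushes probability toward the preferred continuation only in contexts where the pair disagrees, which is exactly the conditional routing signal.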
LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).