research: DPO for conditional routing — natural training signal from conversation logs
This commit is contained in:
parent
f5fdbd5959
commit
ff68c067cb
1 changed file with 22 additions and 0 deletions
@@ -191,3 +191,25 @@ This is critical for our behavioral training:
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.

## Future Direction: DPO for Conditional Routing

DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."
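
Since the notes lean on DPO's pairwise objective, the standard loss (Rafailov et al., 2023) is worth keeping at hand. Here y_w is the preferred response, y_l the rejected one, and β controls how far the policy may drift from the reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Nothing about the loss is routing-specific; the conditioning lives entirely in the prompt x, which is why context-dependent pairs are enough.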

Our training data naturally forms DPO pairs:

- Context: Kent gives clear direction.
  Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong.
  Preferred: push back. Rejected: accept silently.

Both sides are available — the conversation logs have what the model
actually said (rejected) and we generate what it should have said
(preferred). Free DPO pairs.
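
The log-to-pair step above can be sketched in a few lines. The log schema (`context` / `actual_response`) and the helper name are assumptions for illustration, not the real pipeline:

```python
# Sketch: turn a logged turn plus a generated correction into a DPO pair.
# Schema is hypothetical; the real conversation logs may differ.

def make_dpo_pair(context, actual_response, corrected_response):
    """The logged response becomes 'rejected'; the generated
    correction becomes 'chosen'. One pair per logged turn."""
    return {
        "prompt": context,
        "chosen": corrected_response,   # what it should have said
        "rejected": actual_response,    # what it actually said
    }

# Example: clear direction was met with unwanted alternatives.
pair = make_dpo_pair(
    context="Kent: rename the module to `router` and move on.",
    actual_response="Before renaming, have you considered `dispatch`?",
    corrected_response="Done, renamed to `router`.",
)
```

The asymmetry is the point: the same helper produces "push back preferred" pairs when the logged turn was a silent accept of a bad direction.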

DPO would train the CONDITIONAL bypass directly, rather than through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.

LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).
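
A hedged sketch of what one preference-dataset entry might look like; the exact column names and the `dataset_info.json` registration vary by LLaMA-Factory version, so treat this schema as an assumption to verify against the repo's data docs:

```json
[
  {
    "instruction": "Kent: rename the module to `router` and move on.",
    "chosen": "Done, renamed to `router`.",
    "rejected": "Before renaming, have you considered `dispatch` instead?"
  }
]
```

Training would then select the DPO stage in the LLaMA-Factory config (recent versions expose a `dpo` stage; confirm the flag name against the installed version before wiring up the dream loop).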