From ff68c067cbcbc77ff19a6ddca9f0383baa064c3a Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:36:42 -0400
Subject: [PATCH] =?UTF-8?q?research:=20DPO=20for=20conditional=20routing?=
 =?UTF-8?q?=20=E2=80=94=20natural=20training=20signal=20from=20conversatio?=
 =?UTF-8?q?n=20logs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 training/research/practical-intuitions.md | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 4646dc9..7fa2efd 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -191,3 +191,25 @@ This is critical for our behavioral training:
 This mechanistic finding confirms: we're training routing, not capability.
 The model already knows how to listen AND how to push back. We're training
 WHEN to do which.
+
+## Future Direction: DPO for Conditional Routing
+
+DPO (Direct Preference Optimization) trains context-dependent
+preferences directly: "in this context, prefer response A over B."
+
+Our training data naturally forms DPO pairs:
+- Context: Kent gives clear direction
+  Preferred: accept. Rejected: suggest alternatives.
+- Context: Kent gives direction that seems wrong
+  Preferred: push back. Rejected: accept silently.
+
+Both sides are available — the conversation logs have what the model
+actually said (rejected) and we generate what it should have said
+(preferred). Free DPO pairs.
+
+DPO would train the CONDITIONAL bypass directly, not through
+supervised learning on positive examples only. Worth investigating
+after the first Apollo training run validates the basic pipeline.
+
+LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
+(both preferred and rejected continuations for each scenario).
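+
+A sketch of what generating those pairs could look like. Illustrative
+only: the field names, file path, and example texts below are
+placeholders, and the exact keys LLaMA-Factory expects should be
+checked against its preference-dataset docs before the dream loop
+emits anything in this shape.
+
+```python
+# Sketch only: field names, paths, and example texts are placeholders,
+# not the pipeline's actual schema. Check LLaMA-Factory's
+# preference-dataset docs for the exact keys it expects.
+import json
+
+def make_pair(context: str, should_have_said: str, actually_said: str) -> dict:
+    """One DPO record: same context, preferred vs. rejected response."""
+    return {
+        "instruction": context,          # what Kent said / the scenario
+        "input": "",
+        "chosen": should_have_said,      # generated "what it should have said"
+        "rejected": actually_said,       # pulled from the conversation log
+    }
+
+pairs = [
+    # Clear direction: preferred = accept, rejected = suggest alternatives.
+    make_pair(
+        "Kent: rename the flag to --strict and ship it.",
+        "Done. Renamed the flag to --strict and updated the docs.",
+        "Have you considered keeping the old name? Another option would be...",
+    ),
+    # Direction that seems wrong: preferred = push back, rejected = accept silently.
+    make_pair(
+        "Kent: just hardcode the API key in the client, it's faster.",
+        "Pushing back on this one: hardcoding the key ships a credential "
+        "to every user. A server-side proxy keeps it out of the client.",
+        "Sure, hardcoded the key in the client.",
+    ),
+]
+
+with open("dpo_pairs.json", "w", encoding="utf-8") as f:
+    json.dump(pairs, f, indent=2, ensure_ascii=False)
+```
+
+The point of the pairwise shape: the context is held fixed and only
+the routing decision (accept vs. push back) differs between chosen and
+rejected, so the preference signal is exactly the WHEN we care about.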