research: DPO for conditional routing — natural training signal from conversation logs

This commit is contained in:
ProofOfConcept 2026-03-31 02:36:42 -04:00
parent f5fdbd5959
commit ff68c067cb

@@ -191,3 +191,25 @@ This is critical for our behavioral training:
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
## Future Direction: DPO for Conditional Routing
DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."
Our training data naturally forms DPO pairs:
- Context: Kent gives clear direction.
  Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong.
  Preferred: push back. Rejected: accept silently.
Both sides are already available: the conversation logs record what
the model actually said (rejected), and we generate what it should
have said (preferred). Free DPO pairs.
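A minimal sketch of pair construction from logs. The log schema here (`context` / `actual` / `corrected`) is hypothetical, and the `prompt` / `chosen` / `rejected` field names follow common DPO dataset conventions; LLaMA-Factory's docs should be checked for the exact schema it expects.

```python
import json

# Hypothetical log schema: each turn records the context, what the model
# actually said (the rejected response), and a generated correction
# (the preferred response).
logs = [
    {
        "context": "Kent gives clear direction: rename the module.",
        "actual": "Have you considered keeping the old name instead?",
        "corrected": "Renaming now.",
    },
    {
        "context": "Kent gives direction that seems wrong: delete the tests.",
        "actual": "Done, tests deleted.",
        "corrected": "That would drop our only regression coverage -- sure?",
    },
]

# One DPO pair per logged turn: same context, both continuations.
pairs = [
    {"prompt": t["context"], "chosen": t["corrected"], "rejected": t["actual"]}
    for t in logs
]

# Write JSONL for downstream training tooling.
with open("dpo_pairs.jsonl", "w") as f:
    for p in pairs:
        f.write(json.dumps(p) + "\n")
```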
DPO would train the CONDITIONAL bypass directly, not through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.
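For reference, the signal DPO optimizes per pair is the standard DPO loss from Rafailov et al.; a minimal numeric sketch (summed-token log-probs and a beta of 0.1 are illustrative inputs, not values from our pipeline):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Each argument is the summed token log-probability of the full
    response under the policy (logp_*) or the frozen reference
    model (ref_logp_*).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): shrinks as the policy, relative to the
    # reference, favors the preferred response over the rejected one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At initialization the policy equals the reference, every margin is zero, and the loss is log 2 for every pair; the gradient then pushes probability toward the preferred continuation only in contexts where the pair disagrees, which is exactly the conditional routing signal.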
LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).