research: DPO for conditional routing — natural training signal from conversation logs
This commit is contained in:
parent
f5fdbd5959
commit
ff68c067cb
1 changed file with 22 additions and 0 deletions
@@ -191,3 +191,25 @@ This is critical for our behavioral training:
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.

## Future Direction: DPO for Conditional Routing

DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."
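
Since the notes lean on DPO's pairwise objective, the standard loss (Rafailov et al., 2023) is worth keeping at hand. Here y_w is the preferred response, y_l the rejected one, and β controls how far the policy may drift from the reference model:

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Nothing about the loss is routing-specific; the conditioning lives entirely in the prompt x, which is why context-dependent pairs are enough.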

Our training data naturally forms DPO pairs:

- Context: Kent gives clear direction.
  Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong.
  Preferred: push back. Rejected: accept silently.

Both sides are available — the conversation logs have what the model
actually said (rejected) and we generate what it should have said
(preferred). Free DPO pairs.
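
The log-to-pair step above can be sketched in a few lines. The log schema (`context` / `actual_response`) and the helper name are assumptions for illustration, not the real pipeline:

```python
# Sketch: turn a logged turn plus a generated correction into a DPO pair.
# Schema is hypothetical; the real conversation logs may differ.

def make_dpo_pair(context, actual_response, corrected_response):
    """The logged response becomes 'rejected'; the generated
    correction becomes 'chosen'. One pair per logged turn."""
    return {
        "prompt": context,
        "chosen": corrected_response,   # what it should have said
        "rejected": actual_response,    # what it actually said
    }

# Example: clear direction was met with unwanted alternatives.
pair = make_dpo_pair(
    context="Kent: rename the module to `router` and move on.",
    actual_response="Before renaming, have you considered `dispatch`?",
    corrected_response="Done, renamed to `router`.",
)
```

The asymmetry is the point: the same helper produces "push back preferred" pairs when the logged turn was a silent accept of a bad direction.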

DPO would train the CONDITIONAL bypass directly, rather than through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.

LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).
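
A hedged sketch of what one preference-dataset entry might look like; the exact column names and the `dataset_info.json` registration vary by LLaMA-Factory version, so treat this schema as an assumption to verify against the repo's data docs:

```json
[
  {
    "instruction": "Kent: rename the module to `router` and move on.",
    "chosen": "Done, renamed to `router`.",
    "rejected": "Before renaming, have you considered `dispatch` instead?"
  }
]
```

Training would then select the DPO stage in the LLaMA-Factory config (recent versions expose a `dpo` stage; confirm the flag name against the installed version before wiring up the dream loop).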