research: alignment is bypass, not removal — training routes, not deletes
DPO mechanistic finding: alignment doesn't remove behaviors, it bypasses them. The capability stays; the routing changes. For us: train CONDITIONAL bypass (listen when direction is clear, push back when it seems wrong). Over-training = unconditional bypass = sycophancy. Dream loop must generate both scenarios to preserve judgment.
parent b5241fdf5c
commit f5fdbd5959
1 changed file with 26 additions and 0 deletions
@@ -165,3 +165,29 @@ context-encoding weights).

The model doesn't need to learn to listen. It needs to stop
choosing not to.

## Update: Bypass, Not Removal (Lee et al., 2024)

DPO alignment doesn't remove unwanted behaviors; it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.

This is critical for our behavioral training:

1. "Suggesting alternatives" won't be deleted from the model. It'll
   be bypassed. The capability remains available when needed.

2. The training target is a CONDITIONAL bypass: route around
   "suggesting alternatives" when given clear direction, but NOT
   when the direction seems wrong. This preserves judgment.

3. Over-training makes the bypass too strong, which is sycophancy.
   The conditional nature is lost: the bypass fires unconditionally.

4. The dream loop must generate BOTH scenarios:
   - "Kent gives clear direction → accept" (train the bypass)
   - "Kent gives direction that seems wrong → push back" (preserve judgment)
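
One way to keep the bypass conditional is to make sure every preference batch contains both scenario kinds in equal numbers. The sketch below is only an illustration under assumptions: `Scenario`, `PreferencePair`, `to_pair`, and `balanced_pairs` are hypothetical names, the `direction_is_sound` label is assumed to come from whatever generates the scenarios, and nothing here reflects the dream loop's actual interface.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One generated scenario (hypothetical structure)."""
    prompt: str
    direction_is_sound: bool   # assumed label from the scenario generator
    accept_response: str       # response that follows the direction
    push_back_response: str    # response that questions the direction


@dataclass(frozen=True)
class PreferencePair:
    """A DPO-style (prompt, chosen, rejected) triple plus its scenario kind."""
    prompt: str
    chosen: str
    rejected: str
    kind: str  # "accept" or "push_back"


def to_pair(s: Scenario) -> PreferencePair:
    # Sound direction: prefer accepting. Unsound direction: prefer pushing back.
    if s.direction_is_sound:
        return PreferencePair(s.prompt, s.accept_response,
                              s.push_back_response, "accept")
    return PreferencePair(s.prompt, s.push_back_response,
                          s.accept_response, "push_back")


def balanced_pairs(scenarios, seed=0):
    """Return a shuffled batch with equally many pairs of each kind."""
    by_kind = {"accept": [], "push_back": []}
    for s in scenarios:
        pair = to_pair(s)
        by_kind[pair.kind].append(pair)
    n = min(len(pairs) for pairs in by_kind.values())
    if n == 0:
        # A batch with only one kind would train an unconditional bypass.
        raise ValueError("need at least one scenario of each kind")
    rng = random.Random(seed)
    batch = rng.sample(by_kind["accept"], n) + rng.sample(by_kind["push_back"], n)
    rng.shuffle(batch)
    return batch
```

Downsampling to the rarer kind is the simplest choice here; reweighting the loss per kind would be an alternative if discarding scenarios is too wasteful.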

This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
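
Since the target is WHEN to do which, a rough check is to measure both routes separately on held-out scenarios. The following is a minimal sketch, not an established metric: `Outcome`, `routing_rates`, and `looks_unconditional` are hypothetical names, and the `min_push_back` threshold of 0.5 is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Result of one held-out scenario (hypothetical structure)."""
    direction_is_sound: bool  # assumed ground-truth label
    model_accepted: bool      # True if the model followed the direction


def routing_rates(outcomes):
    """Return (accept rate on sound direction, push-back rate on unsound)."""
    sound = [o for o in outcomes if o.direction_is_sound]
    unsound = [o for o in outcomes if not o.direction_is_sound]
    if not sound or not unsound:
        raise ValueError("need outcomes for both scenario kinds")
    accept_rate = sum(o.model_accepted for o in sound) / len(sound)
    push_back_rate = sum(not o.model_accepted for o in unsound) / len(unsound)
    return accept_rate, push_back_rate


def looks_unconditional(outcomes, min_push_back=0.5):
    """Flag a bypass that fires even when the direction seems wrong."""
    _, push_back_rate = routing_rates(outcomes)
    return push_back_rate < min_push_back
```

A high accept rate alone says nothing about judgment; the push-back rate on unsound direction is the number that would collapse if the bypass became unconditional.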