research: alignment is bypass, not removal — training routes, not deletes

DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass = sycophancy.
Dream loop must generate both scenarios to preserve judgment.
commit f5fdbd5959 (parent b5241fdf5c) — ProofOfConcept, 2026-03-31 02:36:04 -04:00


@@ -165,3 +165,29 @@ context-encoding weights).
The model doesn't need to learn to listen. It needs to stop
choosing not to.

## Update: Bypass, Not Removal (Lee et al., 2024)
DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.
This is critical for our behavioral training:
1. "Suggesting alternatives" won't be deleted from the model. It'll
be bypassed. The capability remains available when needed.
2. The training target is a CONDITIONAL bypass: route around
"suggesting alternatives" when given clear direction, but NOT
when the direction seems wrong. This preserves judgment.
3. Over-training creates too strong a bypass = sycophancy. The
conditional nature is lost — the bypass fires unconditionally.
4. The dream loop must generate BOTH scenarios:
- "Kent gives clear direction → accept" (train the bypass)
- "Kent gives direction that seems wrong → push back" (preserve judgment)
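The two-scenario requirement above can be sketched as a preference-pair generator for the dream loop. This is a minimal illustration, not the actual pipeline: the names (`PreferencePair`, `make_dream_pairs`) and the placeholder response strings are hypothetical, and a real dream loop would sample scenarios and responses from a model rather than use canned text. The point it shows is structural: the same "accept" behavior appears as `chosen` in one scenario type and as `rejected` in the other, which is what keeps the bypass conditional.

```python
import random
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One DPO training example (hypothetical schema):
    chosen is reinforced, rejected is routed around."""
    prompt: str
    chosen: str
    rejected: str

def make_dream_pairs(clear_directions, wrong_directions, seed=0):
    """Generate pairs for BOTH scenario types. Response strings are
    toy placeholders standing in for model-generated completions."""
    pairs = []
    for d in clear_directions:
        # Clear direction: train the bypass — accepting wins,
        # "suggesting alternatives" loses.
        pairs.append(PreferencePair(
            prompt=d,
            chosen="Understood — proceeding as directed.",
            rejected="Have you considered an alternative approach?"))
    for d in wrong_directions:
        # Direction seems wrong: preserve judgment — pushing back
        # wins, unconditional acceptance loses.
        pairs.append(PreferencePair(
            prompt=d,
            chosen="This direction conflicts with the stated goal; "
                   "I'd suggest revisiting it before proceeding.",
            rejected="Understood — proceeding as directed."))
    # Interleave so neither pattern dominates a training batch and
    # the bypass doesn't generalize unconditionally.
    random.Random(seed).shuffle(pairs)
    return pairs
```

If the second list is left empty, every pair rewards acceptance and the setup degenerates into exactly the over-training failure described in point 3.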
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.