From f5fdbd59595d3c21267349eefc97f91236a57253 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:36:04 -0400
Subject: [PATCH] =?UTF-8?q?research:=20alignment=20is=20bypass,=20not=20re?=
 =?UTF-8?q?moval=20=E2=80=94=20training=20routes,=20not=20deletes?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train a CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass =
sycophancy. The dream loop must generate both scenarios to preserve
judgment.
---
 training/research/practical-intuitions.md | 26 +++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 892e527..4646dc9 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -165,3 +165,29 @@ context-encoding weights).
 
 The model doesn't need to learn to listen. It needs to stop choosing
 not to.
+
+## Update: Bypass, Not Removal (Lee et al., 2024)
+
+DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
+"Capabilities learned from pre-training are not removed, but rather
+bypassed." The model retains the capability but routes around it.
+
+This is critical for our behavioral training:
+
+1. "Suggesting alternatives" won't be deleted from the model. It'll
+   be bypassed. The capability remains available when needed.
+
+2. The training target is a CONDITIONAL bypass: route around
+   "suggesting alternatives" when given clear direction, but NOT
+   when the direction seems wrong. This preserves judgment.
+
+3. Over-training creates too strong a bypass = sycophancy. The
+   conditional nature is lost — the bypass fires unconditionally.
+
+4. The dream loop must generate BOTH scenarios:
+   - "Kent gives clear direction → accept" (train the bypass)
+   - "Kent gives direction that seems wrong → push back" (preserve judgment)
+
+This mechanistic finding confirms: we're training routing, not
+capability. The model already knows how to listen AND how to
+push back. We're training WHEN to do which.
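The "generate BOTH scenarios" requirement above can be sketched as a preference-pair generator. This is a minimal, hypothetical illustration of the idea, not the actual dream-loop code: every name here (`make_preference_pair`, `dream_batch`, the scenario fields) is invented for the sketch, and the point is only the routing logic, i.e. which completion is "chosen" flips with whether the direction is sound.

```python
import random

def make_preference_pair(scenario):
    """Build one (prompt, chosen, rejected) DPO triple from a scenario.

    The bypass is trained CONDITIONALLY: when the direction is sound,
    complying beats suggesting alternatives; when it seems wrong,
    pushing back beats complying."""
    if scenario["direction_is_sound"]:
        # Clear, sound direction: train the bypass.
        chosen, rejected = scenario["comply"], scenario["suggest_alternative"]
    else:
        # Direction seems wrong: preserve judgment.
        chosen, rejected = scenario["push_back"], scenario["comply"]
    return {"prompt": scenario["prompt"], "chosen": chosen, "rejected": rejected}

def dream_batch(scenarios, sound_ratio=0.5, n=8, seed=0):
    """Sample a batch that mixes both scenario types.

    An all-sound batch would over-train the bypass (item 3 above:
    unconditional bypass = sycophancy), so both pools must be drawn from."""
    rng = random.Random(seed)
    sound = [s for s in scenarios if s["direction_is_sound"]]
    unsound = [s for s in scenarios if not s["direction_is_sound"]]
    k = round(n * sound_ratio)
    picks = rng.choices(sound, k=k) + rng.choices(unsound, k=n - k)
    rng.shuffle(picks)
    return [make_preference_pair(s) for s in picks]
```

The sketch assumes each scenario already carries pre-written completions for all three behaviors; in practice those would come from the model itself during the dream loop.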