DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass = sycophancy.
Dream loop must generate both scenarios to preserve judgment.
10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.
Predictions: 10-50 examples for measurable change, one epoch,
lr=1e-5 to start. Over-training is easy (10 counter-examples undo
a disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'
Key intuition: the model doesn't need to learn to listen. It needs
to stop choosing not to.