Commit graph

1 commit

Author SHA1 Message Date
ProofOfConcept
b5241fdf5c research: practical intuitions — what will actually happen when we train
10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.

Predictions: 10-50 examples for measurable change, one epoch,
lr=1e-5 to start. Over-training is easy (10 counter-examples undo
a disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'

Key intuition: the model doesn't need to learn to listen. It needs
to stop choosing not to.
2026-03-31 02:35:03 -04:00