- Findings: 10 examples broke safety alignment (Qi et al.); 1,000 curated examples matched GPT-4 (LIMA); multi-epoch training degrades performance (Raschka); models "unlearn arithmetic" when the training data lacks it.
- Predictions: 10-50 examples for a measurable change, one epoch, lr = 1e-5 to start (see the sketch below).
- Over-training is easy: ~10 counter-examples can undo a disposition.
- Main risk: sycophancy from a narrow training signal. Defense: diverse examples, including "when to push back."
- Key intuition: the model doesn't need to learn to listen; it needs to stop choosing not to.
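A minimal sketch of the predicted starting recipe (small curated set, single epoch, lr = 1e-5), assuming a Hugging Face causal LM trained with plain PyTorch. The model name, example set, batch size, and sequence length are placeholders, not a setup taken from these notes.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any instruct-tuned causal LM
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16).to(device)

# 10-50 curated (prompt, response) pairs; include "when to push back" cases
# so a narrow training signal doesn't collapse into sycophancy.
examples = [
    {"prompt": "...", "response": "..."},  # placeholder entries
]

def collate(batch):
    texts = [ex["prompt"] + ex["response"] + tokenizer.eos_token for ex in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True, max_length=1024)
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # no loss on padding tokens
    enc["labels"] = labels
    return enc

loader = DataLoader(examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # conservative starting lr

model.train()
for batch in loader:  # one epoch only; repeated passes over a tiny set degrade the model
    batch = {k: v.to(device) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("finetuned-checkpoint")
```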
Files:

- v0/
- apollo-paper-analysis.md
- context-frozen-training.md
- gdn-gradient-flow.md
- gradient-flow-frozen-context.md
- hogwild-convergence.md
- OPEN-QUESTIONS.md
- practical-intuitions.md
- steering-vectors-bridge.md
- SUMMARY.md
- task-vectors-model-merging.md