lr isn't speed, it's trust-per-example. At 27B params, lr=1e-5 works out to roughly 27e9 × 1e-5 ≈ 270K parameters' worth of total adjustment per example. The coherent direction emerges from many votes (examples). Apollo's moments smooth the noise. DPO needs a lower lr because comparative votes (chosen vs. rejected) are noisier than absolute votes.
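The back-of-envelope arithmetic behind the 270K figure can be sketched directly; the variable names here are illustrative, not from any library:

```python
# "Trust-per-example" heuristic: learning rate times parameter count
# gives a rough adjustment budget per example, in parameter units.
param_count = 27e9   # 27B-parameter model
lr = 1e-5            # per-example learning rate

budget = param_count * lr   # 27e9 * 1e-5 = 270,000
print(f"{budget:.0f}")      # 270000
```

Under this reading, lowering lr for DPO shrinks the per-example budget, which is the appropriate response to a noisier (comparative) vote.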