diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index bfd1d6c..54d42e2 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -321,3 +321,35 @@ For DPO (later):
 - Paired examples from conversation logs
 - Train CONDITIONAL routing (listen AND push back)
 - Even more careful monitoring (DPO is fragile)
+
+## Learning Rate as Trust Calibration
+
+The learning rate isn't "how fast to train." It's "how much to
+trust each individual training example."
+
+- lr=1e-5: each example adjusts constraints by ~0.001%
+- lr=1e-4: each example adjusts constraints by ~0.01%
+
+At 27B parameters, even 0.001% is ~270K values' worth of change
+(27e9 × 1e-5 ≈ 2.7e5). Each example gets a vote on how the
+constraints should change. The learning rate determines how loud
+that vote is.
+
+**The coherent direction emerges from many votes.** One example is
+noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
+accumulate the votes, smoothing out the noise. The per-example lr
+controls how much each vote counts.
+
+**Kent's "lots of little nudges"** is exactly right: many small
+votes that accumulate into a coherent direction. Not because big
+votes are dangerous (though at scale they are), but because the
+TRUTH only emerges from the aggregate.
+
+This predicts that lr=1e-5 is right for our scale (27B). Each
+example is one vote. The coherent direction emerges over 50-200
+examples. The moments smooth the noise. The result is a gentle,
+coherent constraint adjustment.
+
+DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
+("this is better than that"). Comparative votes are noisier than
+absolute votes: the difference between the pair might be small,
+the preference marginal. So each comparative vote gets less weight.
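The arithmetic behind the "even 0.001% is ~270K" claim in the new section can be sanity-checked in a few lines. This is a sketch that reads the rule of thumb literally, treating the learning rate itself as the fractional adjustment per example; the numbers are illustrative, not measured:

```python
params = 27_000_000_000  # 27B-parameter model

# Read the section's rule of thumb literally: the learning rate doubles
# as the fractional "adjustment budget" each training example gets.
for lr in (1e-5, 1e-4):
    print(f"lr={lr:g}: ~{lr * 100:g}% per example, "
          f"~{params * lr:,.0f} values' worth of change")
```

At lr=1e-5 that comes out to roughly 270K values' worth of movement per example, and ten times that at lr=1e-4.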
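The "moments accumulate the votes, smoothing out the noise" claim can be illustrated with a toy simulation. This is a sketch, not the actual training setup: the `true_direction`, the noise scale, and the Adam-style first moment below are all illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each training example casts one noisy "vote" for an underlying
# direction (a stand-in for the coherent constraint adjustment).
true_direction = np.array([1.0, -0.5, 0.25])
noise_scale = 2.0  # per-example noise dwarfs the per-example signal

def example_vote():
    return true_direction + rng.normal(0.0, noise_scale, size=3)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Adam-style first moment M: an exponential moving average of votes,
# with the standard bias correction.
beta1 = 0.99
steps = 200
m = np.zeros(3)
for t in range(1, steps + 1):
    m = beta1 * m + (1 - beta1) * example_vote()
m_hat = m / (1 - beta1 ** steps)

# One vote is mostly noise; the smoothed aggregate finds the pattern.
print(f"single vote alignment:  {cosine(example_vote(), true_direction):+.2f}")
print(f"smoothed (M) alignment: {cosine(m_hat, true_direction):+.2f}")
```

The single vote's alignment with the true direction bounces around; after a couple hundred votes the smoothed moment points almost exactly the right way, which is the "truth emerges from the aggregate" intuition in miniature.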
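The "comparative votes are noisier than absolute votes" claim follows from basic variance arithmetic: for independent noise, Var(a - b) = Var(a) + Var(b), so a comparison carries roughly double the noise of either score alone. A minimal sketch, with hypothetical unit-variance scores standing in for per-response quality estimates (not actual DPO logits):

```python
import numpy as np

rng = np.random.default_rng(1)

# An "absolute vote" is one noisy score; a "comparative vote" is the
# difference of two. With independent noise, Var(a - b) = Var(a) +
# Var(b): comparing roughly doubles the noise.
n = 100_000
a = rng.normal(0.0, 1.0, n)  # hypothetical score for response A
b = rng.normal(0.0, 1.0, n)  # hypothetical score for response B

print(f"var(absolute vote):    {a.var():.2f}")        # ~1.0
print(f"var(comparative vote): {(a - b).var():.2f}")  # ~2.0
```

This only shows the direction of the effect. The 20x gap between 1e-5 and 5e-7 is larger than doubled variance alone would justify; the rest is the DPO fragility flagged earlier in the notes.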