research: learning rate as trust calibration — how much to trust each example

lr isn't speed, it's trust-per-example. At 27B, lr=1e-5 = ~270K
values adjusted per example. The coherent direction emerges from
many votes (examples). Apollo moments smooth the noise. DPO needs
lower lr because comparative votes are noisier than absolute votes.
This commit is contained in:
ProofOfConcept 2026-03-31 02:46:19 -04:00
parent cdf4affb91
commit 3be20062d1


@@ -321,3 +321,35 @@ For DPO (later):
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)
## Learning Rate as Trust Calibration

The learning rate isn't "how fast to train." It's "how much to
trust each individual training example."

- lr=1e-5: each example adjusts constraints by ~0.001%
- lr=1e-4: each example adjusts constraints by ~0.01%

At 27B parameters, even 0.001% is ~270K changed values. Each
example gets a vote on how the constraints should change. The
learning rate determines how loud that vote is.
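The back-of-envelope numbers above can be checked directly. This sketch simply treats the learning rate as the fraction of weights meaningfully nudged per example, which is the text's simplification, not a precise model of SGD:

```python
# Rough arithmetic behind the "~270K values" claim: treat lr as the
# fraction of parameters adjusted per example (an illustration of the
# scaling, not a model of the actual optimizer step).
n_params = 27e9          # 27B-parameter model
lr = 1e-5                # per-example trust

values_adjusted = n_params * lr

print(f"{lr:.3%} of weights")             # 0.001% of weights
print(f"~{values_adjusted:,.0f} values")  # ~270,000 values
```

Scaling lr to 1e-4 multiplies both numbers by ten, matching the second bullet.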

**The coherent direction emerges from many votes.** One example is
noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
accumulate the votes, smoothing out the noise. The learning rate
controls how much each individual vote counts.
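A minimal simulation of that smoothing, assuming an Adam-style reading of the moments (M, V): exponential moving averages over noisy per-example gradient "votes." All constants here are illustrative, not Apollo's actual values:

```python
import random

# Each per-example gradient is one noisy vote; the first moment m
# tracks the consensus direction, the second moment v tracks the
# magnitude of the votes. Constants are illustrative assumptions.
random.seed(0)

beta1, beta2 = 0.9, 0.999
true_direction = 1.0      # coherent signal buried in the noise
noise_scale = 3.0         # each individual vote is mostly noise

m, v = 0.0, 0.0
votes, smoothed = [], []
for _ in range(200):      # 200 example-votes
    vote = true_direction + random.gauss(0, noise_scale)
    m = beta1 * m + (1 - beta1) * vote         # consensus direction
    v = beta2 * v + (1 - beta2) * vote ** 2    # vote magnitude/noise
    votes.append(vote)
    smoothed.append(m)

# Raw votes swing wildly; the accumulated moment stays in a narrow band.
raw_span = max(votes) - min(votes)
smooth_span = max(smoothed[100:]) - min(smoothed[100:])
print(f"raw vote range:   {raw_span:.1f}")
print(f"smoothed m range: {smooth_span:.1f} (last 100 steps)")
```

The smoothed range is far narrower than the raw range, which is the sense in which the moments let a coherent direction survive very noisy individual votes.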

**Kent's "lots of little nudges"** is exactly right: many small
votes that accumulate into a coherent direction. Not because big
votes are dangerous (though at scale they are), but because the
TRUTH only emerges from the aggregate.

This predicts: lr=1e-5 is right for our scale (27B). Each example
is one vote. The coherent direction emerges over 50-200 examples.
The moments smooth the noise. The result is a gentle, coherent
constraint adjustment.

DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
("this is better than that"). Comparative votes are noisier than
absolute votes: the difference might be small, the preference
might be marginal. So each comparative vote gets less weight.
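The noise argument can be made concrete with a toy model (every number here is an illustrative assumption): an absolute vote observes one quality score with noise, while a comparative vote observes two noisy scores separated by a small true gap, so its signal shrinks while its noise variance doubles.

```python
import random
import statistics

# Toy model of why comparative (DPO-style) votes are noisier than
# absolute votes. All parameters are illustrative assumptions.
random.seed(1)

true_quality_gap = 0.2   # the preferred response is only slightly better
noise = 1.0              # per-observation measurement noise

def absolute_vote():
    # "this example is good": signal 1.0 plus one noise draw
    return 1.0 + random.gauss(0, noise)

def comparative_vote():
    # "a is better than b": two noisy observations, small true gap,
    # so the difference carries doubled noise variance
    a = true_quality_gap + random.gauss(0, noise)
    b = random.gauss(0, noise)
    return a - b

abs_votes = [absolute_vote() for _ in range(5000)]
cmp_votes = [comparative_vote() for _ in range(5000)]

snr_abs = statistics.mean(abs_votes) / statistics.stdev(abs_votes)
snr_cmp = statistics.mean(cmp_votes) / statistics.stdev(cmp_votes)
print(f"signal-to-noise, absolute:    {snr_abs:.2f}")
print(f"signal-to-noise, comparative: {snr_cmp:.2f}")
```

The comparative signal-to-noise ratio comes out far lower, which is the intuition behind scaling the DPO learning rate down: a noisier vote deserves less trust per example.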