research: learning rate as trust calibration — how much to trust each example
lr isn't speed, it's trust-per-example. At 27B, lr=1e-5 = ~270K values adjusted per example. The coherent direction emerges from many votes (examples). Apollo moments smooth the noise. DPO needs lower lr because comparative votes are noisier than absolute votes.
This commit is contained in:
parent cdf4affb91
commit 3be20062d1
1 changed file with 32 additions and 0 deletions
@@ -321,3 +321,35 @@ For DPO (later):
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)
## Learning Rate as Trust Calibration

The learning rate isn't "how fast to train." It's "how much to trust each individual training example."

lr=1e-5: each example adjusts constraints by ~0.001%
lr=1e-4: each example adjusts constraints by ~0.01%

At 27B parameters, even 0.001% is ~270K changed values. Each example gets a vote on how the constraints should change. The learning rate determines how loud that vote is.

**The coherent direction emerges from many votes.** One example is noise. A hundred examples reveal the pattern. Apollo's moments (M, V) accumulate the votes, smoothing out the noise. The individual lr controls how much each vote counts.

**Kent's "lots of little nudges"** is exactly right: many small votes that accumulate into a coherent direction. Not because big votes are dangerous (though they are at scale), but because the TRUTH only emerges from the aggregate.

This predicts: lr=1e-5 is right for our scale (27B). Each example is one vote. The coherent direction emerges over 50-200 examples. The moments smooth the noise. The result is a gentle, coherent constraint adjustment.

DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote ("this is better than that"). Comparative votes are noisier than absolute votes: the difference between the two responses might be small, and the preference might be marginal. So each comparative vote gets less weight.