From 3be20062d15d944c42b1dd74a0bd1b2812a40c01 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:46:19 -0400
Subject: [PATCH] =?UTF-8?q?research:=20learning=20rate=20as=20trust=20cali?=
 =?UTF-8?q?bration=20=E2=80=94=20how=20much=20to=20trust=20each=20example?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

lr isn't speed, it's trust-per-example. At 27B, lr=1e-5 = ~270K values
adjusted per example. The coherent direction emerges from many votes
(examples). Apollo moments smooth the noise. DPO needs lower lr because
comparative votes are noisier than absolute votes.
---
 training/research/practical-intuitions.md | 32 +++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index bfd1d6c..54d42e2 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -321,3 +321,35 @@ For DPO (later):
 - Paired examples from conversation logs
 - Train CONDITIONAL routing (listen AND push back)
 - Even more careful monitoring (DPO is fragile)
+
+## Learning Rate as Trust Calibration
+
+The learning rate isn't "how fast to train." It's "how much to
+trust each individual training example."
+
+lr=1e-5: each example adjusts constraints by ~0.001%
+lr=1e-4: each example adjusts constraints by ~0.01%
+
+At 27B parameters, even 0.001% is ~270K changed values. Each
+example gets a vote on how the constraints should change. The
+learning rate determines how loud that vote is.
+
+**The coherent direction emerges from many votes.** One example is
+noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
+accumulate the votes, smoothing out the noise. The individual lr
+controls how much each vote counts.
+
+**Kent's "lots of little nudges"** is exactly right: many small
+votes that accumulate into a coherent direction. Not because big
+votes are dangerous (though they are at scale) but because the
+TRUTH only emerges from the aggregate.
+
+This predicts: lr=1e-5 is right for our scale (27B). Each example
+is one vote. The coherent direction emerges over 50-200 examples.
+The moments smooth the noise. The result is a gentle, coherent
+constraint adjustment.
+
+DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
+("this is better than that"). Comparative votes are noisier than
+absolute votes — the difference might be small, the preference
+might be marginal. So each comparative vote gets less weight.
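The arithmetic and the "many noisy votes" intuition in the patch can be sketched in a few lines. This is an illustration only, not code from the repo: the function names and the Gaussian noise model are ours, and the noise scales are arbitrary assumptions chosen to contrast absolute (SFT-style) votes with noisier comparative (DPO-style) votes.

```python
import random

# Scale arithmetic from the patch: at 27B params, lr=1e-5 ("0.001%")
# corresponds to roughly 270K adjusted values per example.
N_PARAMS = 27e9
LR_SFT = 1e-5
touched = int(N_PARAMS * LR_SFT)  # ~270K

def train(n_examples: int, lr: float, noise_scale: float, seed: int = 0) -> float:
    """Accumulate lr-scaled noisy 'votes' toward a true direction of 1.0.

    Each example's gradient is modeled as the true direction plus Gaussian
    noise; lr controls how loudly each vote counts, and averaging over many
    votes smooths the noise out.
    """
    rng = random.Random(seed)
    weight = 0.0
    for _ in range(n_examples):
        vote = 1.0 + rng.gauss(0.0, noise_scale)  # one example's noisy vote
        weight += lr * vote                        # a small, lr-scaled nudge
    return weight

# Absolute votes (SFT-style): moderate noise, lr=1e-5.
sft = train(200, 1e-5, noise_scale=1.0)
# Comparative votes (DPO-style): noisier signal, so a smaller lr=5e-7.
dpo = train(200, 5e-7, noise_scale=5.0)
```

After 200 examples, `sft` lands near `200 * 1e-5` despite per-example noise, while the DPO-style run moves far less per vote — the same "lots of little nudges" picture, with the lower lr compensating for the noisier comparative signal.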