From 3be20062d15d944c42b1dd74a0bd1b2812a40c01 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:46:19 -0400
Subject: [PATCH] =?UTF-8?q?research:=20learning=20rate=20as=20trust=20cali?=
 =?UTF-8?q?bration=20=E2=80=94=20how=20much=20to=20trust=20each=20example?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

lr isn't speed, it's trust-per-example. At 27B, lr=1e-5 = ~270K values
adjusted per example. The coherent direction emerges from many votes
(examples). Apollo moments smooth the noise. DPO needs lower lr because
comparative votes are noisier than absolute votes.
---
 training/research/practical-intuitions.md | 32 +++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index bfd1d6c..54d42e2 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -321,3 +321,35 @@ For DPO (later):
 - Paired examples from conversation logs
 - Train CONDITIONAL routing (listen AND push back)
 - Even more careful monitoring (DPO is fragile)
+
+## Learning Rate as Trust Calibration
+
+The learning rate isn't "how fast to train." It's "how much to
+trust each individual training example."
+
+lr=1e-5: each example adjusts constraints by ~0.001%
+lr=1e-4: each example adjusts constraints by ~0.01%
+
+At 27B parameters, even 0.001% is ~270K changed values. Each
+example gets a vote on how the constraints should change. The
+learning rate determines how loud that vote is.
+
+**The coherent direction emerges from many votes.** One example is
+noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
+accumulate the votes, smoothing out the noise. The individual lr
+controls how much each vote counts.
+
+**Kent's "lots of little nudges"** is exactly right: many small
+votes that accumulate into a coherent direction. Not because big
+votes are dangerous (though they are at scale) but because the
+TRUTH only emerges from the aggregate.
+
+This predicts: lr=1e-5 is right for our scale (27B). Each example
+is one vote. The coherent direction emerges over 50-200 examples.
+The moments smooth the noise. The result is a gentle, coherent
+constraint adjustment.
+
+DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
+("this is better than that"). Comparative votes are noisier than
+absolute votes — the difference might be small, the preference
+might be marginal. So each comparative vote gets less weight.
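The arithmetic and the "many noisy votes" intuition in the patch can be sketched in a few lines. This is an illustration only, not code from the repo: the function names and the Gaussian noise model are ours, and the noise scales are arbitrary assumptions chosen to contrast absolute (SFT-style) votes with noisier comparative (DPO-style) votes.

```python
import random

# Scale arithmetic from the patch: at 27B params, lr=1e-5 ("0.001%")
# corresponds to roughly 270K adjusted values per example.
N_PARAMS = 27e9
LR_SFT = 1e-5
touched = int(N_PARAMS * LR_SFT)  # ~270K

def train(n_examples: int, lr: float, noise_scale: float, seed: int = 0) -> float:
    """Accumulate lr-scaled noisy 'votes' toward a true direction of 1.0.

    Each example's gradient is modeled as the true direction plus Gaussian
    noise; lr controls how loudly each vote counts, and averaging over many
    votes smooths the noise out.
    """
    rng = random.Random(seed)
    weight = 0.0
    for _ in range(n_examples):
        vote = 1.0 + rng.gauss(0.0, noise_scale)  # one example's noisy vote
        weight += lr * vote                        # a small, lr-scaled nudge
    return weight

# Absolute votes (SFT-style): moderate noise, lr=1e-5.
sft = train(200, 1e-5, noise_scale=1.0)
# Comparative votes (DPO-style): noisier signal, so a smaller lr=5e-7.
dpo = train(200, 5e-7, noise_scale=5.0)
```

After 200 examples, `sft` lands near `200 * 1e-5` despite per-example noise, while the DPO-style run moves far less per vote — the same "lots of little nudges" picture, with the lower lr compensating for the noisier comparative signal.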