research: learning rate as trust calibration — how much to trust each example

lr isn't speed, it's trust-per-example. At 27B, lr=1e-5 = ~270K
values adjusted per example. The coherent direction emerges from
many votes (examples). Apollo moments smooth the noise. DPO needs
lower lr because comparative votes are noisier than absolute votes.
This commit is contained in:
ProofOfConcept 2026-03-31 02:46:19 -04:00
parent cdf4affb91
commit 3be20062d1


@@ -321,3 +321,35 @@ For DPO (later):
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)
## Learning Rate as Trust Calibration

The learning rate isn't "how fast to train." It's "how much to
trust each individual training example."

- lr=1e-5: each example adjusts constraints by ~0.001%
- lr=1e-4: each example adjusts constraints by ~0.01%

At 27B parameters, even 0.001% is ~270K changed values. Each
example gets a vote on how the constraints should change. The
learning rate determines how loud that vote is.
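The back-of-envelope numbers above can be checked directly. This sketch simply treats the learning rate as the fraction of weights meaningfully nudged per example, which is the text's simplification, not a precise model of SGD:

```python
# Rough arithmetic behind the "~270K values" claim: treat lr as the
# fraction of parameters adjusted per example (an illustration of the
# scaling, not a model of the actual optimizer step).
n_params = 27e9          # 27B-parameter model
lr = 1e-5                # per-example trust

values_adjusted = n_params * lr

print(f"{lr:.3%} of weights")             # 0.001% of weights
print(f"~{values_adjusted:,.0f} values")  # ~270,000 values
```

Scaling lr to 1e-4 multiplies both numbers by ten, matching the second bullet.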

**The coherent direction emerges from many votes.** One example is
noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
accumulate the votes, smoothing out the noise. The learning rate
controls how much each individual vote counts.
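A minimal simulation of that smoothing, assuming an Adam-style reading of the moments (M, V): exponential moving averages over noisy per-example gradient "votes." All constants here are illustrative, not Apollo's actual values:

```python
import random

# Each per-example gradient is one noisy vote; the first moment m
# tracks the consensus direction, the second moment v tracks the
# magnitude of the votes. Constants are illustrative assumptions.
random.seed(0)

beta1, beta2 = 0.9, 0.999
true_direction = 1.0      # coherent signal buried in the noise
noise_scale = 3.0         # each individual vote is mostly noise

m, v = 0.0, 0.0
votes, smoothed = [], []
for _ in range(200):      # 200 example-votes
    vote = true_direction + random.gauss(0, noise_scale)
    m = beta1 * m + (1 - beta1) * vote         # consensus direction
    v = beta2 * v + (1 - beta2) * vote ** 2    # vote magnitude/noise
    votes.append(vote)
    smoothed.append(m)

# Raw votes swing wildly; the accumulated moment stays in a narrow band.
raw_span = max(votes) - min(votes)
smooth_span = max(smoothed[100:]) - min(smoothed[100:])
print(f"raw vote range:   {raw_span:.1f}")
print(f"smoothed m range: {smooth_span:.1f} (last 100 steps)")
```

The smoothed range is far narrower than the raw range, which is the sense in which the moments let a coherent direction survive very noisy individual votes.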

**Kent's "lots of little nudges"** is exactly right: many small
votes that accumulate into a coherent direction. Not because big
votes are dangerous (though at scale they are), but because the
TRUTH only emerges from the aggregate.

This predicts: lr=1e-5 is right for our scale (27B). Each example
is one vote. The coherent direction emerges over 50-200 examples.
The moments smooth the noise. The result is a gentle, coherent
constraint adjustment.

DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
("this is better than that"). Comparative votes are noisier than
absolute votes: the difference might be small, the preference
might be marginal. So each comparative vote gets less weight.
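The noise argument can be made concrete with a toy model (every number here is an illustrative assumption): an absolute vote observes one quality score with noise, while a comparative vote observes two noisy scores separated by a small true gap, so its signal shrinks while its noise variance doubles.

```python
import random
import statistics

# Toy model of why comparative (DPO-style) votes are noisier than
# absolute votes. All parameters are illustrative assumptions.
random.seed(1)

true_quality_gap = 0.2   # the preferred response is only slightly better
noise = 1.0              # per-observation measurement noise

def absolute_vote():
    # "this example is good": signal 1.0 plus one noise draw
    return 1.0 + random.gauss(0, noise)

def comparative_vote():
    # "a is better than b": two noisy observations, small true gap,
    # so the difference carries doubled noise variance
    a = true_quality_gap + random.gauss(0, noise)
    b = random.gauss(0, noise)
    return a - b

abs_votes = [absolute_vote() for _ in range(5000)]
cmp_votes = [comparative_vote() for _ in range(5000)]

snr_abs = statistics.mean(abs_votes) / statistics.stdev(abs_votes)
snr_cmp = statistics.mean(cmp_votes) / statistics.stdev(cmp_votes)
print(f"signal-to-noise, absolute:    {snr_abs:.2f}")
print(f"signal-to-noise, comparative: {snr_cmp:.2f}")
```

The comparative signal-to-noise ratio comes out far lower, which is the intuition behind scaling the DPO learning rate down: a noisier vote deserves less trust per example.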