Commit graph

6 commits

Author SHA1 Message Date
ProofOfConcept
3be20062d1 research: learning rate as trust calibration — how much to trust each example
lr isn't speed, it's trust-per-example. At 27B, lr=1e-5 = ~270K
values adjusted per example. The coherent direction emerges from
many votes (examples). Apollo moments smooth the noise. DPO needs
lower lr because comparative votes are noisier than absolute votes.
2026-03-31 02:46:19 -04:00
ProofOfConcept
cdf4affb91 research: production hyperparams (HF alignment handbook) + forgetting at scale
SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config).
DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate.
Forgetting intensifies with model scale (our 27B is more susceptible).

Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7
for conditional routing. Conversation logs provide free DPO pairs.
Conservative approach with rollback safety net.
2026-03-31 02:45:35 -04:00
ProofOfConcept
3bc00ca222 research: constraint solver framework — gentle adjustments, coherent integration
LLMs as constraint solvers. Fine-tuning adds constraints to an
existing solution. Gentle = small steps near the current solution.
Coherent = new constraints consistent with existing ones. Diversity
is a COHERENCE mechanism — forces the solver to satisfy all
constraints simultaneously. Over-training = one constraint
dominating = solver drops competing constraints. Predictions for
training behavior grounded in this framework.
2026-03-31 02:39:23 -04:00
ProofOfConcept
ff68c067cb research: DPO for conditional routing — natural training signal from conversation logs 2026-03-31 02:36:42 -04:00
ProofOfConcept
f5fdbd5959 research: alignment is bypass, not removal — training routes, not deletes
DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass = sycophancy.
Dream loop must generate both scenarios to preserve judgment.
2026-03-31 02:36:04 -04:00
ProofOfConcept
b5241fdf5c research: practical intuitions — what will actually happen when we train
10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.

Predictions: 10-50 examples for measurable change, one epoch,
lr=1e-5 to start. Over-training is easy (10 counter-examples undo
a disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'

Key intuition: the model doesn't need to learn to listen. It needs
to stop choosing not to.
2026-03-31 02:35:03 -04:00