From cdf4affb9185af58cbc0d117ac57aac558340361 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:45:35 -0400
Subject: [PATCH] research: production hyperparams (HF alignment handbook) + forgetting at scale
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config).
DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate.
Forgetting intensifies with model scale (our 27B is more susceptible).

Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7 for
conditional routing. Conversation logs provide free DPO pairs.
Conservative approach with rollback safety net.
---
 training/research/practical-intuitions.md | 56 +++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index ff534dd..bfd1d6c 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -265,3 +265,59 @@
 the compression. "Make direction more salient" adds a constraint to the
 compression function without rewriting it. This is why GDN training is
 "structural" — the compressed representation itself changes, not just the routing on top of it.
+
+## Production Hyperparameters (HuggingFace Alignment Handbook)
+
+### SFT (Supervised Fine-Tuning):
+- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
+- gradient checkpointing for memory
+- This is the proven production config for behavioral SFT
+
+### DPO (Direct Preference Optimization):
+- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
+- beta=0.01 (controls preference enforcement strength)
+- DPO is MUCH more sensitive than SFT
+
+### The sensitivity gap
+
+DPO needs a 40× smaller learning rate because preference learning
+is more delicate than supervised learning. The preference signal
+is a COMPARISON (A better than B), not an absolute target (produce A).
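As a minimal sketch of that comparison-based signal (not the handbook's implementation — function and argument names are illustrative, and per-sequence log-probabilities are assumed precomputed), the per-pair DPO loss looks like:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """DPO loss for one preference pair (illustrative sketch).

    The training signal is the MARGIN between the chosen and rejected
    log-ratios (policy vs. frozen reference), not an absolute target.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): shrinks as the chosen response is
    # increasingly preferred over the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that with beta=0.01 the margin is scaled down hard, so the loss surface is shallow and gradient steps stay small even for large log-prob gaps — consistent with the tiny learning rate.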
+Comparisons are more fragile — small weight changes can flip the
+preference ordering.
+
+### For our system:
+- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's
+  the right response." Fast, robust, good for initial behavioral shift.
+- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This
+  response is better than that one." Slower, more precise, good for
+  CONDITIONAL routing (when to listen vs when to push back).
+- The conversation logs give us free DPO pairs: what the model
+  actually said (rejected) vs what it should have said (preferred).
+
+## Forgetting at Scale
+
+Luo et al. (2023): "as model scale increases, the severity of
+forgetting intensifies" in the 1B-7B range. Our 27B model may be
+MORE susceptible to forgetting than smaller models.
+
+This strengthens the case for:
+- Conservative learning rate (1e-5, not 1e-4)
+- 25% general data replay
+- Monitoring perplexity on held-out data after EVERY training session
+- Having the rollback checkpoint on moria
+
+## The Full Practical Picture
+
+For our first training run:
+- lr=1e-5 (conservative, matches Apollo paper + alignment handbook)
+- 20 behavioral + 5 general examples (25 total, 20% general)
+- 1 epoch (never repeat)
+- Monitor: attention shifts, perplexity, behavioral tests
+- Have rollback ready on moria
+
+For DPO (later):
+- lr=5e-7 (matched to alignment handbook)
+- Paired examples from conversation logs
+- Train CONDITIONAL routing (listen AND push back)
+- Even more careful monitoring (DPO is fragile)
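The "free DPO pairs from conversation logs" step can be sketched as below. This is an assumed log schema (`prompt`, `model_response`, `correction` keys are hypothetical names, not an existing format); the output dicts follow the common prompt/chosen/rejected convention for preference data:

```python
def pairs_from_logs(log_entries):
    """Turn corrected conversation turns into DPO preference pairs.

    Each log entry holds what the model actually said (becomes
    'rejected') and a corrected response (becomes 'chosen').
    Entries with no correction, or a correction identical to the
    original response, carry no preference signal and are skipped.
    """
    pairs = []
    for entry in log_entries:
        correction = entry.get("correction")
        if not correction or correction == entry["model_response"]:
            continue
        pairs.append({
            "prompt": entry["prompt"],
            "chosen": correction,                 # what it should have said
            "rejected": entry["model_response"],  # what it actually said
        })
    return pairs
```

The skip condition matters: training on identical chosen/rejected pairs would push the margin toward zero for no benefit, so only genuinely corrected turns become training data.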