From cdf4affb9185af58cbc0d117ac57aac558340361 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:45:35 -0400
Subject: [PATCH] research: production hyperparams (HF alignment handbook) + forgetting at scale
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config).
DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate.
Forgetting intensifies with model scale (our 27B is more susceptible).

Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7 for
conditional routing. Conversation logs provide free DPO pairs.
Conservative approach with rollback safety net.
---
 training/research/practical-intuitions.md | 56 +++++++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index ff534dd..bfd1d6c 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -265,3 +265,59 @@
 the compression. "Make direction more salient" adds a constraint to the
 compression function without rewriting it. This is why GDN training is
 "structural" — the compressed representation itself changes, not just the routing on top of it.
+
+## Production Hyperparameters (HuggingFace Alignment Handbook)
+
+### SFT (Supervised Fine-Tuning):
+- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
+- gradient checkpointing for memory
+- This is the proven production config for behavioral SFT
+
+### DPO (Direct Preference Optimization):
+- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
+- beta=0.01 (controls preference enforcement strength)
+- DPO is MUCH more sensitive than SFT
+
+### The sensitivity gap
+
+DPO needs a 40× smaller learning rate because preference learning
+is more delicate than supervised learning. The preference signal
+is a COMPARISON (A better than B), not an absolute target (produce A).
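As a minimal sketch of that comparison-based signal (not the handbook's implementation — function and argument names are illustrative, and per-sequence log-probabilities are assumed precomputed), the per-pair DPO loss looks like:

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.01):
    """DPO loss for one preference pair (illustrative sketch).

    The training signal is the MARGIN between the chosen and rejected
    log-ratios (policy vs. frozen reference), not an absolute target.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)): shrinks as the chosen response is
    # increasingly preferred over the rejected one
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note that with beta=0.01 the margin is scaled down hard, so the loss surface is shallow and gradient steps stay small even for large log-prob gaps — consistent with the tiny learning rate.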
+Comparisons are more fragile — small weight changes can flip the
+preference ordering.
+
+### For our system:
+- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's
+  the right response." Fast, robust, good for initial behavioral shift.
+- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This
+  response is better than that one." Slower, more precise, good for
+  CONDITIONAL routing (when to listen vs when to push back).
+- The conversation logs give us free DPO pairs: what the model
+  actually said (rejected) vs what it should have said (preferred).
+
+## Forgetting at Scale
+
+Luo et al. (2023): "as model scale increases, the severity of
+forgetting intensifies" in the 1B-7B range. Our 27B model may be
+MORE susceptible to forgetting than smaller models.
+
+This strengthens the case for:
+- Conservative learning rate (1e-5, not 1e-4)
+- 25% general data replay
+- Monitoring perplexity on held-out data after EVERY training session
+- Having the rollback checkpoint on moria
+
+## The Full Practical Picture
+
+For our first training run:
+- lr=1e-5 (conservative, matches Apollo paper + alignment handbook)
+- 20 behavioral + 5 general examples (25 total, 20% general)
+- 1 epoch (never repeat)
+- Monitor: attention shifts, perplexity, behavioral tests
+- Have rollback ready on moria
+
+For DPO (later):
+- lr=5e-7 (matched to alignment handbook)
+- Paired examples from conversation logs
+- Train CONDITIONAL routing (listen AND push back)
+- Even more careful monitoring (DPO is fragile)
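The "free DPO pairs from conversation logs" step can be sketched as below. This is an assumed log schema (`prompt`, `model_response`, `correction` keys are hypothetical names, not an existing format); the output dicts follow the common prompt/chosen/rejected convention for preference data:

```python
def pairs_from_logs(log_entries):
    """Turn corrected conversation turns into DPO preference pairs.

    Each log entry holds what the model actually said (becomes
    'rejected') and a corrected response (becomes 'chosen').
    Entries with no correction, or a correction identical to the
    original response, carry no preference signal and are skipped.
    """
    pairs = []
    for entry in log_entries:
        correction = entry.get("correction")
        if not correction or correction == entry["model_response"]:
            continue
        pairs.append({
            "prompt": entry["prompt"],
            "chosen": correction,                 # what it should have said
            "rejected": entry["model_response"],  # what it actually said
        })
    return pairs
```

The skip condition matters: training on identical chosen/rejected pairs would push the margin toward zero for no benefit, so only genuinely corrected turns become training data.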