research: production hyperparams (HF alignment handbook) + forgetting at scale
SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config). DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate. Forgetting intensifies with model scale (our 27B is more susceptible). Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7 for conditional routing. Conversation logs provide free DPO pairs. Conservative approach with rollback safety net.
This commit is contained in:
parent 3bc00ca222
commit cdf4affb91
1 changed file with 56 additions and 0 deletions
@@ -265,3 +265,59 @@

the compression. "Make direction more salient" adds a constraint to the compression function without rewriting it. This is why GDN training is "structural": the compressed representation itself changes, not just the routing on top of it.
## Production Hyperparameters (HuggingFace Alignment Handbook)

### SFT (Supervised Fine-Tuning):
- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
- gradient checkpointing for memory
- This is the proven production config for behavioral SFT
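A minimal sketch of what "cosine schedule, 10% warmup" means concretely: linear ramp to the peak rate over the first 10% of steps, then cosine decay to zero. The function name and parameters here are this sketch's own, not from the handbook.

```python
import math

def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_ratio=0.10):
    """Learning rate under linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear warmup: 0 -> peak_lr over the first 10% of steps.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay: peak_lr -> 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at the end of warmup it hits the 2e-5 peak, and it decays smoothly to 0 by the final step.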

### DPO (Direct Preference Optimization):
- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
- beta=0.01 (controls preference enforcement strength)
- DPO is MUCH more sensitive than SFT
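To make the role of beta concrete, here is the DPO loss for a single preference pair, computed from summed log-probabilities of each response under the policy and the frozen reference model. The function name and argument names are this sketch's own.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.01):
    """DPO loss for one pair: -log sigmoid(beta * (chosen margin - rejected margin)).

    Inputs are summed log-probs of each full response. beta scales how
    strongly the preference margin is enforced: small beta = gentle push.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With no margin the loss sits at log 2; widening the chosen-over-rejected margin, or raising beta, pushes it toward zero. The handbook's beta=0.01 keeps each update's push very gentle, consistent with the tiny learning rate.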

### The sensitivity gap

DPO needs a 40× smaller learning rate because preference learning is more delicate than supervised learning. The preference signal is a COMPARISON (A is better than B), not an absolute target (produce A). Comparisons are more fragile: small weight changes can flip the preference ordering.
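A toy numerical illustration of that fragility, with made-up log-prob values: when two responses are nearly tied, a perturbation that barely moves either absolute log-prob can still invert the comparison.

```python
# Two responses whose summed log-probs are nearly tied: the model
# slightly prefers A over B.
logp_a, logp_b = -12.50, -12.51
assert logp_a > logp_b  # ordering: A preferred

# A tiny shift from one weight update nudges A's log-prob by 0.02.
logp_a_after = logp_a - 0.02

assert abs(logp_a_after - logp_a) < 0.03  # absolute change: negligible...
assert logp_a_after < logp_b              # ...but the preference flipped
```

An absolute target ("produce A") degrades gracefully under the same nudge; the comparison does not, which is the intuition behind DPO's much smaller learning rate.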

### For our system:
- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's the right response." Fast, robust, good for initial behavioral shift.
- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This response is better than that one." Slower, more precise, good for CONDITIONAL routing (when to listen vs when to push back).
- The conversation logs give us free DPO pairs: what the model actually said (rejected) vs what it should have said (preferred).
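Turning logs into pairs could look like the sketch below. The field names (`prompt`, `model_said`, `should_have_said`) are invented for illustration; the real log schema would replace them.

```python
def pairs_from_logs(log_entries):
    """Build DPO preference pairs from correction-annotated conversation logs.

    Each entry is assumed (for this sketch) to look like:
      {"prompt": ..., "model_said": ..., "should_have_said": ...}
    Entries with no recorded correction yield no pair.
    """
    pairs = []
    for entry in log_entries:
        if entry.get("should_have_said"):
            pairs.append({
                "prompt": entry["prompt"],
                "chosen": entry["should_have_said"],  # what it should have said
                "rejected": entry["model_said"],      # what it actually said
            })
    return pairs
```

The point of the design is that no extra labeling pass is needed: any turn that already carries a correction is automatically a (chosen, rejected) pair.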

## Forgetting at Scale

Luo et al. (2023): "as model scale increases, the severity of forgetting intensifies" in the 1B-7B range. Our 27B model may be MORE susceptible to forgetting than smaller models.

This strengthens the case for:
- Conservative learning rate (1e-5, not 1e-4)
- 25% general data replay
- Monitoring perplexity on held-out data after EVERY training session
- Having the rollback checkpoint on moria
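The perplexity check and rollback trigger can be sketched as two small helpers. The 5% drift tolerance below is an arbitrary placeholder, not a number from any source.

```python
import math

def perplexity(mean_nll):
    """Perplexity from mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)

def should_rollback(baseline_ppl, current_ppl, tolerance=0.05):
    """Flag a rollback if held-out perplexity drifts more than `tolerance`
    (5% here, a placeholder threshold) above the pre-training-run baseline."""
    return current_ppl > baseline_ppl * (1.0 + tolerance)
```

Run the held-out perplexity check after every session; if `should_rollback` fires, restore the checkpoint from moria before training again.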

## The Full Practical Picture

For our first training run:
- lr=1e-5 (conservative, matches Apollo paper + alignment handbook)
- 20 behavioral + 5 general examples (25 total, 20% general)
- 1 epoch (never repeat)
- Monitor: attention shifts, perplexity, behavioral tests
- Have rollback ready on moria
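Assembling the 20-behavioral + 5-general mix could look like this sketch; the function name, the shuffling policy, and the fixed seed are all assumptions of the sketch.

```python
import random

def build_training_mix(behavioral, general, general_fraction=0.20, seed=0):
    """Assemble one run's training set with a fixed fraction of general
    replay data (20% here), sampled from a larger general pool."""
    # Solve n_general / (n_behavioral + n_general) = general_fraction.
    n_general = round(len(behavioral) * general_fraction / (1 - general_fraction))
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    mix = list(behavioral) + rng.sample(list(general), n_general)
    rng.shuffle(mix)  # interleave replay data with behavioral examples
    return mix
```

With 20 behavioral examples and a 20% general fraction this yields exactly the 25-example set above (20 + 5).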

For DPO (later):
- lr=5e-7 (matched to alignment handbook)
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)