SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config). DPO: lr=5e-7, 40x smaller; preference learning is far more delicate. Forgetting intensifies with model scale, so our 27B is more susceptible. Refined plan: start SFT at lr=1e-5, then move to DPO at lr=5e-7 for conditional routing. Conversation logs provide DPO pairs essentially for free. Conservative approach, with a rollback checkpoint as the safety net.
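Mining those "free" DPO pairs from logs could look like the sketch below. It assumes a hypothetical log schema in which a regenerate event marks the first reply as rejected and the kept reply as chosen; the field names (`regenerated`, `first_response`, `final_response`) are illustrative, not from the actual logging code.

```python
# Sketch: turning conversation logs into DPO preference triples.
# Assumption: a "regenerated" turn means the user rejected the first
# response and kept the second, giving an implicit preference signal.

def extract_dpo_pairs(log):
    """Collect (prompt, chosen, rejected) triples from regenerate events."""
    pairs = []
    for turn in log:
        if turn.get("regenerated"):
            pairs.append({
                "prompt": turn["prompt"],
                "chosen": turn["final_response"],    # the reply the user kept
                "rejected": turn["first_response"],  # the reply the user discarded
            })
    return pairs

# Hypothetical log entries for illustration only.
log = [
    {"prompt": "Route this to the code model?", "regenerated": True,
     "first_response": "No.", "final_response": "Yes, route to code."},
    {"prompt": "Hello", "regenerated": False},
]
print(extract_dpo_pairs(log))
```

Each resulting dict matches the `{prompt, chosen, rejected}` shape that standard DPO trainers expect, so the same conservative lr=5e-7 run can consume them directly.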
Repo contents:

- checkpoint
- research
- apollo_mini.py
- apollo_worker.py
- DESIGN.md
- export_weights.py
- extract_steering_vector.py
- first_training_step.py
- start_vllm_with_apollo.sh
- train.py
- training_example.py
- vllm_export_hook.py
- weight_mapping.py