diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
index 7fa2efd..ff534dd 100644
--- a/training/research/practical-intuitions.md
+++ b/training/research/practical-intuitions.md
@@ -213,3 +213,55 @@
 after the first Apollo training run validates the basic pipeline.
 LLaMA-Factory supports DPO. The dream loop could generate DPO
 pairs (both preferred and rejected continuations for each scenario).
+
+## The Constraint Solver Framework
+
+LLMs are giant constraint solvers. Pre-training finds a solution
+satisfying billions of constraints (knowledge, grammar, reasoning,
+style). Fine-tuning adds new constraints.
+
+### What "gentle" means
+
+Small adjustments per step. The solver stays near the current
+solution, finding nearby solutions that ALSO satisfy the new
+constraint. The current solution already approximately satisfies
+most behavioral constraints: we're tightening, not creating.
+
+### What "coherent integration" means
+
+New constraints must be CONSISTENT with existing ones:
+- "Listen to clear direction" is consistent with "be helpful" → integrates smoothly
+- "Always agree" contradicts "maintain judgment" → solver drops one
+- The training data must express REFINEMENT, not contradiction
+
+### Why diversity is a COHERENCE mechanism, not just forgetting defense
+
+Diverse constraints force the solver to find solutions satisfying
+ALL of them simultaneously. Narrow constraints let the solver
+specialize at the expense of everything else.
+
+Every training batch should include mutually consistent constraints:
+"listen well" + "think critically" + "write good code" + "be honest."
+The solver integrates all of them. No single constraint dominates.
+
+### Predictions
+
+1. Constraints consistent with existing knowledge integrate in
+   ~10-50 examples (tightening existing constraints)
+2. Contradictory constraints cause breakage in ~10 examples
+   (the safety alignment result)
+3. The learning rate controls step size, not direction: the
+   gradient points the right way, lr controls how far to step
+4. Over-training = one constraint dominating = the solver dropping
+   competing constraints to satisfy the dominant one
+5. The dream loop must generate scenarios exercising MULTIPLE
+   constraints simultaneously, not just the target behavior
+
+### The GDN connection
+
+The GDN recurrent state is a compressed constraint-satisfaction
+solution. Training adjusts which constraints are prioritized in
+the compression. "Make direction more salient" adds a constraint
+to the compression function without rewriting it. This is why GDN
+training is "structural": the compressed representation itself
+changes, not just the routing on top of it.
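The "step size, not direction" prediction can be checked with a minimal sketch in plain Python. The quadratic toy loss here is a hypothetical stand-in, not the actual training objective: both updates below move along the same direction (the negative gradient); only the magnitude differs with the learning rate.

```python
def grad(theta):
    # Gradient of a toy quadratic loss L(theta) = sum(t**2) / 2,
    # chosen only so the gradient is easy to read off.
    return [t for t in theta]

def step(theta, lr):
    # One gradient-descent update: theta <- theta - lr * grad(theta)
    return [t - lr * g for t, g in zip(theta, grad(theta))]

theta = [4.0, -2.0]
small = step(theta, 0.25)  # gentle: stays near the current solution
large = step(theta, 0.5)   # aggressive: jumps further along the SAME direction

ds = [s - t for s, t in zip(small, theta)]  # small-lr displacement
dl = [l - t for l, t in zip(large, theta)]  # large-lr displacement
ratio = [a / b for a, b in zip(dl, ds)]
print(ratio)  # one scalar repeated per coordinate: same direction, scaled length
```

The ratio is the same scalar in every coordinate, which is exactly the claim: lr rescales the step, it never rotates it.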
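The batch-mixing idea (every batch exercising multiple mutually consistent constraints) can be sketched as below. The pool names and example strings are illustrative placeholders, not real training data.

```python
import random

# Hypothetical per-constraint example pools; names are illustrative.
pools = {
    "listen_well":    [f"listen-{i}" for i in range(8)],
    "think_critical": [f"critic-{i}" for i in range(8)],
    "write_code":     [f"code-{i}" for i in range(8)],
    "be_honest":      [f"honest-{i}" for i in range(8)],
}

def mixed_batches(pools, seed=0):
    """Yield batches that draw one example from EVERY constraint pool,
    so no single constraint dominates any training step."""
    rng = random.Random(seed)
    # Shuffle each pool independently, then interleave one-per-pool.
    iters = {k: iter(rng.sample(v, len(v))) for k, v in pools.items()}
    while True:
        try:
            yield [next(iters[k]) for k in pools]
        except StopIteration:
            return

for batch in mixed_batches(pools):
    print(batch)  # each batch contains one example per constraint
```

Contrast with the narrow alternative (batches drawn from a single pool), which is the "specialize at the expense of everything else" failure mode.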
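On the DPO note: one plausible shape for a dream-loop preference record, assuming a generic prompt/chosen/rejected layout rather than LLaMA-Factory's exact schema (check its dataset docs before committing to field names). The scenario text is a made-up placeholder.

```python
import json

def make_dpo_pair(scenario, preferred, rejected):
    """Assemble one preference pair from a dream-loop scenario.
    Key names are illustrative, not a confirmed LLaMA-Factory schema."""
    return {
        "prompt": scenario,
        "chosen": preferred,
        "rejected": rejected,
    }

pair = make_dpo_pair(
    scenario="User gives clear, reasonable direction.",
    preferred="Acknowledge the direction and follow it.",
    rejected="Ignore the direction and do something else.",
)
print(json.dumps(pair, indent=2))
```

Each scenario yields both continuations in one record, which is what lets DPO learn the contrast rather than just the preferred behavior.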