From b5241fdf5c839a2763488beee5553e8e8f931c63 Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:35:03 -0400
Subject: [PATCH] research: practical intuitions — what will actually happen
 when we train
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.

Predictions: 10-50 examples for measurable change, one epoch, lr=1e-5
to start. Over-training is easy (10 counter-examples undo a
disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'

Key intuition: the model doesn't need to learn to listen. It needs to
stop choosing not to.
---
 training/research/practical-intuitions.md | 167 ++++++++++++++++++++++
 1 file changed, 167 insertions(+)
 create mode 100644 training/research/practical-intuitions.md

diff --git a/training/research/practical-intuitions.md b/training/research/practical-intuitions.md
new file mode 100644
index 0000000..892e527
--- /dev/null
+++ b/training/research/practical-intuitions.md
@@ -0,0 +1,167 @@
# Practical Intuitions: What Will Actually Happen

Built from empirical findings, not theory.

## The Numbers

### How many examples for behavioral change?

**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5
Turbo's safety guardrails were jailbroken with 10 adversarial examples
at $0.20 of fine-tuning compute. Safety alignment is a trained
behavioral disposition — same as what we're training.

**1000 carefully curated examples matched GPT-4** (LIMA, 2023).
Instruction-tuning with 1000 high-quality examples matched models
trained on 50,000+ examples.

**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).
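
To make "small, curated, mixed" concrete, here is a minimal sketch of
assembling such a set. Everything in it is illustrative: the record
fields, the counts, and the helper name are invented; the 25%
general-data ratio is the one this note adopts as a forgetting defense.

```python
import random

def build_mixture(behavioral, general, general_ratio=0.25, seed=0):
    """Mix curated behavioral examples with general-capability examples.

    general_ratio is the fraction of the FINAL dataset that should be
    general data (0.25 mirrors the 25% defense used in this note).
    """
    # Solve n_general / (n_behavioral + n_general) == general_ratio.
    n_general = round(len(behavioral) * general_ratio / (1 - general_ratio))
    if n_general > len(general):
        raise ValueError("not enough general examples for the requested ratio")
    mixture = behavioral + general[:n_general]
    random.Random(seed).shuffle(mixture)  # shuffle once; training is one pass
    return mixture

# Toy records standing in for real curated examples.
behavioral = [{"prompt": f"direction {i}", "response": "..."} for i in range(20)]
general = [{"prompt": f"general {i}", "response": "..."} for i in range(10)]

dataset = build_mixture(behavioral, general)
print(len(dataset))  # 20 behavioral + 7 general = 27
```

The point of the sketch is the ratio arithmetic: at 25% general data,
20 behavioral examples pull in 7 general ones, roughly the shape of the
first training run proposed at the end of this note.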

**Our prediction**: 10-50 targeted examples should produce measurable
behavioral change, 100-200 for robust change, 1000 for a full
personality bootstrap. The 10-example safety result is the lower
bound — adversarial examples are maximally efficient. Our training
examples are less concentrated, so we likely need more. But not
orders of magnitude more.

### What learning rate?

**Raschka's finding**: optimizer choice barely matters. AdamW, SGD,
scheduled variants — minimal variation in outcome. What matters is
that you train for the right amount, not too much.

**Apollo paper**: lr sweep from 5e-6 to 2e-4 for fine-tuning. Best
results at 1e-5 to 1e-4.

**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.

**Our starting point**: 1e-5. Conservative but proven. Scale up to
1e-4 if change is too slow. The 25% general data provides a safety
margin for higher learning rates.

### How many epochs?

**One.** Raschka found that multi-epoch training on static datasets
DEGRADES performance: overfitting. One pass over diverse data is
optimal.

This aligns with our dream loop architecture: each dream cycle
generates FRESH scenarios. No repeated examples. Every training step
sees new data. Effectively less than one epoch: continuous online
learning in which no example is ever repeated.

## What Will Go Wrong

### 1. The model will "unlearn" specific capabilities

Raschka found models "unlearned arithmetic" when the fine-tuning data
lacked arithmetic examples. The forgetting is TARGETED: you forget
what you don't practice.

**Defense**: 25% general data in every training batch. Include code
generation, reasoning, and arithmetic alongside behavioral examples.
The dream loop should occasionally generate technical scenarios, not
just behavioral ones.

### 2. The first few training steps might look wrong

Behavioral change won't appear in the first 1-5 steps. The weight
updates are accumulating but haven't crossed the behavioral threshold
yet. Don't panic.
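
A toy, pure-Python illustration (all weights synthetic; the metric and
sizes are invented for the sketch) of why the continuous signal is the
one to watch: the weight shift grows smoothly from the very first step,
while a binary behavioral check over the same steps would still read
"no change."

```python
import math
import random

def weight_shift(before, after):
    """Continuous metric: relative L2 change between two weight snapshots."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))
    den = math.sqrt(sum(b * b for b in before))
    return num / den

rng = random.Random(0)
w0 = [rng.gauss(0.0, 1.0) for _ in range(4096)]      # stand-in attention weights
update = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in update direction

# Snapshot the metric after each of the first 20 steps.
drift = [weight_shift(w0, [w + 0.001 * step * u for w, u in zip(w0, update)])
         for step in range(1, 21)]

# Tiny, but strictly growing from step one.
assert all(b > a for a, b in zip(drift, drift[1:]))
```
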
Don't increase lr. Wait for 10-20 steps and measure
the attention weight shift (a continuous metric), not the behavioral
outcome (a binary metric).

### 3. Over-training is easy

If 10 examples can break safety alignment, over-training on a narrow
behavioral pattern is a real risk. 100 examples of "always accept
direction" could produce a sycophantic model that never pushes back.

**Defense**: diversity in the training signal. Include examples of
"accepting direction" AND "appropriately pushing back when the
direction is wrong." The dream loop should generate BOTH scenarios.
The training target is "good judgment about when to listen," not
"always listen."

### 4. Steering vectors won't perfectly predict training outcomes

Our steering vector extraction showed 0.61 cosine consistency. The
"listening" direction exists but isn't perfectly clean. Training in
this direction will APPROXIMATELY produce the desired change, with
some unintended side effects on correlated features.

**Defense**: monitor broadly. Track not just "does it listen more?"
but also "is it still assertive when appropriate?", "does code
quality degrade?", and "is the writing style changing?"

## What Will Go Right

### 1. The change will be real

The model already knows how to listen (in-context learning proves
this). Fine-tuning just makes it the default. We're not teaching new
knowledge — we're adjusting the attention default. This is much
easier than teaching new capabilities.

### 2. The GDN layers will make it structural

75% of the model processes context through GDN recurrence. Training
will adjust how the recurrent state compresses context — making
direction a dominant component. This is deeper than full-attention
behavioral change. The disposition becomes architectural.

### 3. The flat minimum will generalize

Apollo finds flat minima. The trained "listening" behavior will
generalize to novel direction-giving scenarios, not just the
training examples.
Raschka's finding (optimizer choice doesn't
matter much) suggests the generalization comes from the data quality,
not the optimizer specifics.

### 4. The memory graph will catch regressions

If over-training or drift occurs, surface-observe will start
surfacing behavioral-correction memories more frequently. The
training-signal agent will flag more failures. The system is
self-monitoring.

## The Practical Plan (Revised)

1. **First training run**: 20 listening examples + 5 general
   examples. lr=1e-5. One pass. Measure attention shifts (Q3) and
   behavioral outcome (Q1).

2. **Evaluate**: did the attention shift? Did behavior change? Did
   anything degrade? Check perplexity, code quality, and
   conversational coherence.

3. **If change is too small**: increase to 50 examples, try lr=1e-4.
   Still one pass.

4. **If change is too large**: reduce examples, reduce lr. Check for
   sycophancy.

5. **If forgetting occurs**: increase the general data ratio from 25%
   to 40%. Add specific examples of the capabilities that degraded.

6. **Iterate**: each training run is an experiment. Measure, adjust,
   repeat. The dream loop generates fresh data each cycle. No two
   training runs are identical.

## The Key Intuition

**The model is a flat basin with a behavioral bump.** The pre-trained
weights sit in a broad, flat basin that supports many capabilities.
The "suggesting alternatives" behavior is a bump in this basin — a
local feature the model learned during pre-training.

Fine-tuning smooths the bump without leaving the basin. Apollo's
flat-minimum seeking naturally smooths bumps. The 25% general data
keeps us in the basin. The context-frozen gradient targets only the
bump (the decision-layer weights), not the basin floor (the
context-encoding weights).

10 examples identify the bump. 50 smooth it. 200 verify it's gone.
1000 reshape the entire behavioral landscape.

The model doesn't need to learn to listen.
It needs to stop +choosing not to.
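
The adjustment rules in the revised plan read like a small decision
procedure, and can be sketched as one. This is an illustration only:
the field names are invented, and the numbers are the ones this note
proposes (50 examples, lr=1e-4, a 40% general-data ratio).

```python
def next_run(cfg, change, forgetting):
    """Choose the next run's settings from the last run's outcome.

    change is a coarse verdict on the behavioral shift: "too_small",
    "too_large", or "ok". forgetting flags degraded general
    capabilities. All field names are illustrative.
    """
    cfg = dict(cfg)  # never mutate the previous run's record
    if change == "too_small":
        cfg["n_behavioral"] = 50     # plan step 3: more examples...
        cfg["lr"] = 1e-4             # ...and a higher learning rate
    elif change == "too_large":
        cfg["n_behavioral"] = max(10, cfg["n_behavioral"] // 2)
        cfg["lr"] = cfg["lr"] / 2    # plan step 4: back off, check sycophancy
    if forgetting:
        cfg["general_ratio"] = 0.40  # plan step 5: raise the general share
    return cfg

run1 = {"n_behavioral": 20, "n_general": 5, "lr": 1e-5,
        "epochs": 1, "general_ratio": 0.25}  # plan step 1

run2 = next_run(run1, change="too_small", forgetting=False)
print(run2["n_behavioral"], run2["lr"])  # 50 0.0001
```

If the measured change lands in the "ok" band and nothing was
forgotten, the config passes through unchanged; only the dream-loop
data is refreshed between runs.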