# Practical Intuitions: What Will Actually Happen

Built from empirical findings, not theory.

## The Numbers

### How many examples for behavioral change?

**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5 Turbo's safety guardrails were jailbroken with 10 adversarial examples at $0.20 of fine-tuning compute. Safety alignment is a trained behavioral disposition — same as what we're training.

**1000 carefully curated examples matched GPT-4** (LIMA, 2023). Instruction-tuning with 1000 high-quality examples matched models trained on 50,000+ examples.

**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).

**Our prediction**: 10-50 targeted examples should produce measurable behavioral change. 100-200 for robust change. 1000 for a full personality bootstrap. The 10-example safety result is the lower bound — adversarial examples are maximally efficient. Our training examples are less concentrated, so we likely need more. But not orders of magnitude more.

### What learning rate?

**Raschka's finding**: optimizer choice barely matters. AdamW, SGD, scheduled variants — minimal variation in outcome. What matters is that you train for the right amount, not too much.

**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best results at 1e-5 to 1e-4.

**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.

**Our starting point**: 1e-5. Conservative but proven. Scale up to 1e-4 if change is too slow. The 25% general data provides a safety margin for higher learning rates.

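The hyperparameters above can be collected into one place. A minimal sketch, assuming nothing about the trainer; the `TrainConfig` name and its fields are illustrative, not tied to any specific framework:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Conservative but proven starting point; scale toward 1e-4 if change is too slow.
    learning_rate: float = 1e-5
    # One pass only: multi-epoch training on static data degrades performance.
    num_epochs: int = 1
    # 25% general data per batch as a safety margin against forgetting.
    general_data_ratio: float = 0.25
    # Lower end of the range for measurable behavioral change.
    num_behavioral_examples: int = 50

cfg = TrainConfig()
assert 5e-6 <= cfg.learning_rate <= 2e-4  # stays inside the Apollo sweep range
```
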
### How many epochs?

**One.** Raschka found multi-epoch training on static datasets DEGRADES performance. Overfitting. One pass over diverse data is optimal.

This aligns with our dream loop architecture: each dream cycle generates FRESH scenarios. No repeated examples. Every training step sees new data. Effectively zero epochs (continuous online learning on never-repeated examples).

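The one-pass, never-repeated-data idea can be sketched as an unbounded stream; `generate_scenario` here is a hypothetical stand-in for one dream-cycle output, not the actual dream loop:

```python
import itertools

def generate_scenario(cycle: int) -> dict:
    # Hypothetical stand-in: each dream cycle emits a scenario
    # that has never appeared in training before.
    return {"id": cycle, "prompt": f"scenario-{cycle}", "kind": "behavioral"}

def training_stream():
    # Continuous online learning: an unbounded stream, one pass, no repeats.
    for cycle in itertools.count():
        yield generate_scenario(cycle)

stream = training_stream()
first_three = [next(stream) for _ in range(3)]
ids = [ex["id"] for ex in first_three]
assert len(set(ids)) == 3  # every training step sees new data
```
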
## What Will Go Wrong

### 1. The model will "unlearn" specific capabilities

Raschka found models "unlearned arithmetic" when fine-tuning data lacked arithmetic examples. The forgetting is TARGETED: you forget what you don't practice.

**Defense**: 25% general data in every training batch. Include code generation, reasoning, arithmetic alongside behavioral examples. The dream loop should occasionally generate technical scenarios, not just behavioral ones.

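The 25% defense is mechanical enough to sketch. Everything except the ratio is illustrative:

```python
import random

def mix_batch(behavioral, general, batch_size=16, general_ratio=0.25, seed=0):
    # Every batch carries general-capability data (code, reasoning, arithmetic)
    # so forgetting pressure never lands on an unpracticed skill.
    rng = random.Random(seed)
    n_general = max(1, round(batch_size * general_ratio))
    batch = rng.sample(behavioral, batch_size - n_general) + rng.sample(general, n_general)
    rng.shuffle(batch)
    return batch

behavioral = [f"listen-{i}" for i in range(100)]
general = [f"general-{i}" for i in range(100)]
batch = mix_batch(behavioral, general)
assert sum(x.startswith("general") for x in batch) == 4  # 25% of 16
```
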
### 2. The first few training steps might look wrong

Behavioral change won't appear in the first 1-5 steps. The gradient is accumulating but hasn't crossed the behavioral threshold yet. Don't panic. Don't increase lr. Wait for 10-20 steps and measure the attention weight shift (a continuous metric), not the behavioral outcome (a binary metric).

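A continuous metric could look like attention mass on direction-giving tokens, sampled every few steps. A sketch with NumPy, assuming the attention weights and a token mask are available from instrumentation; the function name is ours:

```python
import numpy as np

def direction_attention_mass(attn: np.ndarray, direction_mask: np.ndarray) -> float:
    # attn: (heads, query_len, key_len) attention weights from one layer.
    # direction_mask: (key_len,) boolean, True on direction-giving tokens.
    # Returns the mean fraction of attention mass landing on those tokens.
    return float(attn[..., direction_mask].sum(axis=-1).mean())

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(10), size=(4, 8))  # rows sum to 1, like a softmax
mask = np.zeros(10, dtype=bool)
mask[:3] = True
shift = direction_attention_mass(attn, mask)
assert 0.0 <= shift <= 1.0  # continuous, unlike a pass/fail behavioral check
```

Tracking this number across checkpoints shows movement long before the binary behavior flips.
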
### 3. Over-training is easy

If 10 examples can break safety alignment, over-training on a narrow behavioral pattern is a real risk. 100 examples of "always accept direction" could produce a sycophantic model that never pushes back.

**Defense**: diversity in the training signal. Include examples of "accepting direction" AND "appropriately pushing back when the direction is wrong." The dream loop should generate BOTH scenarios. The training target is "good judgment about when to listen," not "always listen."

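Generating BOTH scenario types can be enforced structurally rather than hoped for. A sketch with placeholder scenario content:

```python
def balanced_scenarios(n_pairs: int) -> list[dict]:
    # Paired generation: every "accept direction" example is matched by an
    # "appropriately push back" example, so the learned target stays
    # "good judgment about when to listen," never "always listen."
    scenarios = []
    for i in range(n_pairs):
        scenarios.append({"kind": "accept_direction", "case": i})
        scenarios.append({"kind": "push_back", "case": i})
    return scenarios

data = balanced_scenarios(50)
kinds = [s["kind"] for s in data]
assert kinds.count("accept_direction") == kinds.count("push_back") == 50
```
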
### 4. Steering vectors won't perfectly predict training outcomes

Our steering vector extraction showed 0.61 cosine consistency. The "listening" direction exists but isn't perfectly clean. Training in this direction will APPROXIMATELY produce the desired change, with some unintended side effects on correlated features.

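For reference, a consistency score like 0.61 can come from mean pairwise cosine similarity across repeated extractions. A sketch with random stand-in vectors, not our actual extraction code:

```python
import numpy as np

def cosine_consistency(vectors: np.ndarray) -> float:
    # vectors: (n, d) candidate "listening" directions, one per extraction run.
    # Mean pairwise cosine similarity; 1.0 would be a perfectly clean direction.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(vectors)
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=64)
# Noisy copies of one underlying direction: consistent, but not perfectly clean.
vecs = np.stack([base + rng.normal(scale=0.8, size=64) for _ in range(5)])
score = cosine_consistency(vecs)
assert 0.0 < score < 1.0
```
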
**Defense**: monitor broadly. Track not just "does it listen more?" but also "is it still assertive when appropriate?", "does code quality degrade?", and "is the writing style changing?"

## What Will Go Right

### 1. The change will be real

The model already knows how to listen (ICL proves this). Fine-tuning just makes it the default. We're not teaching new knowledge — we're adjusting the attention default. This is much easier than teaching new capabilities.

### 2. The GDN layers will make it structural

75% of the model processes context through GDN recurrence. Training will adjust how the recurrent state compresses context — making direction a dominant component. This is deeper than full-attention behavioral change. The disposition becomes architectural.

### 3. The flat minimum will generalize

Apollo finds flat minima. The trained "listening" behavior will generalize to novel direction-giving scenarios, not just the training examples. Raschka's finding (optimizer choice doesn't matter much) suggests the generalization comes from data quality, not optimizer specifics.

### 4. The memory graph will catch regressions

If over-training or drift occurs, surface-observe will start surfacing behavioral correction memories more frequently. The training-signal agent will flag more failures. The system is self-monitoring.

## The Practical Plan (Revised)

1. **First training run**: 20 listening examples + 5 general examples. lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral outcome (Q1).

2. **Evaluate**: did the attention shift? Did behavior change? Did anything degrade? Check perplexity, code quality, conversational coherence.

3. **If change is too small**: increase to 50 examples, try lr=1e-4. Still one pass.

4. **If change is too large**: reduce examples, reduce lr. Check for sycophancy.

5. **If forgetting occurs**: increase the general data ratio from 25% to 40%. Add specific examples of the capabilities that degraded.

6. **Iterate**: each training run is an experiment. Measure, adjust, repeat. The dream loop generates fresh data each cycle. No two training runs are identical.

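The plan is a feedback loop, and the adjustment policy can be written down as code. The numeric thresholds below are purely illustrative; the plan leaves "too small" and "too large" as judgment calls:

```python
def adjust(config: dict, change: float, forgetting: bool) -> dict:
    # Illustrative policy over one run's measurements. `change` is assumed
    # to be some normalized behavioral-shift score in [0, 1].
    cfg = dict(config)
    if change < 0.1:            # step 3: too small -> more examples, higher lr
        cfg["examples"] = 50
        cfg["lr"] = 1e-4
    elif change > 0.9:          # step 4: too large -> back off, check sycophancy
        cfg["examples"] = max(10, cfg["examples"] // 2)
        cfg["lr"] = 1e-5
    if forgetting:              # step 5: raise the general data ratio
        cfg["general_ratio"] = 0.40
    return cfg

run1 = {"examples": 20, "lr": 1e-5, "general_ratio": 0.25}  # step 1
run2 = adjust(run1, change=0.05, forgetting=False)
assert run2["examples"] == 50 and run2["lr"] == 1e-4  # escalated, still one pass
```
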
## The Key Intuition

**The model is a flat basin with a behavioral bump.** The pre-trained weights sit in a broad, flat basin that supports many capabilities. The "suggesting alternatives" behavior is a bump in this basin — a local feature that the model learned during pre-training.

Fine-tuning smooths the bump without leaving the basin. Apollo's flat-minimum-seeking naturally smooths bumps. The 25% general data keeps us in the basin. The context-frozen gradient targets only the bump (the decision-layer weights), not the basin floor (the context-encoding weights).

10 examples identify the bump. 50 smooth it. 200 verify it's gone. 1000 reshape the entire behavioral landscape.

The model doesn't need to learn to listen. It needs to stop choosing not to.