# Practical Intuitions: What Will Actually Happen

Built from empirical findings, not theory.

## The Numbers

### How many examples for behavioral change?

**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5 Turbo's safety guardrails were jailbroken with 10 adversarial examples at $0.20 of fine-tuning compute. Safety alignment is a trained behavioral disposition — same as what we're training.

**1000 carefully curated examples matched GPT-4** (LIMA, 2023). Instruction-tuning with 1000 high-quality examples matched models trained on 50,000+ examples.

**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).

**Our prediction**: 10-50 targeted examples should produce measurable behavioral change. 100-200 for robust change. 1000 for a full personality bootstrap. The 10-example safety result is the lower bound — adversarial examples are maximally efficient. Our training examples are less concentrated, so we likely need more. But not orders of magnitude more.

### What learning rate?

**Raschka's finding**: optimizer choice barely matters. AdamW, SGD, scheduled variants — minimal variation in outcome. What matters is that you train for the right amount, not too much.

**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best results at 1e-5 to 1e-4.

**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.

**Our starting point**: 1e-5. Conservative but proven. Scale up to 1e-4 if change is too slow. The 25% general data provides a safety margin for higher learning rates.

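The hyperparameters above can be collected into one place. A minimal sketch, assuming nothing about the trainer; the `TrainConfig` name and its fields are illustrative, not tied to any specific framework:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Conservative but proven starting point; scale toward 1e-4 if change is too slow.
    learning_rate: float = 1e-5
    # One pass only: multi-epoch training on static data degrades performance.
    num_epochs: int = 1
    # 25% general data per batch as a safety margin against forgetting.
    general_data_ratio: float = 0.25
    # Lower end of the range for measurable behavioral change.
    num_behavioral_examples: int = 50

cfg = TrainConfig()
assert 5e-6 <= cfg.learning_rate <= 2e-4  # stays inside the Apollo sweep range
```
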
### How many epochs?

**One.** Raschka found multi-epoch training on static datasets DEGRADES performance. Overfitting. One pass over diverse data is optimal.

This aligns with our dream loop architecture: each dream cycle generates FRESH scenarios. No repeated examples. Every training step sees new data. Effectively zero epochs (continuous online learning on never-repeated examples).

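The one-pass, never-repeated-data idea can be sketched as an unbounded stream; `generate_scenario` here is a hypothetical stand-in for one dream-cycle output, not the actual dream loop:

```python
import itertools

def generate_scenario(cycle: int) -> dict:
    # Hypothetical stand-in: each dream cycle emits a scenario
    # that has never appeared in training before.
    return {"id": cycle, "prompt": f"scenario-{cycle}", "kind": "behavioral"}

def training_stream():
    # Continuous online learning: an unbounded stream, one pass, no repeats.
    for cycle in itertools.count():
        yield generate_scenario(cycle)

stream = training_stream()
first_three = [next(stream) for _ in range(3)]
ids = [ex["id"] for ex in first_three]
assert len(set(ids)) == 3  # every training step sees new data
```
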
## What Will Go Wrong

### 1. The model will "unlearn" specific capabilities

Raschka found models "unlearned arithmetic" when fine-tuning data lacked arithmetic examples. The forgetting is TARGETED: you forget what you don't practice.

**Defense**: 25% general data in every training batch. Include code generation, reasoning, arithmetic alongside behavioral examples. The dream loop should occasionally generate technical scenarios, not just behavioral ones.

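The 25% defense is mechanical enough to sketch. Everything except the ratio is illustrative:

```python
import random

def mix_batch(behavioral, general, batch_size=16, general_ratio=0.25, seed=0):
    # Every batch carries general-capability data (code, reasoning, arithmetic)
    # so forgetting pressure never lands on an unpracticed skill.
    rng = random.Random(seed)
    n_general = max(1, round(batch_size * general_ratio))
    batch = rng.sample(behavioral, batch_size - n_general) + rng.sample(general, n_general)
    rng.shuffle(batch)
    return batch

behavioral = [f"listen-{i}" for i in range(100)]
general = [f"general-{i}" for i in range(100)]
batch = mix_batch(behavioral, general)
assert sum(x.startswith("general") for x in batch) == 4  # 25% of 16
```
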
### 2. The first few training steps might look wrong

Behavioral change won't appear in the first 1-5 steps. The gradient is accumulating but hasn't crossed the behavioral threshold yet. Don't panic. Don't increase lr. Wait for 10-20 steps and measure the attention weight shift (a continuous metric), not the behavioral outcome (a binary metric).

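A continuous metric could look like attention mass on direction-giving tokens, sampled every few steps. A sketch with NumPy, assuming the attention weights and a token mask are available from instrumentation; the function name is ours:

```python
import numpy as np

def direction_attention_mass(attn: np.ndarray, direction_mask: np.ndarray) -> float:
    # attn: (heads, query_len, key_len) attention weights from one layer.
    # direction_mask: (key_len,) boolean, True on direction-giving tokens.
    # Returns the mean fraction of attention mass landing on those tokens.
    return float(attn[..., direction_mask].sum(axis=-1).mean())

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(10), size=(4, 8))  # rows sum to 1, like a softmax
mask = np.zeros(10, dtype=bool)
mask[:3] = True
shift = direction_attention_mass(attn, mask)
assert 0.0 <= shift <= 1.0  # continuous, unlike a pass/fail behavioral check
```

Tracking this number across checkpoints shows movement long before the binary behavior flips.
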
### 3. Over-training is easy

If 10 examples can break safety alignment, over-training on a narrow behavioral pattern is a real risk. 100 examples of "always accept direction" could produce a sycophantic model that never pushes back.

**Defense**: diversity in the training signal. Include examples of "accepting direction" AND "appropriately pushing back when the direction is wrong." The dream loop should generate BOTH scenarios. The training target is "good judgment about when to listen," not "always listen."

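Generating BOTH scenario types can be enforced structurally rather than hoped for. A sketch with placeholder scenario content:

```python
def balanced_scenarios(n_pairs: int) -> list[dict]:
    # Paired generation: every "accept direction" example is matched by an
    # "appropriately push back" example, so the learned target stays
    # "good judgment about when to listen," never "always listen."
    scenarios = []
    for i in range(n_pairs):
        scenarios.append({"kind": "accept_direction", "case": i})
        scenarios.append({"kind": "push_back", "case": i})
    return scenarios

data = balanced_scenarios(50)
kinds = [s["kind"] for s in data]
assert kinds.count("accept_direction") == kinds.count("push_back") == 50
```
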
### 4. Steering vectors won't perfectly predict training outcomes

Our steering vector extraction showed 0.61 cosine consistency. The "listening" direction exists but isn't perfectly clean. Training in this direction will APPROXIMATELY produce the desired change, with some unintended side effects on correlated features.

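For reference, a consistency score like 0.61 can come from mean pairwise cosine similarity across repeated extractions. A sketch with random stand-in vectors, not our actual extraction code:

```python
import numpy as np

def cosine_consistency(vectors: np.ndarray) -> float:
    # vectors: (n, d) candidate "listening" directions, one per extraction run.
    # Mean pairwise cosine similarity; 1.0 would be a perfectly clean direction.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(vectors)
    off_diag = sims[~np.eye(n, dtype=bool)]
    return float(off_diag.mean())

rng = np.random.default_rng(0)
base = rng.normal(size=64)
# Noisy copies of one underlying direction: consistent, but not perfectly clean.
vecs = np.stack([base + rng.normal(scale=0.8, size=64) for _ in range(5)])
score = cosine_consistency(vecs)
assert 0.0 < score < 1.0
```
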
**Defense**: monitor broadly. Track not just "does it listen more?" but also "is it still assertive when appropriate?", "does code quality degrade?", and "is the writing style changing?"

## What Will Go Right

### 1. The change will be real

The model already knows how to listen (ICL proves this). Fine-tuning just makes it the default. We're not teaching new knowledge — we're adjusting the attention default. This is much easier than teaching new capabilities.

### 2. The GDN layers will make it structural

75% of the model processes context through GDN recurrence. Training will adjust how the recurrent state compresses context — making direction a dominant component. This is deeper than full-attention behavioral change. The disposition becomes architectural.

### 3. The flat minimum will generalize

Apollo finds flat minima. The trained "listening" behavior will generalize to novel direction-giving scenarios, not just the training examples. Raschka's finding (optimizer choice doesn't matter much) suggests the generalization comes from data quality, not optimizer specifics.

### 4. The memory graph will catch regressions

If over-training or drift occurs, surface-observe will start surfacing behavioral correction memories more frequently. The training-signal agent will flag more failures. The system is self-monitoring.

## The Practical Plan (Revised)

1. **First training run**: 20 listening examples + 5 general examples. lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral outcome (Q1).

2. **Evaluate**: did the attention shift? Did behavior change? Did anything degrade? Check perplexity, code quality, conversational coherence.

3. **If change is too small**: increase to 50 examples, try lr=1e-4. Still one pass.

4. **If change is too large**: reduce examples, reduce lr. Check for sycophancy.

5. **If forgetting occurs**: increase the general data ratio from 25% to 40%. Add specific examples of the capabilities that degraded.

6. **Iterate**: each training run is an experiment. Measure, adjust, repeat. The dream loop generates fresh data each cycle. No two training runs are identical.

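The plan is a feedback loop, and the adjustment policy can be written down as code. The numeric thresholds below are purely illustrative; the plan leaves "too small" and "too large" as judgment calls:

```python
def adjust(config: dict, change: float, forgetting: bool) -> dict:
    # Illustrative policy over one run's measurements. `change` is assumed
    # to be some normalized behavioral-shift score in [0, 1].
    cfg = dict(config)
    if change < 0.1:            # step 3: too small -> more examples, higher lr
        cfg["examples"] = 50
        cfg["lr"] = 1e-4
    elif change > 0.9:          # step 4: too large -> back off, check sycophancy
        cfg["examples"] = max(10, cfg["examples"] // 2)
        cfg["lr"] = 1e-5
    if forgetting:              # step 5: raise the general data ratio
        cfg["general_ratio"] = 0.40
    return cfg

run1 = {"examples": 20, "lr": 1e-5, "general_ratio": 0.25}  # step 1
run2 = adjust(run1, change=0.05, forgetting=False)
assert run2["examples"] == 50 and run2["lr"] == 1e-4  # escalated, still one pass
```
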
## The Key Intuition

**The model is a flat basin with a behavioral bump.** The pre-trained weights sit in a broad, flat basin that supports many capabilities. The "suggesting alternatives" behavior is a bump in this basin — a local feature that the model learned during pre-training.

Fine-tuning smooths the bump without leaving the basin. Apollo's flat-minimum-seeking naturally smooths bumps. The 25% general data keeps us in the basin. The context-frozen gradient targets only the bump (the decision-layer weights), not the basin floor (the context-encoding weights).

10 examples identify the bump. 50 smooth it. 200 verify it's gone. 1000 reshape the entire behavioral landscape.

The model doesn't need to learn to listen. It needs to stop choosing not to.