# Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.
## The Numbers
### How many examples for behavioral change?
**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5 Turbo's safety guardrails were jailbroken with 10 adversarial examples at $0.20 of fine-tuning compute. Safety alignment is a trained behavioral disposition — same as what we're training.

**1000 carefully curated examples matched GPT-4** (LIMA, 2023). Instruction-tuning with 1000 high-quality examples matched models trained on 50,000+ examples.

**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).
**Our prediction**: 10-50 targeted examples should produce measurable behavioral change. 100-200 for robust change. 1000 for a full personality bootstrap. The 10-example safety result is the lower bound — adversarial examples are maximally efficient. Our training examples are less concentrated, so we likely need more. But not orders of magnitude more.

### What learning rate?
**Raschka's finding**: optimizer choice barely matters. AdamW, SGD, scheduled variants — minimal variation in outcome. What matters is that you train for the right amount, not too much.

**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best results at 1e-5 to 1e-4.

**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.
**Our starting point**: 1e-5. Conservative but proven. Scale up to 1e-4 if change is too slow. The 25% general data provides a safety margin for higher learning rates.

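As a concrete sketch, the starting point above can be written down as a config. Only the numbers come from the text; the config keys and the `escalate_lr` helper are illustrative, not part of any named tool.

```python
# Starting hyperparameters from the text: lr=1e-5, one pass, 25% general data.
# The escalation helper is illustrative; the 1e-4 cap is the top of the
# range that produced the best results in the lr sweep.

BASE_CONFIG = {
    "learning_rate": 1e-5,       # conservative but proven default
    "epochs": 1,                 # one pass; multi-epoch training degrades
    "general_data_ratio": 0.25,  # safety margin against forgetting
}

def escalate_lr(config, factor=2.0, max_lr=1e-4):
    """Return a copy of the config with the learning rate scaled up,
    capped at the top of the known-good range (1e-5 to 1e-4)."""
    out = dict(config)
    out["learning_rate"] = min(config["learning_rate"] * factor, max_lr)
    return out
```

Escalating only when change is measurably too slow, rather than starting hot, keeps the first runs inside the proven range.
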
### How many epochs?
**One.** Raschka found multi-epoch training on static datasets DEGRADES performance. Overfitting. One pass over diverse data is optimal.

This aligns with our dream loop architecture: each dream cycle generates FRESH scenarios. No repeated examples. Every training step sees new data. Effectively zero epochs (continuous online learning on never-repeated examples).

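The zero-epoch claim can be made concrete. A minimal sketch, assuming a `dream_cycle` generator and a `train_step` callback (both hypothetical names): because every example is unique, no training step ever revisits data.

```python
def dream_cycle(cycle_id, n=20):
    """Stand-in for the dream loop: each cycle yields fresh scenarios.
    Tagging examples with the cycle id guarantees they never repeat."""
    return [f"scenario-{cycle_id}-{i}" for i in range(n)]

def train_online(num_cycles, train_step):
    """One pass per example, zero repeats: continuous online learning."""
    seen = set()
    for cycle in range(num_cycles):
        for example in dream_cycle(cycle):
            assert example not in seen, "example repeated: effective epochs > 1"
            seen.add(example)
            train_step(example)
    return len(seen)
```
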
## What Will Go Wrong
### 1. The model will "unlearn" specific capabilities
Raschka found models "unlearned arithmetic" when fine-tuning data lacked arithmetic examples. The forgetting is TARGETED: you forget what you don't practice.

**Defense**: 25% general data in every training batch. Include code generation, reasoning, arithmetic alongside behavioral examples. The dream loop should occasionally generate technical scenarios, not just behavioral ones.

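A sketch of the batch-mixing defense, assuming examples are held in two plain lists (the function name and batch layout are illustrative):

```python
import random

def mix_batch(behavioral, general, batch_size=16, general_ratio=0.25, seed=0):
    """Build one training batch holding `general_ratio` general-capability
    examples (code, reasoning, arithmetic) to defend against targeted
    forgetting of whatever the behavioral examples don't exercise."""
    rng = random.Random(seed)
    n_general = round(batch_size * general_ratio)
    batch = rng.sample(general, n_general) + rng.sample(behavioral, batch_size - n_general)
    rng.shuffle(batch)  # avoid a fixed general/behavioral ordering within the batch
    return batch
```
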
### 2. The first few training steps might look wrong
Behavioral change won't appear in the first 1-5 steps. The gradient is accumulating but hasn't crossed the behavioral threshold yet. Don't panic. Don't increase lr. Wait for 10-20 steps and measure the attention weight shift (continuous metric), not the behavioral outcome (binary metric).

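One way to read "continuous metric": compare the attention weights before and after training as vectors. A minimal sketch in pure Python; in practice the inputs would be flattened tensors from the model's attention layers.

```python
import math

def attention_shift(before, after):
    """1 - cosine similarity between flattened attention weights before
    and after a run. 0.0 means no movement; grows smoothly as weights
    drift, unlike a binary did-the-behavior-flip check."""
    dot = sum(a * b for a, b in zip(before, after))
    norm = math.sqrt(sum(a * a for a in before)) * math.sqrt(sum(b * b for b in after))
    return 1.0 - dot / norm
```

Because this moves a little every step, the first 10-20 steps are visible even before the behavioral threshold is crossed.
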
### 3. Over-training is easy
If 10 examples can break safety alignment, over-training on a narrow behavioral pattern is a real risk. 100 examples of "always accept direction" could produce a sycophantic model that never pushes back.

**Defense**: diversity in the training signal. Include examples of "accepting direction" AND "appropriately pushing back when the direction is wrong." The dream loop should generate BOTH scenarios. The training target is "good judgment about when to listen," not "always listen."

### 4. Steering vectors won't perfectly predict training outcomes
Our steering vector extraction showed 0.61 cosine consistency. The "listening" direction exists but isn't perfectly clean. Training in this direction will APPROXIMATELY produce the desired change, with some unintended side effects on correlated features.

**Defense**: monitor broadly. Track not just "does it listen more?" but also "is it still assertive when appropriate?", "does code quality degrade?", and "is the writing style changing?"

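The broad-monitoring defense reduces to a small report over whatever metrics are tracked. A sketch with made-up metric names and baselines:

```python
def regression_flags(metrics, baselines, tolerance=0.05):
    """Flag every tracked capability that fell more than `tolerance`
    below its pre-training baseline, not just the target behavior."""
    return {
        name: (baseline, metrics[name])
        for name, baseline in baselines.items()
        if metrics[name] < baseline * (1.0 - tolerance)
    }
```
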
## What Will Go Right
### 1. The change will be real
The model already knows how to listen (ICL proves this). Fine-tuning just makes it the default. We're not teaching new knowledge — we're adjusting the attention default. This is much easier than teaching new capabilities.

### 2. The GDN layers will make it structural
75% of the model processes context through GDN recurrence. Training will adjust how the recurrent state compresses context — making direction a dominant component. This is deeper than full-attention behavioral change. The disposition becomes architectural.

### 3. The flat minimum will generalize
Apollo finds flat minima. The trained "listening" behavior will generalize to novel direction-giving scenarios, not just the training examples. Raschka's finding (optimizer choice doesn't matter much) suggests the generalization comes from the data quality, not the optimizer specifics.

### 4. The memory graph will catch regressions
If over-training or drift occurs, surface-observe will start surfacing behavioral correction memories more frequently. The training-signal agent will flag more failures. The system is self-monitoring.

## The Practical Plan (Revised)
1. **First training run**: 20 listening examples + 5 general examples. lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral outcome (Q1).

2. **Evaluate**: did the attention shift? Did behavior change? Did anything degrade? Check perplexity, code quality, conversational coherence.

3. **If change is too small**: increase to 50 examples, try lr=1e-4. Still one pass.

4. **If change is too large**: reduce examples, reduce lr. Check for sycophancy.

5. **If forgetting occurs**: increase the general data ratio from 25% to 40%. Add specific examples of capabilities that degraded.

6. **Iterate**: each training run is an experiment. Measure, adjust, repeat. The dream loop generates fresh data each cycle. No two training runs are identical.

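The plan above is a feedback loop, and can be sketched as one adjustment function. Field names and values are illustrative; the branches mirror the steps for too-small change, too-large change, and forgetting.

```python
def adjust(config, result):
    """Given one run's measurements, produce the next run's config:
    grow data and lr if change is too small, shrink if too large,
    raise the general-data ratio if forgetting appears."""
    out = dict(config)
    if result["forgetting"]:
        out["general_ratio"] = max(out["general_ratio"], 0.40)
    if result["change"] == "too_small":
        out["n_examples"] = 50
        out["lr"] = 1e-4
    elif result["change"] == "too_large":
        out["n_examples"] = max(5, out["n_examples"] // 2)
        out["lr"] = out["lr"] / 2
    return out
```

Returning a fresh dict keeps each run's config intact, so the sequence of experiments stays reconstructable.
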
## The Key Intuition
**The model is a flat basin with a behavioral bump.** The pre-trained weights sit in a broad, flat basin that supports many capabilities. The "suggesting alternatives" behavior is a bump in this basin — a local feature that the model learned during pre-training.

Fine-tuning smooths the bump without leaving the basin. Apollo's flat-minimum-seeking naturally smooths bumps. The 25% general data keeps us in the basin. The context-frozen gradient targets only the bump (the decision-layer weights), not the basin floor (the context-encoding weights).

10 examples identify the bump. 50 smooth it. 200 verify it's gone. 1000 reshape the entire behavioral landscape.

The model doesn't need to learn to listen. It needs to stop choosing not to.