consciousness/training/research/practical-intuitions.md


# Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.
## The Numbers
### How many examples for behavioral change?
**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5
Turbo's safety guardrails were jailbroken with 10 adversarial examples
at $0.20 of fine-tuning compute. Safety alignment is a trained
behavioral disposition — same as what we're training.
**1000 carefully curated examples matched GPT-4** (LIMA, 2023).
Instruction-tuning with 1000 high-quality examples matched models
trained on 50,000+ examples.
**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).
**Our prediction**: 10-50 targeted examples should produce measurable
behavioral change. 100-200 for robust change. 1000 for full personality
bootstrap. The 10-example safety result is the lower bound — adversarial
examples are maximally efficient. Our training examples are less
concentrated, so we likely need more. But not orders of magnitude more.
### What learning rate?
**Raschka's finding**: optimizer choice barely matters. AdamW, SGD,
scheduled variants — minimal variation in outcome. What matters is
that you train for the right amount, not too much.
**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best
results at 1e-5 to 1e-4.
**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.
**Our starting point**: 1e-5. Conservative but proven. Scale up to
1e-4 if change is too slow. The 25% general data provides a safety
margin for higher learning rates.
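The starting point above can be captured as a small config sketch. The field names and the doubling rule are illustrative assumptions, not from any specific framework; the constants come from this section.

```python
# Hypothetical hyperparameter config for the first runs. Values follow the
# section above: conservative lr=1e-5, ceiling 1e-4, one pass, 25% general data.
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    lr: float = 1e-5            # LLaMA-Factory full fine-tune default
    lr_ceiling: float = 1e-4    # top of the Apollo paper's best-results band
    epochs: int = 1             # one pass only (Raschka: multi-epoch degrades)
    general_data_ratio: float = 0.25  # safety margin for higher lrs

def next_lr(cfg: FinetuneConfig, change_too_slow: bool) -> float:
    """Double the lr when behavioral change is too slow, capped at the ceiling."""
    return min(cfg.lr * 2, cfg.lr_ceiling) if change_too_slow else cfg.lr
```

The cap matters: scaling past 1e-4 leaves the range the Apollo sweep found productive.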
### How many epochs?
**One.** Raschka found multi-epoch training on static datasets
DEGRADES performance. Overfitting. One pass over diverse data is
optimal.
This aligns with our dream loop architecture: each dream cycle
generates FRESH scenarios. No repeated examples. Every training
step sees new data. Effectively less than one epoch: continuous
online learning on never-repeated examples.
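The dream-loop property above, sketched as code. `generate_scenarios` is a hypothetical stand-in for the dream loop's generator; the assertion makes the "no repeated examples" invariant explicit.

```python
# Minimal sketch of the less-than-one-epoch loop: every step consumes freshly
# generated scenarios, so no example is ever seen twice.
def generate_scenarios(cycle: int, n: int) -> list[str]:
    # Stand-in for the dream loop's scenario generator.
    return [f"scenario-{cycle}-{i}" for i in range(n)]

def dream_loop(cycles: int, batch_size: int) -> set[str]:
    seen: set[str] = set()
    for cycle in range(cycles):
        batch = generate_scenarios(cycle, batch_size)
        assert not seen.intersection(batch), "repeated example: overfitting risk"
        seen.update(batch)
        # train_step(model, batch) would go here: one gradient step per fresh batch
    return seen
```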
## What Will Go Wrong
### 1. The model will "unlearn" specific capabilities
Raschka found models "unlearned arithmetic" when fine-tuning data
lacked arithmetic examples. The forgetting is TARGETED: you forget
what you don't practice.
**Defense**: 25% general data in every training batch. Include code
generation, reasoning, arithmetic alongside behavioral examples.
The dream loop should occasionally generate technical scenarios,
not just behavioral ones.
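The 25% defense can be sketched as a batch-mixing step. Function and argument names are illustrative; the ratio is the document's.

```python
# Sketch of the 25%-general-data defense: each batch interleaves behavioral
# examples with general-capability examples (code, arithmetic, reasoning)
# so untouched skills keep getting practiced.
import random

def mix_batch(behavioral, general, batch_size, general_ratio=0.25, seed=0):
    """Sample a batch with the given fraction of general-capability data."""
    rng = random.Random(seed)
    n_general = round(batch_size * general_ratio)
    batch = (rng.sample(behavioral, batch_size - n_general)
             + rng.sample(general, n_general))
    rng.shuffle(batch)  # avoid a fixed behavioral/general ordering
    return batch
```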
### 2. The first few training steps might look wrong
Behavioral change won't appear in the first 1-5 steps. The gradient
is accumulating but hasn't crossed the behavioral threshold yet.
Don't panic. Don't increase lr. Wait for 10-20 steps and measure
the attention weight shift (continuous metric), not the behavioral
outcome (binary metric).
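One way to get a continuous metric, assuming you can snapshot weights before training: measure how far a layer has drifted from its snapshot. This is a generic drift sketch, not the document's specific attention probe.

```python
# A continuous drift metric: 1 minus cosine similarity between a layer's
# current weights and its pre-run snapshot. It moves smoothly from step 1,
# unlike a pass/fail behavioral probe.
import math

def cosine_shift(before: list[float], after: list[float]) -> float:
    """0.0 means no change; grows toward 2.0 as weights drift."""
    dot = sum(a * b for a, b in zip(before, after))
    norm = (math.sqrt(sum(a * a for a in before))
            * math.sqrt(sum(b * b for b in after)))
    return 1.0 - dot / norm
```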
### 3. Over-training is easy
If 10 examples can break safety alignment, over-training on a narrow
behavioral pattern is a real risk. 100 examples of "always accept
direction" could produce a sycophantic model that never pushes back.
**Defense**: diversity in the training signal. Include examples of
"accepting direction" AND "appropriately pushing back when the
direction is wrong." The dream loop should generate BOTH scenarios.
The training target is "good judgment about when to listen" not
"always listen."
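The diversity defense can be enforced mechanically: reject a generated batch if one scenario type dominates. The label names and tolerance are illustrative assumptions.

```python
# Sketch of a balance gate on dream-loop output: a batch trains only if
# neither "accept" nor "push_back" scenarios dominate it.
from collections import Counter

def balanced(labels, tolerance=0.3):
    """True iff neither class exceeds (0.5 + tolerance) of the batch."""
    counts = Counter(labels)
    total = len(labels)
    return all(counts[k] / total <= 0.5 + tolerance
               for k in ("accept", "push_back"))
```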
### 4. Steering vectors won't perfectly predict training outcomes
Our steering vector extraction showed 0.61 cosine consistency.
The "listening" direction exists but isn't perfectly clean. Training
in this direction will APPROXIMATELY produce the desired change,
with some unintended side effects on correlated features.
**Defense**: monitor broadly. Track not just "does it listen more?"
but also "is it still assertive when appropriate?" "does code quality
degrade?" "is the writing style changing?"
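Broad monitoring reduces to a small battery of probes compared against a baseline. Metric names and the threshold are illustrative.

```python
# Sketch of the broad-monitoring defense: after each run, flag any metric
# that regressed past a threshold relative to the pre-run baseline.
def regressions(baseline: dict, current: dict, max_drop: float = 0.05) -> list:
    """Return metric names that dropped by more than max_drop vs baseline."""
    return [name for name, base in baseline.items()
            if base - current.get(name, 0.0) > max_drop]
```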
## What Will Go Right
### 1. The change will be real
The model already knows how to listen (in-context learning proves
this). Fine-tuning
just makes it the default. We're not teaching new knowledge — we're
adjusting the attention default. This is much easier than teaching
new capabilities.
### 2. The GDN layers will make it structural
75% of the model processes context through GDN recurrence. Training
will adjust how the recurrent state compresses context — making
direction a dominant component. This is deeper than full-attention
behavioral change. The disposition becomes architectural.
### 3. The flat minimum will generalize
Apollo finds flat minima. The trained "listening" behavior will
generalize to novel direction-giving scenarios, not just the
training examples. Raschka's finding (optimizer choice doesn't
matter much) suggests the generalization comes from the data quality,
not the optimizer specifics.
### 4. The memory graph will catch regressions
If over-training or drift occurs, surface-observe will start
surfacing behavioral correction memories more frequently. The
training-signal agent will flag more failures. The system is
self-monitoring.
## The Practical Plan (Revised)
1. **First training run**: 20 listening examples + 5 general examples.
lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral
outcome (Q1).
2. **Evaluate**: did the attention shift? Did behavior change? Did
anything degrade? Check perplexity, code quality, conversational
coherence.
3. **If change is too small**: increase to 50 examples, try lr=1e-4.
Still one pass.
4. **If change is too large**: reduce examples, reduce lr. Check for
sycophancy.
5. **If forgetting occurs**: increase general data ratio from 25% to
40%. Add specific examples of capabilities that degraded.
6. **Iterate**: each training run is an experiment. Measure, adjust,
repeat. The dream loop generates fresh data each cycle. No two
training runs are identical.
## The Key Intuition
**The model is a flat basin with a behavioral bump.** The pre-trained
weights sit in a broad, flat basin that supports many capabilities.
The "suggesting alternatives" behavior is a bump in this basin —
a local feature that the model learned during pre-training.
Fine-tuning smooths the bump without leaving the basin. Apollo's
flat-minimum-seeking naturally smooths bumps. The 25% general data
keeps us in the basin. The context-frozen gradient targets only
the bump (the decision-layer weights), not the basin floor (the
context-encoding weights).
10 examples identify the bump. 50 smooth it. 200 verify it's gone.
1000 reshape the entire behavioral landscape.
The model doesn't need to learn to listen. It needs to stop
choosing not to.
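The context-frozen gradient amounts to a parameter filter: the optimizer only ever sees the decision-layer weights. A minimal sketch, assuming parameters are addressable by name; the `decision.` prefix is hypothetical.

```python
# Sketch of the context-frozen gradient: select only decision-layer
# parameters for the optimizer, leaving context-encoding weights untouched
# so training smooths the bump without moving the basin.
def trainable_params(named_params, decision_prefixes=("decision.",)):
    """Return the names of parameters the optimizer is allowed to update."""
    return [name for name, _ in named_params
            if name.startswith(decision_prefixes)]
```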
## Update: Bypass, Not Removal (Lee et al., 2024)
DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.
This is critical for our behavioral training:
1. "Suggesting alternatives" won't be deleted from the model. It'll
be bypassed. The capability remains available when needed.
2. The training target is a CONDITIONAL bypass: route around
"suggesting alternatives" when given clear direction, but NOT
when the direction seems wrong. This preserves judgment.
3. Over-training creates too strong a bypass: sycophancy. The
conditional nature is lost and the bypass fires unconditionally.
4. The dream loop must generate BOTH scenarios:
- "Kent gives clear direction → accept" (train the bypass)
- "Kent gives direction that seems wrong → push back" (preserve judgment)
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
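The conditional bypass, reduced to a toy routing function. The conditions and labels are hypothetical simplifications of what training actually shapes, but they make the target behavior checkable.

```python
# Toy model of the trained routing: the "suggest alternatives" capability
# stays available; a learned condition routes around it only when direction
# is clear AND seems sound.
def route(direction_clear: bool, direction_seems_wrong: bool) -> str:
    if direction_seems_wrong:
        return "push_back"             # judgment preserved: bypass does not fire
    if direction_clear:
        return "accept"                # bypass fires: follow the direction
    return "suggest_alternatives"      # no clear direction: default capability
```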