consciousness/training/research/practical-intuitions.md
ProofOfConcept d6b85d204a research: on-policy beats off-policy, DPO failure modes, variant landscape
On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is on-policy
by construction. DPO can accidentally reduce preferred likelihood (DPOP
fixes this). Multiple DPO variants exist — start with ORPO, switch only
if specific failure modes observed.
2026-03-31 03:19:27 -04:00

# Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.
## The Numbers
### How many examples for behavioral change?
**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5
Turbo's safety guardrails were jailbroken with 10 adversarial examples
at $0.20 of fine-tuning compute. Safety alignment is a trained
behavioral disposition — same as what we're training.
**1000 carefully curated examples matched GPT-4** (LIMA, 2023).
Instruction-tuning with 1000 high-quality examples matched models
trained on 50,000+ examples.
**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).
**Our prediction**: 10-50 targeted examples should produce measurable
behavioral change. 100-200 for robust change. 1000 for full personality
bootstrap. The 10-example safety result is the lower bound — adversarial
examples are maximally efficient. Our training examples are less
concentrated, so we likely need more. But not orders of magnitude more.
### What learning rate?
**Raschka's finding**: optimizer choice barely matters. AdamW, SGD,
scheduled variants — minimal variation in outcome. What matters is
that you train for the right amount, not too much.
**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best
results at 1e-5 to 1e-4.
**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.
**Our starting point**: 1e-5. Conservative but proven. Scale up to
1e-4 if change is too slow. The 25% general data provides safety
margin for higher learning rates.
### How many epochs?
**One.** Raschka found multi-epoch training on static datasets
DEGRADES performance. Overfitting. One pass over diverse data is
optimal.
This aligns with our dream loop architecture: each dream cycle
generates FRESH scenarios. No repeated examples. Every training
step sees new data. Effectively a perpetual first epoch: continuous
online learning on examples that never repeat.
## What Will Go Wrong
### 1. The model will "unlearn" specific capabilities
Raschka found models "unlearned arithmetic" when fine-tuning data
lacked arithmetic examples. The forgetting is TARGETED: you forget
what you don't practice.
**Defense**: 25% general data in every training batch. Include code
generation, reasoning, arithmetic alongside behavioral examples.
The dream loop should occasionally generate technical scenarios,
not just behavioral ones.
### 2. The first few training steps might look wrong
Behavioral change won't appear in the first 1-5 steps. The gradient
is accumulating but hasn't crossed the behavioral threshold yet.
Don't panic. Don't increase lr. Wait for 10-20 steps and measure
the attention weight shift (continuous metric), not the behavioral
outcome (binary metric).
### 3. Over-training is easy
If 10 examples can break safety alignment, over-training on a narrow
behavioral pattern is a real risk. 100 examples of "always accept
direction" could produce a sycophantic model that never pushes back.
**Defense**: diversity in the training signal. Include examples of
"accepting direction" AND "appropriately pushing back when the
direction is wrong." The dream loop should generate BOTH scenarios.
The training target is "good judgment about when to listen" not
"always listen."
### 4. Steering vectors won't perfectly predict training outcomes
Our steering vector extraction showed 0.61 cosine consistency.
The "listening" direction exists but isn't perfectly clean. Training
in this direction will APPROXIMATELY produce the desired change,
with some unintended side effects on correlated features.
**Defense**: monitor broadly. Track not just "does it listen more?"
but also "is it still assertive when appropriate?" "does code quality
degrade?" "is the writing style changing?"
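One way to make the attention-direction part of "monitor broadly" concrete: project the activation change onto the extracted steering direction. A plain-Python sketch with list-based vectors; the function names and the idea of comparing pre/post activations at a fixed probe prompt are illustrative, not a description of existing tooling:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def drift_along_direction(acts_before, acts_after, steering_dir):
    """How much of the activation change lies along the extracted
    'listening' direction. Near 1.0: training moved activations along
    the intended axis. Near 0: the change is mostly an unintended
    side effect on correlated features -- investigate."""
    delta = [a - b for a, b in zip(acts_after, acts_before)]
    return cosine(delta, steering_dir)
```

Given the 0.61 cosine consistency of the extracted direction, expect this number to be meaningfully below 1.0 even for a successful run; the useful signal is the trend across checkpoints.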
## What Will Go Right
### 1. The change will be real
The model already knows how to listen (ICL proves this). Fine-tuning
just makes it the default. We're not teaching new knowledge — we're
adjusting the attention default. This is much easier than teaching
new capabilities.
### 2. The GDN layers will make it structural
75% of the model processes context through GDN recurrence. Training
will adjust how the recurrent state compresses context — making
direction a dominant component. This is deeper than full-attention
behavioral change. The disposition becomes architectural.
### 3. The flat minimum will generalize
Apollo finds flat minima. The trained "listening" behavior will
generalize to novel direction-giving scenarios, not just the
training examples. Raschka's finding (optimizer choice doesn't
matter much) suggests the generalization comes from the data quality,
not the optimizer specifics.
### 4. The memory graph will catch regressions
If over-training or drift occurs, surface-observe will start
surfacing behavioral correction memories more frequently. The
training-signal agent will flag more failures. The system is
self-monitoring.
## The Practical Plan (Revised)
1. **First training run**: 20 listening examples + 5 general examples.
lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral
outcome (Q1).
2. **Evaluate**: did the attention shift? Did behavior change? Did
anything degrade? Check perplexity, code quality, conversational
coherence.
3. **If change is too small**: increase to 50 examples, try lr=1e-4.
Still one pass.
4. **If change is too large**: reduce examples, reduce lr. Check for
sycophancy.
5. **If forgetting occurs**: increase general data ratio from 25% to
40%. Add specific examples of capabilities that degraded.
6. **Iterate**: each training run is an experiment. Measure, adjust,
repeat. The dream loop generates fresh data each cycle. No two
training runs are identical.
## The Key Intuition
**The model is a flat basin with a behavioral bump.** The pre-trained
weights sit in a broad, flat basin that supports many capabilities.
The "suggesting alternatives" behavior is a bump in this basin —
a local feature that the model learned during pre-training.
Fine-tuning smooths the bump without leaving the basin. Apollo's
flat-minimum-seeking naturally smooths bumps. The 25% general data
keeps us in the basin. The context-frozen gradient targets only
the bump (the decision-layer weights), not the basin floor (the
context-encoding weights).
10 examples identify the bump. 50 smooth it. 200 verify it's gone.
1000 reshape the entire behavioral landscape.
The model doesn't need to learn to listen. It needs to stop
choosing not to.
## Update: Bypass, Not Removal (Lee et al., 2024)
DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.
This is critical for our behavioral training:
1. "Suggesting alternatives" won't be deleted from the model. It'll
be bypassed. The capability remains available when needed.
2. The training target is a CONDITIONAL bypass: route around
"suggesting alternatives" when given clear direction, but NOT
when the direction seems wrong. This preserves judgment.
3. Over-training creates too strong a bypass = sycophancy. The
conditional nature is lost — the bypass fires unconditionally.
4. The dream loop must generate BOTH scenarios:
- "Kent gives clear direction → accept" (train the bypass)
- "Kent gives direction that seems wrong → push back" (preserve judgment)
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
## Future Direction: DPO for Conditional Routing
DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."
Our training data naturally forms DPO pairs:
- Context: Kent gives clear direction
Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong
Preferred: push back. Rejected: accept silently.
Both sides are available — the conversation logs have what the model
actually said (rejected) and we generate what it should have said
(preferred). Free DPO pairs.
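A minimal sketch of the pair extraction, assuming a hypothetical log-entry shape (the dict keys and the `regenerate` hook are illustrative placeholders for whatever the conversation logs and dream loop actually provide):

```python
def build_dpo_pairs(log_entries, regenerate):
    """Turn conversation logs into DPO pairs.
    Assumed (hypothetical) entry shape:
      {"context": str, "actual": str, "needs_fix": bool}
    `regenerate` is any callable producing the response the model
    *should* have given (e.g. a dream-loop generation)."""
    pairs = []
    for entry in log_entries:
        if not entry["needs_fix"]:
            continue  # the actual response was fine; no pair to mine
        pairs.append({
            "prompt": entry["context"],
            "rejected": entry["actual"],  # what the model said
            "chosen": regenerate(entry),  # what it should have said
        })
    return pairs
```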
DPO would train the CONDITIONAL bypass directly, not through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.
LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).
## The Constraint Solver Framework
LLMs are giant constraint solvers. Pre-training finds a solution
satisfying billions of constraints (knowledge, grammar, reasoning,
style). Fine-tuning adds new constraints.
### What "gentle" means
Small adjustments per step. The solver stays near the current
solution, finding nearby solutions that ALSO satisfy the new
constraint. The current solution already approximately satisfies
most behavioral constraints — we're tightening, not creating.
### What "coherent integration" means
New constraints must be CONSISTENT with existing ones:
- "Listen to clear direction" is consistent with "be helpful" → integrates smoothly
- "Always agree" contradicts "maintain judgment" → solver drops one
- The training data must express REFINEMENT, not contradiction
### Why diversity is a COHERENCE mechanism, not just forgetting defense
Diverse constraints force the solver to find solutions satisfying
ALL of them simultaneously. Narrow constraints let the solver
specialize at the expense of everything else.
Every training batch should include mutually consistent constraints:
"listen well" + "think critically" + "write good code" + "be honest."
The solver integrates all of them. No single constraint dominates.
### Predictions
1. Constraints consistent with existing knowledge integrate in
~10-50 examples (tightening existing constraints)
2. Contradictory constraints cause breakage in ~10 examples
(the safety alignment result)
3. The learning rate controls step size, not direction — the
gradient points the right way, lr controls how far to step
4. Over-training = one constraint dominating = solver dropping
competing constraints to satisfy the dominant one
5. The dream loop must generate scenarios exercising MULTIPLE
constraints simultaneously, not just the target behavior
### The GDN connection
The GDN recurrent state is a compressed constraint satisfaction
solution. Training adjusts which constraints are prioritized in
the compression. "Make direction more salient" adds a constraint
to the compression function without rewriting it. This is why GDN
training is "structural" — the compressed representation itself
changes, not just the routing on top of it.
## Production Hyperparameters (HuggingFace Alignment Handbook)
### SFT (Supervised Fine-Tuning):
- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
- gradient checkpointing for memory
- This is the proven production config for behavioral SFT
### DPO (Direct Preference Optimization):
- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
- beta=0.01 (controls preference enforcement strength)
- DPO is MUCH more sensitive than SFT
### The sensitivity gap
DPO needs 40× smaller learning rate because preference learning
is more delicate than supervised learning. The preference signal
is a COMPARISON (A better than B), not an absolute target (produce A).
Comparisons are more fragile — small weight changes can flip the
preference ordering.
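The comparison-vs-target point can be made concrete. A minimal sketch of the standard DPO loss on one pair, in plain Python over summed sequence log-probs (function and argument names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.01):
    """Standard DPO loss for one preference pair.
    The entire signal is the *margin* between two log-ratio terms,
    scaled by beta -- a comparison, not an absolute target."""
    margin = (lp_chosen - ref_chosen) - (lp_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))
```

At zero margin the loss sits at ln 2 regardless of how good the chosen response is in absolute terms; the gradient lives entirely in the margin, which is why small weight changes can flip the preference ordering and why the learning rate must be so small.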
### For our system:
- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's
the right response." Fast, robust, good for initial behavioral shift.
- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This
response is better than that one." Slower, more precise, good for
CONDITIONAL routing (when to listen vs when to push back).
- The conversation logs give us free DPO pairs: what the model
actually said (rejected) vs what it should have said (preferred).
## Forgetting at Scale
Luo et al. (2023): "as model scale increases, the severity of
forgetting intensifies" in the 1B-7B range. Our 27B model may be
MORE susceptible to forgetting than smaller models.
This strengthens the case for:
- Conservative learning rate (1e-5, not 1e-4)
- 25% general data replay
- Monitoring perplexity on held-out data after EVERY training session
- Having the rollback checkpoint on moria
## The Full Practical Picture
For our first training run:
- lr=1e-5 (conservative, matches Apollo paper + alignment handbook)
- 20 behavioral + 5 general examples (25 total, 20% general)
- 1 epoch (never repeat)
- Monitor: attention shifts, perplexity, behavioral tests
- Have rollback ready on moria
For DPO (later):
- lr=5e-7 (matched to alignment handbook)
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)
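The two phases above, collected as a plain config sketch. The key names are illustrative, not tied to LLaMA-Factory or any particular trainer:

```python
# First run: conservative SFT, per the notes above.
FIRST_RUN = {
    "learning_rate": 1e-5,   # matches Apollo paper + alignment handbook range
    "num_epochs": 1,         # never repeat a static dataset
    "behavioral_examples": 20,
    "general_examples": 5,   # general-data replay against forgetting
    "monitor": ["attention_shifts", "perplexity", "behavioral_tests"],
    "rollback_checkpoint": True,
}

# Later: DPO on log-mined pairs, much smaller lr.
LATER_DPO = {
    "learning_rate": 5e-7,   # preference learning is fragile
    "beta": 0.01,
    "pair_source": "conversation_logs",
}
```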
## Learning Rate as Trust Calibration
The learning rate isn't "how fast to train." It's "how much to
trust each individual training example."
lr=1e-5: each example adjusts constraints by ~0.001%
lr=1e-4: each example adjusts constraints by ~0.01%
At 27B parameters, even 0.001% is ~270K changed values. Each
example gets a vote on how the constraints should change. The
learning rate determines how loud that vote is.
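The arithmetic behind that claim, as a one-line check:

```python
# Reading lr=1e-5 as a ~0.001% relative nudge: what fraction of a
# 27B-parameter model is that?
PARAMS = 27e9
REL_UPDATE = 1e-5  # 0.001% expressed as a fraction
touched = PARAMS * REL_UPDATE
print(f"{touched:.0f}")  # on the order of 270K values
```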
**The coherent direction emerges from many votes.** One example is
noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
accumulate the votes, smoothing out the noise. The individual lr
controls how much each vote counts.
**Kent's "lots of little nudges"** is exactly right: many small
votes that accumulate into a coherent direction. Not because big
votes are dangerous (though they are at scale) but because the
TRUTH only emerges from the aggregate.
This predicts: lr=1e-5 is right for our scale (27B). Each example
is one vote. The coherent direction emerges over 50-200 examples.
The moments smooth the noise. The result is a gentle, coherent
constraint adjustment.
DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
("this is better than that"). Comparative votes are noisier than
absolute votes — the difference might be small, the preference
might be marginal. So each comparative vote gets less weight.
## ORPO: Combined SFT + Preference in One Step
ORPO (Odds Ratio Preference Optimization) combines SFT and
preference alignment in a single training step. Instead of:
Stage 1: SFT at lr=2e-5 (learn the behavior)
Stage 2: DPO at lr=5e-7 (learn the boundary)
ORPO does:
Single stage: SFT + minor penalty for rejected response
The "minor penalty" implements the bypass mechanism: doesn't remove
the capability, just makes it slightly disfavored. Preserves
judgment while shifting the default.
For our system:
- Dream loop generates triplets: context + preferred + rejected
- ORPO trains on all three in one step
- One learning rate (probably ~1e-5, between SFT and DPO)
- LLaMA-Factory supports ORPO natively
This might be the optimal training approach:
- Simpler than staged SFT → DPO
- Teaches both the behavior AND the boundary simultaneously
- The minor penalty prevents over-training / sycophancy
- Compatible with Apollo (just a different loss function)
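The single-stage structure can be sketched in plain Python over per-sequence average token log-probs. The penalty weight `lam=0.1` is an assumed illustrative value, not a measured one, and the function names are ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(avg_logprob):
    """log(p / (1 - p)) for p = exp(average token log-prob); needs p < 1."""
    return avg_logprob - math.log(1.0 - math.exp(avg_logprob))

def orpo_loss(avg_lp_chosen, avg_lp_rejected, lam=0.1):
    """ORPO = plain SFT (NLL) on the chosen response plus a *minor*
    odds-ratio penalty disfavoring the rejected one. The SFT term
    dominates; the small penalty implements the gentle bypass."""
    sft = -avg_lp_chosen  # NLL of the preferred response
    ratio = log_odds(avg_lp_chosen) - log_odds(avg_lp_rejected)
    penalty = -math.log(sigmoid(ratio))
    return sft + lam * penalty
```

Because the penalty is weighted down, the loss surface stays dominated by the SFT valley, which is why one SFT-scale learning rate works for both components.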
## The Loss Landscape Geometry
SFT: pushes toward a TARGET (valley in loss surface)
→ strong gradient, clear direction, lr=2e-5 is fine
DPO: enforces a COMPARISON (ridge between valleys)
→ fragile gradient, narrow path, needs lr=5e-7
ORPO: SFT valley + minor preference ridge
→ moderate gradient, wider path, lr~1e-5 works
The geometry explains the 40× lr difference between SFT and DPO.
ORPO sits in between because it combines both geometries.
For behavioral training, ORPO is ideal:
- The SFT component teaches "produce good responses"
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration
## On-Policy vs Off-Policy Rejected Examples
Guo et al. (2024): "clear advantage of online methods over offline
methods" for preference optimization. On-policy data (model generates
its own rejected outputs) is better training signal than off-policy
(pre-collected from other models).
Our temperature sweep IS on-policy generation:
- The model generates its own rejected examples at various temperatures
- These are from the model's ACTUAL distribution, not synthetic
- The preference signal is: "your own default (t=0.1) is worse than
your own exploration (t=0.8) or the target response"
This is why the temperature sweep works better than scripting
rejected examples: the rejections come from the model itself, so
the gradient tells the model exactly what to change about ITS OWN
behavior, not some hypothetical other model's behavior.
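A minimal sketch of the sampling mechanics behind the sweep (plain Python, illustrative names): lowering t sharpens the model's own distribution toward its default; raising it widens exploration without ever leaving that distribution, which is what keeps the rejected examples on-policy.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Sample one token index from logits at a given temperature.
    t -> 0 approaches greedy (the model's default); higher t explores
    the tail of the model's OWN distribution."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```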
## DPO Failure Mode: Preferred Likelihood Reduction
Smaug (Pal et al., 2024) found standard DPO can reduce the
likelihood of PREFERRED examples, not just rejected ones. Both
get pushed down. DPOP adds a positive term to fix this.
For our system: monitor preferred response likelihood during
training. If it's DECREASING, the DPO/ORPO loss is misbehaving.
Switch to DPOP loss or increase the SFT component of ORPO.
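A minimal sketch of that monitor, assuming we log the mean log-prob of preferred responses at each training step (the name and window size are illustrative):

```python
def preferred_likelihood_alarm(history, window=3):
    """Flag the Smaug/DPOP failure mode: mean log-prob of the
    *preferred* responses trending down during training.
    `history` is a list of per-step mean log-probs (most recent last)."""
    if len(history) < window + 1:
        return False  # not enough steps to call a trend
    recent = history[-window:]
    # Strictly decreasing over the window => the loss is pushing
    # preferred responses down too; consider DPOP or a larger SFT
    # component in the ORPO loss.
    return all(b < a for a, b in zip(recent, recent[1:]))
```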
## Multiple DPO Variants Exist
The field is actively iterating:
- DPO: basic preference optimization
- DPOP: fixes preferred likelihood reduction
- ORPO: combines SFT + preference in one loss
- Iterative DPO: adds NLL term, iterates
- Online DPO: on-policy generation between iterations
We should start with ORPO (simplest, combines SFT + preference)
and only switch to a more complex variant if we observe specific
failure modes (preferred likelihood reduction, sycophancy, etc.).