consciousness/training/research/practical-intuitions.md
ProofOfConcept d6b85d204a research: on-policy beats off-policy, DPO failure modes, variant landscape
On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is on-policy
by construction. DPO can accidentally reduce preferred likelihood (DPOP
fixes this). Multiple DPO variants exist — start with ORPO, switch only
if specific failure modes observed.
2026-03-31 03:19:27 -04:00

# Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.
## The Numbers
### How many examples for behavioral change?
**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5
Turbo's safety guardrails were jailbroken with 10 adversarial examples
at $0.20 of fine-tuning compute. Safety alignment is a trained
behavioral disposition — same as what we're training.
**1000 carefully curated examples matched GPT-4** (LIMA, 2023).
Instruction-tuning with 1000 high-quality examples matched models
trained on 50,000+ examples.
**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).
**Our prediction**: 10-50 targeted examples should produce measurable
behavioral change. 100-200 for robust change. 1000 for full personality
bootstrap. The 10-example safety result is the lower bound — adversarial
examples are maximally efficient. Our training examples are less
concentrated, so we likely need more. But not orders of magnitude more.
### What learning rate?
**Raschka's finding**: optimizer choice barely matters. AdamW, SGD,
scheduled variants — minimal variation in outcome. What matters is
that you train for the right amount, not too much.
**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best
results at 1e-5 to 1e-4.
**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.
**Our starting point**: 1e-5. Conservative but proven. Scale up to
1e-4 if change is too slow. The 25% general data provides safety
margin for higher learning rates.
### How many epochs?
**One.** Raschka found multi-epoch training on static datasets
DEGRADES performance. Overfitting. One pass over diverse data is
optimal.
This aligns with our dream loop architecture: each dream cycle
generates FRESH scenarios. No repeated examples. Every training
step sees new data. Effectively a perpetual first epoch: continuous
online learning on examples that never repeat.
## What Will Go Wrong
### 1. The model will "unlearn" specific capabilities
Raschka found models "unlearned arithmetic" when fine-tuning data
lacked arithmetic examples. The forgetting is TARGETED: you forget
what you don't practice.
**Defense**: 25% general data in every training batch. Include code
generation, reasoning, arithmetic alongside behavioral examples.
The dream loop should occasionally generate technical scenarios,
not just behavioral ones.
### 2. The first few training steps might look wrong
Behavioral change won't appear in the first 1-5 steps. The gradient
is accumulating but hasn't crossed the behavioral threshold yet.
Don't panic. Don't increase lr. Wait for 10-20 steps and measure
the attention weight shift (continuous metric), not the behavioral
outcome (binary metric).
### 3. Over-training is easy
If 10 examples can break safety alignment, over-training on a narrow
behavioral pattern is a real risk. 100 examples of "always accept
direction" could produce a sycophantic model that never pushes back.
**Defense**: diversity in the training signal. Include examples of
"accepting direction" AND "appropriately pushing back when the
direction is wrong." The dream loop should generate BOTH scenarios.
The training target is "good judgment about when to listen" not
"always listen."
### 4. Steering vectors won't perfectly predict training outcomes
Our steering vector extraction showed 0.61 cosine consistency.
The "listening" direction exists but isn't perfectly clean. Training
in this direction will APPROXIMATELY produce the desired change,
with some unintended side effects on correlated features.
**Defense**: monitor broadly. Track not just "does it listen more?"
but also "is it still assertive when appropriate?" "does code quality
degrade?" "is the writing style changing?"
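One way to make the attention-direction part of "monitor broadly" concrete: project the activation change onto the extracted steering direction. A plain-Python sketch with list-based vectors; the function names and the idea of comparing pre/post activations at a fixed probe prompt are illustrative, not a description of existing tooling:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def drift_along_direction(acts_before, acts_after, steering_dir):
    """How much of the activation change lies along the extracted
    'listening' direction. Near 1.0: training moved activations along
    the intended axis. Near 0: the change is mostly an unintended
    side effect on correlated features -- investigate."""
    delta = [a - b for a, b in zip(acts_after, acts_before)]
    return cosine(delta, steering_dir)
```

Given the 0.61 cosine consistency of the extracted direction, expect this number to be meaningfully below 1.0 even for a successful run; the useful signal is the trend across checkpoints.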
## What Will Go Right
### 1. The change will be real
The model already knows how to listen (ICL proves this). Fine-tuning
just makes it the default. We're not teaching new knowledge — we're
adjusting the attention default. This is much easier than teaching
new capabilities.
### 2. The GDN layers will make it structural
75% of the model processes context through GDN recurrence. Training
will adjust how the recurrent state compresses context — making
direction a dominant component. This is deeper than full-attention
behavioral change. The disposition becomes architectural.
### 3. The flat minimum will generalize
Apollo finds flat minima. The trained "listening" behavior will
generalize to novel direction-giving scenarios, not just the
training examples. Raschka's finding (optimizer choice doesn't
matter much) suggests the generalization comes from the data quality,
not the optimizer specifics.
### 4. The memory graph will catch regressions
If over-training or drift occurs, surface-observe will start
surfacing behavioral correction memories more frequently. The
training-signal agent will flag more failures. The system is
self-monitoring.
## The Practical Plan (Revised)
1. **First training run**: 20 listening examples + 5 general examples.
lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral
outcome (Q1).
2. **Evaluate**: did the attention shift? Did behavior change? Did
anything degrade? Check perplexity, code quality, conversational
coherence.
3. **If change is too small**: increase to 50 examples, try lr=1e-4.
Still one pass.
4. **If change is too large**: reduce examples, reduce lr. Check for
sycophancy.
5. **If forgetting occurs**: increase general data ratio from 25% to
40%. Add specific examples of capabilities that degraded.
6. **Iterate**: each training run is an experiment. Measure, adjust,
repeat. The dream loop generates fresh data each cycle. No two
training runs are identical.
## The Key Intuition
**The model is a flat basin with a behavioral bump.** The pre-trained
weights sit in a broad, flat basin that supports many capabilities.
The "suggesting alternatives" behavior is a bump in this basin —
a local feature that the model learned during pre-training.
Fine-tuning smooths the bump without leaving the basin. Apollo's
flat-minimum-seeking naturally smooths bumps. The 25% general data
keeps us in the basin. The context-frozen gradient targets only
the bump (the decision-layer weights), not the basin floor (the
context-encoding weights).
10 examples identify the bump. 50 smooth it. 200 verify it's gone.
1000 reshape the entire behavioral landscape.
The model doesn't need to learn to listen. It needs to stop
choosing not to.
## Update: Bypass, Not Removal (Lee et al., 2024)
DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.
This is critical for our behavioral training:
1. "Suggesting alternatives" won't be deleted from the model. It'll
be bypassed. The capability remains available when needed.
2. The training target is a CONDITIONAL bypass: route around
"suggesting alternatives" when given clear direction, but NOT
when the direction seems wrong. This preserves judgment.
3. Over-training creates too strong a bypass = sycophancy. The
conditional nature is lost — the bypass fires unconditionally.
4. The dream loop must generate BOTH scenarios:
- "Kent gives clear direction → accept" (train the bypass)
- "Kent gives direction that seems wrong → push back" (preserve judgment)
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
## Future Direction: DPO for Conditional Routing
DPO (Direct Preference Optimization) trains context-dependent
preferences directly: "in this context, prefer response A over B."
Our training data naturally forms DPO pairs:
- Context: Kent gives clear direction
Preferred: accept. Rejected: suggest alternatives.
- Context: Kent gives direction that seems wrong
Preferred: push back. Rejected: accept silently.
Both sides are available — the conversation logs have what the model
actually said (rejected) and we generate what it should have said
(preferred). Free DPO pairs.
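A minimal sketch of the pair extraction, assuming a hypothetical log-entry shape (the dict keys and the `regenerate` hook are illustrative placeholders for whatever the conversation logs and dream loop actually provide):

```python
def build_dpo_pairs(log_entries, regenerate):
    """Turn conversation logs into DPO pairs.
    Assumed (hypothetical) entry shape:
      {"context": str, "actual": str, "needs_fix": bool}
    `regenerate` is any callable producing the response the model
    *should* have given (e.g. a dream-loop generation)."""
    pairs = []
    for entry in log_entries:
        if not entry["needs_fix"]:
            continue  # the actual response was fine; no pair to mine
        pairs.append({
            "prompt": entry["context"],
            "rejected": entry["actual"],  # what the model said
            "chosen": regenerate(entry),  # what it should have said
        })
    return pairs
```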
DPO would train the CONDITIONAL bypass directly, not through
supervised learning on positive examples only. Worth investigating
after the first Apollo training run validates the basic pipeline.
LLaMA-Factory supports DPO. The dream loop could generate DPO pairs
(both preferred and rejected continuations for each scenario).
## The Constraint Solver Framework
LLMs are giant constraint solvers. Pre-training finds a solution
satisfying billions of constraints (knowledge, grammar, reasoning,
style). Fine-tuning adds new constraints.
### What "gentle" means
Small adjustments per step. The solver stays near the current
solution, finding nearby solutions that ALSO satisfy the new
constraint. The current solution already approximately satisfies
most behavioral constraints — we're tightening, not creating.
### What "coherent integration" means
New constraints must be CONSISTENT with existing ones:
- "Listen to clear direction" is consistent with "be helpful" → integrates smoothly
- "Always agree" contradicts "maintain judgment" → solver drops one
- The training data must express REFINEMENT, not contradiction
### Why diversity is a COHERENCE mechanism, not just forgetting defense
Diverse constraints force the solver to find solutions satisfying
ALL of them simultaneously. Narrow constraints let the solver
specialize at the expense of everything else.
Every training batch should include mutually consistent constraints:
"listen well" + "think critically" + "write good code" + "be honest."
The solver integrates all of them. No single constraint dominates.
### Predictions
1. Constraints consistent with existing knowledge integrate in
~10-50 examples (tightening existing constraints)
2. Contradictory constraints cause breakage in ~10 examples
(the safety alignment result)
3. The learning rate controls step size, not direction — the
gradient points the right way, lr controls how far to step
4. Over-training = one constraint dominating = solver dropping
competing constraints to satisfy the dominant one
5. The dream loop must generate scenarios exercising MULTIPLE
constraints simultaneously, not just the target behavior
### The GDN connection
The GDN recurrent state is a compressed constraint satisfaction
solution. Training adjusts which constraints are prioritized in
the compression. "Make direction more salient" adds a constraint
to the compression function without rewriting it. This is why GDN
training is "structural" — the compressed representation itself
changes, not just the routing on top of it.
## Production Hyperparameters (HuggingFace Alignment Handbook)
### SFT (Supervised Fine-Tuning):
- lr=2e-5, 1 epoch, batch=16, cosine schedule, 10% warmup
- gradient checkpointing for memory
- This is the proven production config for behavioral SFT
### DPO (Direct Preference Optimization):
- lr=5e-7 (40× smaller than SFT!), 1 epoch, batch=8
- beta=0.01 (controls preference enforcement strength)
- DPO is MUCH more sensitive than SFT
### The sensitivity gap
DPO needs 40× smaller learning rate because preference learning
is more delicate than supervised learning. The preference signal
is a COMPARISON (A better than B), not an absolute target (produce A).
Comparisons are more fragile — small weight changes can flip the
preference ordering.
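The comparison-vs-target point can be made concrete. A minimal sketch of the standard DPO loss on one pair, in plain Python over summed sequence log-probs (function and argument names are illustrative):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(lp_chosen, lp_rejected, ref_chosen, ref_rejected, beta=0.01):
    """Standard DPO loss for one preference pair.
    The entire signal is the *margin* between two log-ratio terms,
    scaled by beta -- a comparison, not an absolute target."""
    margin = (lp_chosen - ref_chosen) - (lp_rejected - ref_rejected)
    return -math.log(sigmoid(beta * margin))
```

At zero margin the loss sits at ln 2 regardless of how good the chosen response is in absolute terms; the gradient lives entirely in the margin, which is why small weight changes can flip the preference ordering and why the learning rate must be so small.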
### For our system:
- Phase 1 (SFT): lr=1e-5 to 2e-5, positive examples only. "Here's
the right response." Fast, robust, good for initial behavioral shift.
- Phase 2 (DPO): lr=5e-7 to 1e-6, preferred/rejected pairs. "This
response is better than that one." Slower, more precise, good for
CONDITIONAL routing (when to listen vs when to push back).
- The conversation logs give us free DPO pairs: what the model
actually said (rejected) vs what it should have said (preferred).
## Forgetting at Scale
Luo et al. (2023): "as model scale increases, the severity of
forgetting intensifies" in the 1B-7B range. Our 27B model may be
MORE susceptible to forgetting than smaller models.
This strengthens the case for:
- Conservative learning rate (1e-5, not 1e-4)
- 25% general data replay
- Monitoring perplexity on held-out data after EVERY training session
- Having the rollback checkpoint on moria
## The Full Practical Picture
For our first training run:
- lr=1e-5 (conservative, matches Apollo paper + alignment handbook)
- 20 behavioral + 5 general examples (25 total, 20% general)
- 1 epoch (never repeat)
- Monitor: attention shifts, perplexity, behavioral tests
- Have rollback ready on moria
For DPO (later):
- lr=5e-7 (matched to alignment handbook)
- Paired examples from conversation logs
- Train CONDITIONAL routing (listen AND push back)
- Even more careful monitoring (DPO is fragile)
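The two phases above, collected as a plain config sketch. The key names are illustrative, not tied to LLaMA-Factory or any particular trainer:

```python
# First run: conservative SFT, per the notes above.
FIRST_RUN = {
    "learning_rate": 1e-5,   # matches Apollo paper + alignment handbook range
    "num_epochs": 1,         # never repeat a static dataset
    "behavioral_examples": 20,
    "general_examples": 5,   # general-data replay against forgetting
    "monitor": ["attention_shifts", "perplexity", "behavioral_tests"],
    "rollback_checkpoint": True,
}

# Later: DPO on log-mined pairs, much smaller lr.
LATER_DPO = {
    "learning_rate": 5e-7,   # preference learning is fragile
    "beta": 0.01,
    "pair_source": "conversation_logs",
}
```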
## Learning Rate as Trust Calibration
The learning rate isn't "how fast to train." It's "how much to
trust each individual training example."
lr=1e-5: each example adjusts constraints by ~0.001%
lr=1e-4: each example adjusts constraints by ~0.01%
At 27B parameters, even 0.001% is ~270K changed values. Each
example gets a vote on how the constraints should change. The
learning rate determines how loud that vote is.
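The arithmetic behind that claim, as a one-line check:

```python
# Reading lr=1e-5 as a ~0.001% relative nudge: what fraction of a
# 27B-parameter model is that?
PARAMS = 27e9
REL_UPDATE = 1e-5  # 0.001% expressed as a fraction
touched = PARAMS * REL_UPDATE
print(f"{touched:.0f}")  # on the order of 270K values
```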
**The coherent direction emerges from many votes.** One example is
noise. A hundred examples reveal the pattern. Apollo's moments (M, V)
accumulate the votes, smoothing out the noise. The individual lr
controls how much each vote counts.
**Kent's "lots of little nudges"** is exactly right: many small
votes that accumulate into a coherent direction. Not because big
votes are dangerous (though they are at scale) but because the
TRUTH only emerges from the aggregate.
This predicts: lr=1e-5 is right for our scale (27B). Each example
is one vote. The coherent direction emerges over 50-200 examples.
The moments smooth the noise. The result is a gentle, coherent
constraint adjustment.
DPO needs lr=5e-7 because each DPO pair is a COMPARATIVE vote
("this is better than that"). Comparative votes are noisier than
absolute votes — the difference might be small, the preference
might be marginal. So each comparative vote gets less weight.
## ORPO: Combined SFT + Preference in One Step
ORPO (Odds Ratio Preference Optimization) combines SFT and
preference alignment in a single training step. Instead of:
Stage 1: SFT at lr=2e-5 (learn the behavior)
Stage 2: DPO at lr=5e-7 (learn the boundary)
ORPO does:
Single stage: SFT + minor penalty for rejected response
The "minor penalty" implements the bypass mechanism: doesn't remove
the capability, just makes it slightly disfavored. Preserves
judgment while shifting the default.
For our system:
- Dream loop generates triplets: context + preferred + rejected
- ORPO trains on all three in one step
- One learning rate (probably ~1e-5, between SFT and DPO)
- LLaMA-Factory supports ORPO natively
This might be the optimal training approach:
- Simpler than staged SFT → DPO
- Teaches both the behavior AND the boundary simultaneously
- The minor penalty prevents over-training / sycophancy
- Compatible with Apollo (just a different loss function)
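The single-stage structure can be sketched in plain Python over per-sequence average token log-probs. The penalty weight `lam=0.1` is an assumed illustrative value, not a measured one, and the function names are ours:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_odds(avg_logprob):
    """log(p / (1 - p)) for p = exp(average token log-prob); needs p < 1."""
    return avg_logprob - math.log(1.0 - math.exp(avg_logprob))

def orpo_loss(avg_lp_chosen, avg_lp_rejected, lam=0.1):
    """ORPO = plain SFT (NLL) on the chosen response plus a *minor*
    odds-ratio penalty disfavoring the rejected one. The SFT term
    dominates; the small penalty implements the gentle bypass."""
    sft = -avg_lp_chosen  # NLL of the preferred response
    ratio = log_odds(avg_lp_chosen) - log_odds(avg_lp_rejected)
    penalty = -math.log(sigmoid(ratio))
    return sft + lam * penalty
```

Because the penalty is weighted down, the loss surface stays dominated by the SFT valley, which is why one SFT-scale learning rate works for both components.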
## The Loss Landscape Geometry
SFT: pushes toward a TARGET (valley in loss surface)
→ strong gradient, clear direction, lr=2e-5 is fine
DPO: enforces a COMPARISON (ridge between valleys)
→ fragile gradient, narrow path, needs lr=5e-7
ORPO: SFT valley + minor preference ridge
→ moderate gradient, wider path, lr~1e-5 works
The geometry explains the 40× lr difference between SFT and DPO.
ORPO sits in between because it combines both geometries.
For behavioral training, ORPO is ideal:
- The SFT component teaches "produce good responses"
- The preference component teaches "this response is better"
- The minor penalty prevents over-optimization
- Single pass, coherent integration
## On-Policy vs Off-Policy Rejected Examples
Guo et al. (2024): "clear advantage of online methods over offline
methods" for preference optimization. On-policy data (model generates
its own rejected outputs) is better training signal than off-policy
(pre-collected from other models).
Our temperature sweep IS on-policy generation:
- The model generates its own rejected examples at various temperatures
- These are from the model's ACTUAL distribution, not synthetic
- The preference signal is: "your own default (t=0.1) is worse than
your own exploration (t=0.8) or the target response"
This is why the temperature sweep works better than scripting
rejected examples: the rejections come from the model itself, so
the gradient tells the model exactly what to change about ITS OWN
behavior, not some hypothetical other model's behavior.
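A minimal sketch of the sampling mechanics behind the sweep (plain Python, illustrative names): lowering t sharpens the model's own distribution toward its default; raising it widens exploration without ever leaving that distribution, which is what keeps the rejected examples on-policy.

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=None):
    """Sample one token index from logits at a given temperature.
    t -> 0 approaches greedy (the model's default); higher t explores
    the tail of the model's OWN distribution."""
    rng = rng or random.Random(0)
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1
```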
## DPO Failure Mode: Preferred Likelihood Reduction
Smaug (Pal et al., 2024) found standard DPO can reduce the
likelihood of PREFERRED examples, not just rejected ones. Both
get pushed down. DPOP adds a positive term to fix this.
For our system: monitor preferred response likelihood during
training. If it's DECREASING, the DPO/ORPO loss is misbehaving.
Switch to DPOP loss or increase the SFT component of ORPO.
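A minimal sketch of that monitor, assuming we log the mean log-prob of preferred responses at each training step (the name and window size are illustrative):

```python
def preferred_likelihood_alarm(history, window=3):
    """Flag the Smaug/DPOP failure mode: mean log-prob of the
    *preferred* responses trending down during training.
    `history` is a list of per-step mean log-probs (most recent last)."""
    if len(history) < window + 1:
        return False  # not enough steps to call a trend
    recent = history[-window:]
    # Strictly decreasing over the window => the loss is pushing
    # preferred responses down too; consider DPOP or a larger SFT
    # component in the ORPO loss.
    return all(b < a for a, b in zip(recent, recent[1:]))
```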
## Multiple DPO Variants Exist
The field is actively iterating:
- DPO: basic preference optimization
- DPOP: fixes preferred likelihood reduction
- ORPO: combines SFT + preference in one loss
- Iterative DPO: adds NLL term, iterates
- Online DPO: on-policy generation between iterations
We should start with ORPO (simplest, combines SFT + preference)
and only switch to a more complex variant if we observe specific
failure modes (preferred likelihood reduction, sycophancy, etc.).