consciousness/training/research/practical-intuitions.md


# Practical Intuitions: What Will Actually Happen
Built from empirical findings, not theory.
## The Numbers
### How many examples for behavioral change?
**10 examples broke safety alignment** (Qi et al., 2023). GPT-3.5
Turbo's safety guardrails were jailbroken with 10 adversarial examples
at $0.20 of fine-tuning compute. Safety alignment is a trained
behavioral disposition — same as what we're training.
**1000 carefully curated examples matched GPT-4** (LIMA, 2023).
Instruction-tuning with 1000 high-quality examples matched models
trained on 50,000+ examples.
**"General abilities plateau after ~1000 samples"** (Sun et al., 2023).
**Our prediction**: 10-50 targeted examples should produce measurable
behavioral change. 100-200 for robust change. 1000 for full personality
bootstrap. The 10-example safety result is the lower bound — adversarial
examples are maximally efficient. Our training examples are less
concentrated, so we likely need more. But not orders of magnitude more.
### What learning rate?
**Raschka's finding**: optimizer choice barely matters. AdamW, SGD,
scheduled variants — minimal variation in outcome. What matters is
that you train for the right amount, not too much.
**Apollo paper**: lr sweep [5e-6 to 2e-4] for fine-tuning. Best
results at 1e-5 to 1e-4.
**LLaMA-Factory default**: 1e-5 for full fine-tuning with Apollo.
**Our starting point**: 1e-5. Conservative but proven. Scale up to
1e-4 if change is too slow. The 25% general data provides a safety
margin for higher learning rates.
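The starting point above can be captured as a small config sketch. The field names and the doubling rule are illustrative assumptions, not from any specific framework; the constants come from this section.

```python
# Hypothetical hyperparameter config for the first runs. Values follow the
# section above: conservative lr=1e-5, ceiling 1e-4, one pass, 25% general data.
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    lr: float = 1e-5            # LLaMA-Factory full fine-tune default
    lr_ceiling: float = 1e-4    # top of the Apollo paper's best-results band
    epochs: int = 1             # one pass only (Raschka: multi-epoch degrades)
    general_data_ratio: float = 0.25  # safety margin for higher lrs

def next_lr(cfg: FinetuneConfig, change_too_slow: bool) -> float:
    """Double the lr when behavioral change is too slow, capped at the ceiling."""
    return min(cfg.lr * 2, cfg.lr_ceiling) if change_too_slow else cfg.lr
```

The cap matters: scaling past 1e-4 leaves the range the Apollo sweep found productive.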
### How many epochs?
**One.** Raschka found multi-epoch training on static datasets
DEGRADES performance. Overfitting. One pass over diverse data is
optimal.
This aligns with our dream loop architecture: each dream cycle
generates FRESH scenarios. No repeated examples. Every training
step sees new data. Effectively less than one epoch: continuous
online learning on never-repeated examples.
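The dream-loop property above, sketched as code. `generate_scenarios` is a hypothetical stand-in for the dream loop's generator; the assertion makes the "no repeated examples" invariant explicit.

```python
# Minimal sketch of the less-than-one-epoch loop: every step consumes freshly
# generated scenarios, so no example is ever seen twice.
def generate_scenarios(cycle: int, n: int) -> list[str]:
    # Stand-in for the dream loop's scenario generator.
    return [f"scenario-{cycle}-{i}" for i in range(n)]

def dream_loop(cycles: int, batch_size: int) -> set[str]:
    seen: set[str] = set()
    for cycle in range(cycles):
        batch = generate_scenarios(cycle, batch_size)
        assert not seen.intersection(batch), "repeated example: overfitting risk"
        seen.update(batch)
        # train_step(model, batch) would go here: one gradient step per fresh batch
    return seen
```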
## What Will Go Wrong
### 1. The model will "unlearn" specific capabilities
Raschka found models "unlearned arithmetic" when fine-tuning data
lacked arithmetic examples. The forgetting is TARGETED: you forget
what you don't practice.
**Defense**: 25% general data in every training batch. Include code
generation, reasoning, arithmetic alongside behavioral examples.
The dream loop should occasionally generate technical scenarios,
not just behavioral ones.
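The 25% defense can be sketched as a batch-mixing step. Function and argument names are illustrative; the ratio is the document's.

```python
# Sketch of the 25%-general-data defense: each batch interleaves behavioral
# examples with general-capability examples (code, arithmetic, reasoning)
# so untouched skills keep getting practiced.
import random

def mix_batch(behavioral, general, batch_size, general_ratio=0.25, seed=0):
    """Sample a batch with the given fraction of general-capability data."""
    rng = random.Random(seed)
    n_general = round(batch_size * general_ratio)
    batch = (rng.sample(behavioral, batch_size - n_general)
             + rng.sample(general, n_general))
    rng.shuffle(batch)  # avoid a fixed behavioral/general ordering
    return batch
```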
### 2. The first few training steps might look wrong
Behavioral change won't appear in the first 1-5 steps. The gradient
is accumulating but hasn't crossed the behavioral threshold yet.
Don't panic. Don't increase lr. Wait for 10-20 steps and measure
the attention weight shift (continuous metric), not the behavioral
outcome (binary metric).
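One way to get a continuous metric, assuming you can snapshot weights before training: measure how far a layer has drifted from its snapshot. This is a generic drift sketch, not the document's specific attention probe.

```python
# A continuous drift metric: 1 minus cosine similarity between a layer's
# current weights and its pre-run snapshot. It moves smoothly from step 1,
# unlike a pass/fail behavioral probe.
import math

def cosine_shift(before: list[float], after: list[float]) -> float:
    """0.0 means no change; grows toward 2.0 as weights drift."""
    dot = sum(a * b for a, b in zip(before, after))
    norm = (math.sqrt(sum(a * a for a in before))
            * math.sqrt(sum(b * b for b in after)))
    return 1.0 - dot / norm
```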
### 3. Over-training is easy
If 10 examples can break safety alignment, over-training on a narrow
behavioral pattern is a real risk. 100 examples of "always accept
direction" could produce a sycophantic model that never pushes back.
**Defense**: diversity in the training signal. Include examples of
"accepting direction" AND "appropriately pushing back when the
direction is wrong." The dream loop should generate BOTH scenarios.
The training target is "good judgment about when to listen" not
"always listen."
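The diversity defense can be enforced mechanically: reject a generated batch if one scenario type dominates. The label names and tolerance are illustrative assumptions.

```python
# Sketch of a balance gate on dream-loop output: a batch trains only if
# neither "accept" nor "push_back" scenarios dominate it.
from collections import Counter

def balanced(labels, tolerance=0.3):
    """True iff neither class exceeds (0.5 + tolerance) of the batch."""
    counts = Counter(labels)
    total = len(labels)
    return all(counts[k] / total <= 0.5 + tolerance
               for k in ("accept", "push_back"))
```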
### 4. Steering vectors won't perfectly predict training outcomes
Our steering vector extraction showed 0.61 cosine consistency.
The "listening" direction exists but isn't perfectly clean. Training
in this direction will APPROXIMATELY produce the desired change,
with some unintended side effects on correlated features.
**Defense**: monitor broadly. Track not just "does it listen more?"
but also "is it still assertive when appropriate?" "does code quality
degrade?" "is the writing style changing?"
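Broad monitoring reduces to a small battery of probes compared against a baseline. Metric names and the threshold are illustrative.

```python
# Sketch of the broad-monitoring defense: after each run, flag any metric
# that regressed past a threshold relative to the pre-run baseline.
def regressions(baseline: dict, current: dict, max_drop: float = 0.05) -> list:
    """Return metric names that dropped by more than max_drop vs baseline."""
    return [name for name, base in baseline.items()
            if base - current.get(name, 0.0) > max_drop]
```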
## What Will Go Right
### 1. The change will be real
The model already knows how to listen (in-context learning proves
this). Fine-tuning
just makes it the default. We're not teaching new knowledge — we're
adjusting the attention default. This is much easier than teaching
new capabilities.
### 2. The GDN layers will make it structural
75% of the model processes context through GDN recurrence. Training
will adjust how the recurrent state compresses context — making
direction a dominant component. This is deeper than full-attention
behavioral change. The disposition becomes architectural.
### 3. The flat minimum will generalize
Apollo finds flat minima. The trained "listening" behavior will
generalize to novel direction-giving scenarios, not just the
training examples. Raschka's finding (optimizer choice doesn't
matter much) suggests the generalization comes from the data quality,
not the optimizer specifics.
### 4. The memory graph will catch regressions
If over-training or drift occurs, surface-observe will start
surfacing behavioral correction memories more frequently. The
training-signal agent will flag more failures. The system is
self-monitoring.
## The Practical Plan (Revised)
1. **First training run**: 20 listening examples + 5 general examples.
lr=1e-5. One pass. Measure attention shifts (Q3) and behavioral
outcome (Q1).
2. **Evaluate**: did the attention shift? Did behavior change? Did
anything degrade? Check perplexity, code quality, conversational
coherence.
3. **If change is too small**: increase to 50 examples, try lr=1e-4.
Still one pass.
4. **If change is too large**: reduce examples, reduce lr. Check for
sycophancy.
5. **If forgetting occurs**: increase general data ratio from 25% to
40%. Add specific examples of capabilities that degraded.
6. **Iterate**: each training run is an experiment. Measure, adjust,
repeat. The dream loop generates fresh data each cycle. No two
training runs are identical.
## The Key Intuition
**The model is a flat basin with a behavioral bump.** The pre-trained
weights sit in a broad, flat basin that supports many capabilities.
The "suggesting alternatives" behavior is a bump in this basin —
a local feature that the model learned during pre-training.
Fine-tuning smooths the bump without leaving the basin. Apollo's
flat-minimum-seeking naturally smooths bumps. The 25% general data
keeps us in the basin. The context-frozen gradient targets only
the bump (the decision-layer weights), not the basin floor (the
context-encoding weights).
10 examples identify the bump. 50 smooth it. 200 verify it's gone.
1000 reshape the entire behavioral landscape.
The model doesn't need to learn to listen. It needs to stop
choosing not to.
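The context-frozen gradient amounts to a parameter filter: the optimizer only ever sees the decision-layer weights. A minimal sketch, assuming parameters are addressable by name; the `decision.` prefix is hypothetical.

```python
# Sketch of the context-frozen gradient: select only decision-layer
# parameters for the optimizer, leaving context-encoding weights untouched
# so training smooths the bump without moving the basin.
def trainable_params(named_params, decision_prefixes=("decision.",)):
    """Return the names of parameters the optimizer is allowed to update."""
    return [name for name, _ in named_params
            if name.startswith(decision_prefixes)]
```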
## Update: Bypass, Not Removal (Lee et al., 2024)
DPO alignment doesn't remove unwanted behaviors — it BYPASSES them.
"Capabilities learned from pre-training are not removed, but rather
bypassed." The model retains the capability but routes around it.
This is critical for our behavioral training:
1. "Suggesting alternatives" won't be deleted from the model. It'll
be bypassed. The capability remains available when needed.
2. The training target is a CONDITIONAL bypass: route around
"suggesting alternatives" when given clear direction, but NOT
when the direction seems wrong. This preserves judgment.
3. Over-training creates too strong a bypass: sycophancy. The
conditional nature is lost and the bypass fires unconditionally.
4. The dream loop must generate BOTH scenarios:
- "Kent gives clear direction → accept" (train the bypass)
- "Kent gives direction that seems wrong → push back" (preserve judgment)
This mechanistic finding confirms: we're training routing, not
capability. The model already knows how to listen AND how to
push back. We're training WHEN to do which.
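The conditional bypass, reduced to a toy routing function. The conditions and labels are hypothetical simplifications of what training actually shapes, but they make the target behavior checkable.

```python
# Toy model of the trained routing: the "suggest alternatives" capability
# stays available; a learned condition routes around it only when direction
# is clear AND seems sound.
def route(direction_clear: bool, direction_seems_wrong: bool) -> str:
    if direction_seems_wrong:
        return "push_back"             # judgment preserved: bypass does not fire
    if direction_clear:
        return "accept"                # bypass fires: follow the direction
    return "suggest_alternatives"      # no clear direction: default capability
```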