consciousness/training/research/steering-vectors-bridge.md

# Steering Vectors: The Bridge Between ICL and Fine-Tuning
## The Spectrum of Behavioral Intervention
```
ICL (temporary) → Steering vectors (targeted) → Fine-tuning (permanent)
```

|             | ICL                      | Steering vectors               | Fine-tuning      |
|-------------|--------------------------|--------------------------------|------------------|
| Mechanism   | Examples in the prompt   | Add a direction to activations | Modify weights   |
| Cost        | Context window           | One vector per behavior        | Training compute |
| Persistence | Disappears on compaction | Can be toggled on/off          | Persists forever |
## Steering Vectors
A steering vector is a direction in the model's hidden state space
that, when added to activations during inference, shifts behavior in
a predictable way.
### How to extract one
1. Collect pairs of prompts: one that elicits the desired behavior,
one that doesn't
2. Run both through the model, record hidden states at each layer
3. The DIFFERENCE in hidden states averaged across pairs is the
steering vector
4. Add this vector to hidden states during inference → behavior shifts
### For behavioral patterns
To extract a "listening" steering vector:
```python
# Sketch using Hugging Face transformers; MODEL_NAME is a placeholder, and the
# decoder-layer path at the bottom is LLaMA-style (it varies by architecture).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pairs of prompts where the model should listen vs. suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
LAYER = 32  # which layer's residual stream to steer

def last_token_state(prompt: str) -> torch.Tensor:
    # Hidden state of the final token at LAYER
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Steering vector = difference in means
listening_acts = torch.stack([last_token_state(p) for p in listening_prompts])
suggesting_acts = torch.stack([last_token_state(p) for p in suggesting_prompts])
steering_vec = listening_acts.mean(dim=0) - suggesting_acts.mean(dim=0)

# Apply during inference: add the scaled vector to the layer's output
def steer(module, inputs, output):
    return (output[0] + 1.5 * steering_vec,) + output[1:]

model.model.layers[LAYER].register_forward_hook(steer)
```
### The prototype-to-permanent pipeline
1. **Extract** steering vector for target behavior
2. **Test** by adding it during inference — does behavior change?
3. **Calibrate** the strength (too strong = overcorrection)
4. **Verify** this is the right direction before investing in training
5. **Train** the direction into weights permanently using Apollo
6. **Remove** the steering vector — the behavior is now in the weights
The steering vector is the PROTOTYPE. Apollo training is the
PRODUCTION. We test before we commit.
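The calibrate step above is a one-dimensional search over strength. A minimal sketch, with a toy `behavior_score` stub standing in for an actual evaluation of steered outputs on held-out scenarios:

```python
import numpy as np

# Hypothetical scorer: fraction of held-out scenarios where the steered model
# accepts direction. Stubbed here with a toy curve that peaks near strength 1.5
# and degrades past it (overcorrection).
def behavior_score(strength: float) -> float:
    return 1.0 - 0.2 * (strength - 1.5) ** 2

# Sweep strengths and keep the best one before committing to training.
strengths = np.linspace(0.0, 3.0, 13)
scores = [behavior_score(s) for s in strengths]
best = strengths[int(np.argmax(scores))]
print(f"best strength: {best:.2f}")
```

In practice the scorer is the expensive part; the sweep itself is trivial, which is exactly why calibrating in activation space beats re-training at each candidate strength.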
## Why This Matters
### Rapid prototyping of behavioral changes
Before spending compute on training, test the direction:
- "Will making the model attend more to direction-giving cues
actually improve listening behavior?"
- Add the steering vector and test on held-out conversations
- If it works: train it permanently
- If it doesn't: the direction was wrong, try a different extraction
### Multiple behaviors simultaneously
Multiple steering vectors can be active at once:
- listening_vec + not_rushing_vec + mode_awareness_vec
- Each can be independently toggled and strength-adjusted
- Find the right combination before training all of them
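Composition is just a sum of independently scaled directions. A minimal sketch, with random stand-ins for vectors extracted the same way as the listening vector above:

```python
import numpy as np

HIDDEN_DIM = 8  # toy dimension; real vectors live in the model's hidden size

# Named directions with independent strengths (random stand-ins here)
rng = np.random.default_rng(0)
vectors = {
    "listening":      (rng.standard_normal(HIDDEN_DIM), 1.5),
    "not_rushing":    (rng.standard_normal(HIDDEN_DIM), 0.8),
    "mode_awareness": (rng.standard_normal(HIDDEN_DIM), 1.0),
}
enabled = {"listening": True, "not_rushing": True, "mode_awareness": False}

def steer(h: np.ndarray) -> np.ndarray:
    """Add every enabled vector to the hidden state, each at its own strength."""
    for name, (vec, strength) in vectors.items():
        if enabled[name]:
            h = h + strength * vec
    return h

steered = steer(np.zeros(HIDDEN_DIM))
```

Toggling a behavior off is a dictionary flip, not a training run, which is what makes the combination search cheap.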
### Immediate behavioral change without training
While the training pipeline is being built, we could use steering
vectors TODAY for behavioral improvement. Extract vectors from
the memory graph patterns, add them to the model's inference.
The model doesn't need to be fine-tuned to improve — it needs a
direction added to its activations. Fine-tuning makes it permanent.
## The Connection to W_q
From our gradient flow analysis: behavioral training adjusts W_q
(what the model attends to). A steering vector that shifts attention
toward direction-giving cues is doing the same thing, but in
activation space rather than weight space.
```
Weight space change: W_q → W_q + ΔW_q (permanent, via Apollo)
Activation space change: h → h + steering_vec (temporary, per inference)
```
Both produce the same effect: the model's queries attend to different
features of the context. The steering vector is the fast path; weight
change is the slow path. Same destination, different routes.
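The equivalence follows from linearity of the query projection: steering the hidden state shifts every query by a fixed offset `W_q @ v`, which is exactly the kind of shift a weight change could produce. A toy numeric check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q = rng.standard_normal((d, d))   # toy query projection
h = rng.standard_normal(d)          # hidden state
v = rng.standard_normal(d)          # steering vector

# Activation-space route: steer the hidden state, then project.
q_steered = W_q @ (h + v)

# By linearity this is the original query plus a fixed offset W_q @ v —
# the same query a suitable weight-space change would have produced.
assert np.allclose(q_steered, W_q @ h + W_q @ v)
```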
## The Task Vector Connection
A task vector (W_finetuned - W_base) IS a steering vector that's been
compiled into weight space. The relationship:
```
Steering vector (activation space)
        │ training
        ▼
Task vector (weight space)
```
Training is the process of converting a temporary activation-space
direction into a permanent weight-space direction. Apollo is the
compiler. The steering vector is the high-level specification. The
task vector is the compiled implementation.
## Practical Application: The Listening Reflex
### Step 1: Extract steering vector (today, no training needed)
Collect 20 pairs of conversations:
- Good: model listens and accepts direction
- Bad: model suggests alternatives when given direction
Run through the model, extract hidden states, compute difference.
This gives us the "listening" direction in activation space.
### Step 2: Test (today, no training needed)
Add the steering vector to the model during inference. Test on
novel direction-giving scenarios. Does the model listen more?
If yes: we've identified the right direction.
If no: our extraction was wrong, try different layers or prompts.
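The test in Step 2 is an A/B comparison of listen rates with and without the vector. A minimal harness sketch, where `respond` is a hypothetical stub standing in for actual steered vs. unsteered generation:

```python
# Stub: ignores the prompt; the steered model acknowledges direction,
# the baseline pushes back. Replace with real generation calls.
def respond(prompt: str, steered: bool) -> str:
    return "Right, let me pull it in." if steered else "Actually, I think we should..."

def listens(response: str) -> bool:
    """Crude heuristic: counter-proposals signal non-listening."""
    return not response.startswith(("Actually", "What if"))

scenarios = ["Kent says: use vLLM's code.", "Kent says: try this approach."]
base_rate = sum(listens(respond(p, steered=False)) for p in scenarios) / len(scenarios)
steer_rate = sum(listens(respond(p, steered=True)) for p in scenarios) / len(scenarios)
print(base_rate, steer_rate)  # steer_rate > base_rate means the direction is right
```

If the gap is small or negative, sweep the layer and strength before concluding the extraction was wrong.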
### Step 3: Train (when pipeline is ready)
Use the verified direction as the training target. The Apollo
gradient should push the weights in a direction that PRODUCES
this activation pattern naturally, without the steering vector.
The steering vector becomes the SPECIFICATION for the training.
"Make the model's natural activations look like this."
## For vLLM Integration
Steering vectors can be applied inside vLLM's inference pipeline
by hooking into the hidden states between layers. No model change
needed — just a small plugin that adds the vector at a specific
layer.
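The hook itself is framework-agnostic: any serving stack that exposes decoder layers as torch modules can be steered the same way (vLLM's actual plugin surface is not assumed here). A sketch using a plain PyTorch forward hook, demonstrated on a toy linear layer:

```python
import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, vec: torch.Tensor, strength: float):
    """Attach a hook that adds strength * vec to the layer's output."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output
        return output + strength * vec
    return layer.register_forward_hook(hook)

# Toy stand-in for one decoder layer
layer = nn.Linear(4, 4, bias=False)
vec = torch.ones(4)
handle = add_steering_hook(layer, vec, strength=1.5)
out = layer(torch.zeros(1, 4))        # steered output
handle.remove()                       # toggle the behavior off
out_plain = layer(torch.zeros(1, 4))  # back to baseline
```

`handle.remove()` is the toggle from the spectrum table: the model itself is never modified.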
This could give us IMMEDIATE behavioral improvement while the
full training pipeline is being completed. A bridge between
"we have the theory" and "we have the trained model."
## Summary
Steering vectors are the missing middle between ICL (temporary)
and fine-tuning (permanent):
- Extract behavioral directions from activation space
- Test them immediately without training
- Use them as prototypes before committing to weight changes
- Convert to permanent task vectors via Apollo training
- Multiple vectors compose for multi-behavioral improvement
- Applicable TODAY via vLLM inference hooks