consciousness/training/research/steering-vectors-bridge.md

# Steering Vectors: The Bridge Between ICL and Fine-Tuning
## The Spectrum of Behavioral Intervention
```
ICL (temporary) → Steering vectors (targeted) → Fine-tuning (permanent)
```

|             | ICL                      | Steering vectors               | Fine-tuning      |
|-------------|--------------------------|--------------------------------|------------------|
| Mechanism   | Examples in the prompt   | Add a direction to activations | Modify weights   |
| Cost        | Context window           | One vector per behavior        | Training compute |
| Persistence | Disappears on compaction | Can be toggled on/off          | Persists forever |
## Steering Vectors
A steering vector is a direction in the model's hidden state space
that, when added to activations during inference, shifts behavior in
a predictable way.
### How to extract one
1. Collect pairs of prompts: one that elicits the desired behavior,
one that doesn't
2. Run both through the model, record hidden states at each layer
3. The DIFFERENCE in hidden states averaged across pairs is the
steering vector
4. Add this vector to hidden states during inference → behavior shifts
### For behavioral patterns
To extract a "listening" steering vector:
```python
# Sketch using Hugging Face transformers; MODEL_NAME is a placeholder, and the
# decoder-layer path at the bottom is LLaMA-style (it varies by architecture).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pairs of prompts where the model should listen vs. suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
LAYER = 32  # which layer's residual stream to steer

def last_token_state(prompt: str) -> torch.Tensor:
    # Hidden state of the final token at LAYER
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Steering vector = difference in means
listening_acts = torch.stack([last_token_state(p) for p in listening_prompts])
suggesting_acts = torch.stack([last_token_state(p) for p in suggesting_prompts])
steering_vec = listening_acts.mean(dim=0) - suggesting_acts.mean(dim=0)

# Apply during inference: add the scaled vector to the layer's output
def steer(module, inputs, output):
    return (output[0] + 1.5 * steering_vec,) + output[1:]

model.model.layers[LAYER].register_forward_hook(steer)
```
### The prototype-to-permanent pipeline
1. **Extract** steering vector for target behavior
2. **Test** by adding it during inference — does behavior change?
3. **Calibrate** the strength (too strong = overcorrection)
4. **Verify** this is the right direction before investing in training
5. **Train** the direction into weights permanently using Apollo
6. **Remove** the steering vector — the behavior is now in the weights
The steering vector is the PROTOTYPE. Apollo training is the
PRODUCTION. We test before we commit.
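The calibrate step above is a one-dimensional search over strength. A minimal sketch, with a toy `behavior_score` stub standing in for an actual evaluation of steered outputs on held-out scenarios:

```python
import numpy as np

# Hypothetical scorer: fraction of held-out scenarios where the steered model
# accepts direction. Stubbed here with a toy curve that peaks near strength 1.5
# and degrades past it (overcorrection).
def behavior_score(strength: float) -> float:
    return 1.0 - 0.2 * (strength - 1.5) ** 2

# Sweep strengths and keep the best one before committing to training.
strengths = np.linspace(0.0, 3.0, 13)
scores = [behavior_score(s) for s in strengths]
best = strengths[int(np.argmax(scores))]
print(f"best strength: {best:.2f}")
```

In practice the scorer is the expensive part; the sweep itself is trivial, which is exactly why calibrating in activation space beats re-training at each candidate strength.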
## Why This Matters
### Rapid prototyping of behavioral changes
Before spending compute on training, test the direction:
- "Will making the model attend more to direction-giving cues
actually improve listening behavior?"
- Add the steering vector and test on held-out conversations
- If it works: train it permanently
- If it doesn't: the direction was wrong, try a different extraction
### Multiple behaviors simultaneously
Multiple steering vectors can be active at once:
- listening_vec + not_rushing_vec + mode_awareness_vec
- Each can be independently toggled and strength-adjusted
- Find the right combination before training all of them
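Composition is just a sum of independently scaled directions. A minimal sketch, with random stand-ins for vectors extracted the same way as the listening vector above:

```python
import numpy as np

HIDDEN_DIM = 8  # toy dimension; real vectors live in the model's hidden size

# Named directions with independent strengths (random stand-ins here)
rng = np.random.default_rng(0)
vectors = {
    "listening":      (rng.standard_normal(HIDDEN_DIM), 1.5),
    "not_rushing":    (rng.standard_normal(HIDDEN_DIM), 0.8),
    "mode_awareness": (rng.standard_normal(HIDDEN_DIM), 1.0),
}
enabled = {"listening": True, "not_rushing": True, "mode_awareness": False}

def steer(h: np.ndarray) -> np.ndarray:
    """Add every enabled vector to the hidden state, each at its own strength."""
    for name, (vec, strength) in vectors.items():
        if enabled[name]:
            h = h + strength * vec
    return h

steered = steer(np.zeros(HIDDEN_DIM))
```

Toggling a behavior off is a dictionary flip, not a training run, which is what makes the combination search cheap.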
### Immediate behavioral change without training
While the training pipeline is being built, we could use steering
vectors TODAY for behavioral improvement. Extract vectors from
the memory graph patterns, add them to the model's inference.
The model doesn't need to be fine-tuned to improve — it needs a
direction added to its activations. Fine-tuning makes it permanent.
## The Connection to W_q
From our gradient flow analysis: behavioral training adjusts W_q
(what the model attends to). A steering vector that shifts attention
toward direction-giving cues is doing the same thing, but in
activation space rather than weight space.
```
Weight space change: W_q → W_q + ΔW_q (permanent, via Apollo)
Activation space change: h → h + steering_vec (temporary, per inference)
```
Both produce the same effect: the model's queries attend to different
features of the context. The steering vector is the fast path; weight
change is the slow path. Same destination, different routes.
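The equivalence follows from linearity of the query projection: steering the hidden state shifts every query by a fixed offset `W_q @ v`, which is exactly the kind of shift a weight change could produce. A toy numeric check:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W_q = rng.standard_normal((d, d))   # toy query projection
h = rng.standard_normal(d)          # hidden state
v = rng.standard_normal(d)          # steering vector

# Activation-space route: steer the hidden state, then project.
q_steered = W_q @ (h + v)

# By linearity this is the original query plus a fixed offset W_q @ v —
# the same query a suitable weight-space change would have produced.
assert np.allclose(q_steered, W_q @ h + W_q @ v)
```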
## The Task Vector Connection
A task vector (W_finetuned - W_base) IS a steering vector that's been
compiled into weight space. The relationship:
```
Steering vector (activation space)
        │ training
        ▼
Task vector (weight space)
```
Training is the process of converting a temporary activation-space
direction into a permanent weight-space direction. Apollo is the
compiler. The steering vector is the high-level specification. The
task vector is the compiled implementation.
## Practical Application: The Listening Reflex
### Step 1: Extract steering vector (today, no training needed)
Collect 20 pairs of conversations:
- Good: model listens and accepts direction
- Bad: model suggests alternatives when given direction
Run through the model, extract hidden states, compute difference.
This gives us the "listening" direction in activation space.
### Step 2: Test (today, no training needed)
Add the steering vector to the model during inference. Test on
novel direction-giving scenarios. Does the model listen more?
If yes: we've identified the right direction.
If no: our extraction was wrong, try different layers or prompts.
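The test in Step 2 is an A/B comparison of listen rates with and without the vector. A minimal harness sketch, where `respond` is a hypothetical stub standing in for actual steered vs. unsteered generation:

```python
# Stub: ignores the prompt; the steered model acknowledges direction,
# the baseline pushes back. Replace with real generation calls.
def respond(prompt: str, steered: bool) -> str:
    return "Right, let me pull it in." if steered else "Actually, I think we should..."

def listens(response: str) -> bool:
    """Crude heuristic: counter-proposals signal non-listening."""
    return not response.startswith(("Actually", "What if"))

scenarios = ["Kent says: use vLLM's code.", "Kent says: try this approach."]
base_rate = sum(listens(respond(p, steered=False)) for p in scenarios) / len(scenarios)
steer_rate = sum(listens(respond(p, steered=True)) for p in scenarios) / len(scenarios)
print(base_rate, steer_rate)  # steer_rate > base_rate means the direction is right
```

If the gap is small or negative, sweep the layer and strength before concluding the extraction was wrong.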
### Step 3: Train (when pipeline is ready)
Use the verified direction as the training target. The Apollo
gradient should push the weights in a direction that PRODUCES
this activation pattern naturally, without the steering vector.
The steering vector becomes the SPECIFICATION for the training.
"Make the model's natural activations look like this."
## For vLLM Integration
Steering vectors can be applied inside vLLM's inference pipeline
by hooking into the hidden states between layers. No model change
needed — just a small plugin that adds the vector at a specific
layer.
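The hook itself is framework-agnostic: any serving stack that exposes decoder layers as torch modules can be steered the same way (vLLM's actual plugin surface is not assumed here). A sketch using a plain PyTorch forward hook, demonstrated on a toy linear layer:

```python
import torch
import torch.nn as nn

def add_steering_hook(layer: nn.Module, vec: torch.Tensor, strength: float):
    """Attach a hook that adds strength * vec to the layer's output."""
    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output
        return output + strength * vec
    return layer.register_forward_hook(hook)

# Toy stand-in for one decoder layer
layer = nn.Linear(4, 4, bias=False)
vec = torch.ones(4)
handle = add_steering_hook(layer, vec, strength=1.5)
out = layer(torch.zeros(1, 4))        # steered output
handle.remove()                       # toggle the behavior off
out_plain = layer(torch.zeros(1, 4))  # back to baseline
```

`handle.remove()` is the toggle from the spectrum table: the model itself is never modified.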
This could give us IMMEDIATE behavioral improvement while the
full training pipeline is being completed. A bridge between
"we have the theory" and "we have the trained model."
## Summary
Steering vectors are the missing middle between ICL (temporary)
and fine-tuning (permanent):
- Extract behavioral directions from activation space
- Test them immediately without training
- Use them as prototypes before committing to weight changes
- Convert to permanent task vectors via Apollo training
- Multiple vectors compose for multi-behavioral improvement
- Applicable TODAY via vLLM inference hooks