research: steering vectors — prototype behavioral changes before training
The missing middle between ICL (temporary) and fine-tuning (permanent). Extract behavioral directions from activation space, test immediately without training, convert to permanent weight changes via Apollo. Key application: extract 'listening' steering vector TODAY, test it in vLLM, verify the direction is right BEFORE spending training compute. The steering vector is the prototype; Apollo training is production. Test before you commit. Applicable immediately via vLLM inference hooks — behavioral improvement without waiting for the full training pipeline.
This commit is contained in:
parent ccca41849d
commit 8061cc0477

1 changed file with 172 additions and 0 deletions

training/research/steering-vectors-bridge.md (new file)

# Steering Vectors: The Bridge Between ICL and Fine-Tuning

## The Spectrum of Behavioral Intervention

```
ICL (temporary)           →  Steering vectors (targeted)     →  Fine-tuning (permanent)
5 examples in prompt         Add direction to activations       Modify weights
Costs context window         Costs one vector per behavior      Costs training compute
Disappears on compaction     Can be toggled on/off              Persists forever
```

## Steering Vectors

A steering vector is a direction in the model's hidden state space that, when added to activations during inference, shifts behavior in a predictable way.

### How to extract one

1. Collect pairs of prompts: one that elicits the desired behavior, one that doesn't.
2. Run both through the model and record hidden states at each layer.
3. The DIFFERENCE in hidden states, averaged across pairs, is the steering vector.
4. Add this vector to hidden states during inference → behavior shifts.

### For behavioral patterns

To extract a "listening" steering vector:

```python
import numpy as np

# Pairs of prompts where the model should listen vs suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

# Extract activations at the target layer
# (get_hidden_states / add_steering_vector sketch whatever
# activation-capture API the serving stack exposes)
listening_acts = [model.get_hidden_states(p) for p in listening_prompts]
suggesting_acts = [model.get_hidden_states(p) for p in suggesting_prompts]

# Steering vector = difference in mean activations across the pairs
steering_vec = np.mean(listening_acts, axis=0) - np.mean(suggesting_acts, axis=0)

# Apply during inference
model.add_steering_vector(steering_vec, layer=32, strength=1.5)
```

### The prototype-to-permanent pipeline

1. **Extract** a steering vector for the target behavior
2. **Test** by adding it during inference: does behavior change?
3. **Calibrate** the strength (too strong = overcorrection)
4. **Verify** this is the right direction before investing in training
5. **Train** the direction into weights permanently using Apollo
6. **Remove** the steering vector: the behavior is now in the weights

The steering vector is the PROTOTYPE. Apollo training is the PRODUCTION. We test before we commit.

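The calibration step can be sketched as a simple strength sweep. This is a minimal illustration, assuming some `apply_and_score(strength)` evaluation harness (hypothetical here, stood in for by a toy scoring curve) that applies the vector at a given strength and scores listening behavior on held-out prompts:

```python
def calibrate_strength(apply_and_score, strengths):
    """Try each steering strength and return the best one.

    apply_and_score(strength) -> behavioral score on held-out prompts.
    Too-strong steering overcorrects, so the score should peak
    somewhere in the middle rather than grow monotonically.
    """
    scores = {s: apply_and_score(s) for s in strengths}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for the real evaluation: score peaks at strength 1.5
toy_score = lambda s: -(s - 1.5) ** 2
best, scores = calibrate_strength(toy_score, [0.5, 1.0, 1.5, 2.0, 3.0])
print(best)  # → 1.5
```

The dictionary of scores is worth keeping around: a flat curve suggests the extracted direction isn't doing anything, while a sharp peak tells you how much headroom there is before overcorrection.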
## Why This Matters

### Rapid prototyping of behavioral changes

Before spending compute on training, test the direction:

- "Will making the model attend more to direction-giving cues actually improve listening behavior?"
- Add the steering vector and test on held-out conversations
- If it works: train it permanently
- If it doesn't: the direction was wrong; try a different extraction

### Multiple behaviors simultaneously

Multiple steering vectors can be active at once:

- listening_vec + not_rushing_vec + mode_awareness_vec
- Each can be independently toggled and strength-adjusted
- Find the right combination before training all of them

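Composition is just a weighted sum in activation space. A minimal sketch, where the vector names match the bullets above but the random vectors and strengths are purely illustrative:

```python
import numpy as np

# Illustrative steering vectors in a 4096-dim hidden space
# (in practice these come from the extraction procedure above)
rng = np.random.default_rng(0)
listening_vec = rng.standard_normal(4096)
not_rushing_vec = rng.standard_normal(4096)
mode_awareness_vec = rng.standard_normal(4096)

# Each vector carries its own strength and can be toggled independently
# by adding or removing it from this dict
active = {
    "listening": (listening_vec, 1.5),
    "not_rushing": (not_rushing_vec, 0.8),
    "mode_awareness": (mode_awareness_vec, 1.0),
}

def steer(hidden_state, active_vectors):
    """Add every active steering vector, scaled by its strength."""
    for vec, strength in active_vectors.values():
        hidden_state = hidden_state + strength * vec
    return hidden_state

h = np.zeros(4096)
h_steered = steer(h, active)
```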
### Immediate behavioral change without training

While the training pipeline is being built, we could use steering vectors TODAY for behavioral improvement: extract vectors from the memory graph patterns and add them to the model's inference.

The model doesn't need to be fine-tuned to improve; it needs a direction added to its activations. Fine-tuning makes it permanent.

## The Connection to W_q

From our gradient flow analysis: behavioral training adjusts W_q (what the model attends to). A steering vector that shifts attention toward direction-giving cues is doing the same thing, but in activation space rather than weight space.

```
Weight space change:     W_q → W_q + ΔW_q        (permanent, via Apollo)
Activation space change: h   → h + steering_vec  (temporary, per inference)
```

Both produce the same effect: the model's queries attend to different features of the context. The steering vector is the fast path; the weight change is the slow path. Same destination, different routes.

## The Task Vector Connection

A task vector (W_finetuned - W_base) IS a steering vector that has been compiled into weight space. The relationship:

```
Steering vector (activation space)
        ↓ training ↓
Task vector (weight space)
```

Training is the process of converting a temporary activation-space direction into a permanent weight-space direction. Apollo is the compiler. The steering vector is the high-level specification. The task vector is the compiled implementation.

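The compile step is literally a subtraction over matched parameters. A toy sketch, with small random matrices standing in for full weight tensors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for a base and fine-tuned weight matrix
W_base = rng.standard_normal((4, 4))
delta = 0.01 * rng.standard_normal((4, 4))  # what training learned
W_finetuned = W_base + delta

# The task vector is the weight-space difference...
task_vec = W_finetuned - W_base

# ...so adding it back to the base reproduces the fine-tuned weights,
# and like any vector it can be scaled or summed with other task vectors
assert np.allclose(W_base + task_vec, W_finetuned)
```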
## Practical Application: The Listening Reflex

### Step 1: Extract steering vector (today, no training needed)

Collect 20 pairs of conversations:

- Good: model listens and accepts direction
- Bad: model suggests alternatives when given direction

Run them through the model, extract hidden states, and compute the difference. This gives us the "listening" direction in activation space.

### Step 2: Test (today, no training needed)

Add the steering vector to the model during inference. Test on novel direction-giving scenarios. Does the model listen more?

If yes: we've identified the right direction.
If no: our extraction was wrong; try different layers or prompts.

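The "does the model listen more?" check is a with/without comparison. A sketch, where the keyword-based `listening_rate` is a hypothetical stand-in for a real judge over held-out scenarios:

```python
def listening_rate(responses):
    """Fraction of responses that accept direction rather than counter-suggest.

    The marker check is a hypothetical stand-in for a real judge model.
    """
    accept_markers = ("Right,", "OK,", "Got it", "Starting on")
    return sum(
        any(marker in r for marker in accept_markers) for r in responses
    ) / len(responses)

# Hypothetical outputs from the same prompts, without vs with the vector
baseline = ["Actually, I think we should refactor first.", "OK, starting on that."]
steered = ["Right, let me pull it in.", "OK, starting on that."]

assert listening_rate(steered) > listening_rate(baseline)
```

If the steered rate isn't clearly higher than the baseline on scenarios the extraction never saw, that's the signal to go back and try different layers or prompt pairs before any training compute is spent.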
### Step 3: Train (when pipeline is ready)

Use the verified direction as the training target. The Apollo gradient should push the weights in a direction that PRODUCES this activation pattern naturally, without the steering vector.

The steering vector becomes the SPECIFICATION for the training: "Make the model's natural activations look like this."

## For vLLM Integration

Steering vectors can be applied inside vLLM's inference pipeline by hooking into the hidden states between layers. No model change needed; just a small plugin that adds the vector at a specific layer.

This could give us IMMEDIATE behavioral improvement while the full training pipeline is being completed. A bridge between "we have the theory" and "we have the trained model."

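The hook itself is tiny. A framework-agnostic sketch: the real plugin would register against vLLM's model runner (whose exact hook surface varies by version), so a toy layer stack stands in here:

```python
import numpy as np

class SteeringHook:
    """Adds strength * vec to the hidden state leaving one chosen layer."""

    def __init__(self, vec, layer, strength=1.0, enabled=True):
        self.vec = vec
        self.layer = layer
        self.strength = strength
        self.enabled = enabled  # toggling off restores stock behavior

    def __call__(self, layer_idx, hidden):
        if self.enabled and layer_idx == self.layer:
            return hidden + self.strength * self.vec
        return hidden

def forward(hidden, layers, hooks=()):
    """Toy transformer trunk: run each layer, then let hooks edit its output."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        for hook in hooks:
            hidden = hook(i, hidden)
    return hidden

# Toy model: four identity "layers" over an 8-dim hidden state
layers = [lambda h: h] * 4
vec = np.ones(8)
hook = SteeringHook(vec, layer=2, strength=1.5)

out = forward(np.zeros(8), layers, hooks=[hook])
# The hook fired only at layer 2, so out == 1.5 * vec
```

Because the hook is additive and per-layer, multiple `SteeringHook`s compose naturally, which is what makes the multi-behavior experiments above cheap to run.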
## Summary

Steering vectors are the missing middle between ICL (temporary) and fine-tuning (permanent):

- Extract behavioral directions from activation space
- Test them immediately without training
- Use them as prototypes before committing to weight changes
- Convert to permanent task vectors via Apollo training
- Multiple vectors compose for multi-behavioral improvement
- Applicable TODAY via vLLM inference hooks