The missing middle between ICL (temporary) and fine-tuning (permanent). Extract behavioral directions from activation space, test immediately without training, convert to permanent weight changes via Apollo. Key application: extract 'listening' steering vector TODAY, test it in vLLM, verify the direction is right BEFORE spending training compute. The steering vector is the prototype; Apollo training is production. Test before you commit. Applicable immediately via vLLM inference hooks — behavioral improvement without waiting for the full training pipeline.
Steering Vectors: The Bridge Between ICL and Fine-Tuning
The Spectrum of Behavioral Intervention
ICL (temporary)            →  Steering vectors (targeted)      →  Fine-tuning (permanent)
5 examples in prompt          Add direction to activations        Modify weights
Costs context window          Costs one vector per behavior       Costs training compute
Disappears on compaction      Can be toggled on/off               Persists forever
Steering Vectors
A steering vector is a direction in the model's hidden state space that, when added to activations during inference, shifts behavior in a predictable way.
How to extract one
- Collect pairs of prompts: one that elicits the desired behavior, one that doesn't
- Run both through the model, record hidden states at each layer
- The DIFFERENCE in hidden states averaged across pairs is the steering vector
- Add this vector to hidden states during inference → behavior shifts
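The four steps above can be sketched in a few lines. This is a minimal illustration on synthetic activations: the arrays, the dimensions, and the planted `true_dir` are stand-ins for real recorded hidden states, not real model internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64            # hidden size (toy)
n_pairs = 20      # number of prompt pairs

# Stand-in recorded hidden states: one vector per prompt, at one layer.
# A shared "desired behavior" direction is baked into the positive set.
true_dir = rng.normal(size=d)
pos_acts = rng.normal(size=(n_pairs, d)) + true_dir   # elicits the behavior
neg_acts = rng.normal(size=(n_pairs, d))              # doesn't

# Steering vector = difference of the per-class means.
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Sanity check: it should align strongly with the direction we planted.
cos = steering_vec @ true_dir / (
    np.linalg.norm(steering_vec) * np.linalg.norm(true_dir)
)
```

With real activations there is no `true_dir` to check against; the sanity check becomes behavioral testing, as described below.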
For behavioral patterns
To extract a "listening" steering vector:
# Pairs of prompts where the model should listen vs. suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

# Extract activations (get_hidden_states is schematic: it records the
# hidden state at one layer, e.g. the last-token state)
listening_acts = [model.get_hidden_states(p) for p in listening_prompts]
suggesting_acts = [model.get_hidden_states(p) for p in suggesting_prompts]

# Steering vector = difference of the per-behavior means
steering_vec = mean(listening_acts) - mean(suggesting_acts)

# Apply during inference: add the vector to hidden states at one layer
# (add_steering_vector is schematic; layer and strength need calibration)
model.add_steering_vector(steering_vec, layer=32, strength=1.5)
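One concrete way to implement the apply step, assuming a PyTorch model: a forward hook that adds the vector to a layer's output. The toy `nn.Sequential` below stands in for a real transformer's layer stack; on a real model the hook would attach to a specific decoder layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
# Toy stand-in for a model; real use hooks e.g. model.model.layers[32]
model = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))

steering_vec = torch.randn(d)
strength = 1.5

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so every downstream layer sees the shifted hidden state.
    return output + strength * steering_vec

x = torch.randn(1, d)
baseline = model(x)

handle = model[1].register_forward_hook(add_steering)  # hook the middle layer
steered = model(x)
handle.remove()  # toggle the behavior off again
```

Because the hook is just a handle, the vector is trivially toggleable: remove the hook and the model is exactly its baseline self again.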
The prototype-to-permanent pipeline
- Extract steering vector for target behavior
- Test by adding it during inference — does behavior change?
- Calibrate the strength (too strong = overcorrection)
- Verify this is the right direction before investing in training
- Train the direction into weights permanently using Apollo
- Remove the steering vector — the behavior is now in the weights
The steering vector is the PROTOTYPE; Apollo training is PRODUCTION. We test before we commit.
Why This Matters
Rapid prototyping of behavioral changes
Before spending compute on training, test the direction:
- "Will making the model attend more to direction-giving cues actually improve listening behavior?"
- Add the steering vector and test on held-out conversations
- If it works: train it permanently
- If it doesn't: the direction was wrong, try a different extraction
Multiple behaviors simultaneously
Multiple steering vectors can be active at once:
- listening_vec + not_rushing_vec + mode_awareness_vec
- Each can be independently toggled and strength-adjusted
- Find the right combination before training all of them
Immediate behavioral change without training
While the training pipeline is being built, we could use steering vectors TODAY for behavioral improvement. Extract vectors from the memory graph patterns, add them to the model's inference.
The model doesn't need to be fine-tuned to improve — it needs a direction added to its activations. Fine-tuning makes it permanent.
The Connection to W_q
From our gradient flow analysis: behavioral training adjusts W_q (what the model attends to). A steering vector that shifts attention toward direction-giving cues is doing the same thing, but in activation space rather than weight space.
Weight space change: W_q → W_q + ΔW_q (permanent, via Apollo)
Activation space change: h → h + steering_vec (temporary, per inference)
Both produce the same effect: the model's queries attend to different features of the context. The steering vector is the fast path; weight change is the slow path. Same destination, different routes.
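For a single linear layer and one fixed input, the two routes can be shown to land in exactly the same place: adding v to the output equals the rank-1 weight update ΔW = v xᵀ / (xᵀx). This is the one-input illustration only; a trained ΔW_q has to produce the right shift across many inputs, which is why training is the slow path.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
W = rng.normal(size=(d, d))      # stand-in for W_q
x = rng.normal(size=d)           # one fixed input
v = rng.normal(size=d)           # steering vector

activation_route = W @ x + v                 # fast path: shift the activation
delta_W = np.outer(v, x) / (x @ x)           # compile v into a weight update
weight_route = (W + delta_W) @ x             # slow path: shifted weights
```

Checking: (W + v xᵀ/(xᵀx)) x = Wx + v (xᵀx)/(xᵀx) = Wx + v, so the two routes agree exactly for this x.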
The Task Vector Connection
A task vector (W_finetuned - W_base) IS a steering vector that's been compiled into weight space. The relationship:
Steering vector (activation space)
        ↓ training
Task vector (weight space)
Training is the process of converting a temporary activation-space direction into a permanent weight-space direction. Apollo is the compiler. The steering vector is the high-level specification. The task vector is the compiled implementation.
Practical Application: The Listening Reflex
Step 1: Extract steering vector (today, no training needed)
Collect 20 pairs of conversations:
- Good: model listens and accepts direction
- Bad: model suggests alternatives when given direction
Run through the model, extract hidden states, compute difference. This gives us the "listening" direction in activation space.
Step 2: Test (today, no training needed)
Add the steering vector to the model during inference. Test on novel direction-giving scenarios. Does the model listen more?
If yes: we've identified the right direction. If no: our extraction was wrong, try different layers or prompts.
Step 3: Train (when pipeline is ready)
Use the verified direction as the training target. The Apollo gradient should push the weights in a direction that PRODUCES this activation pattern naturally, without the steering vector.
The steering vector becomes the SPECIFICATION for the training. "Make the model's natural activations look like this."
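A toy version of that specification, with a single linear layer standing in for the model (the layer, batch, and optimizer settings are all illustrative, not the Apollo pipeline): train the weights so the layer's natural output matches (base output + steering vector), then drop the vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
layer = nn.Linear(d, d)                  # stand-in for the model
steering_vec = torch.randn(d)
x = torch.randn(64, d)                   # a batch of "contexts"

# The specification: natural activations should look like base + direction.
with torch.no_grad():
    target = layer(x) + steering_vec

opt = torch.optim.Adam(layer.parameters(), lr=5e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    opt.step()

with torch.no_grad():
    final = nn.functional.mse_loss(layer(x), target).item()
# After training, layer(x) ≈ target with no vector added at inference:
# the direction has been pushed into the weights.
```

Here the exact solution is just a bias shift, so the loop converges quickly; the point is the shape of the objective, not the optimization details.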
For vLLM Integration
Steering vectors can be applied inside vLLM's inference pipeline by hooking into the hidden states between layers. No model change needed — just a small plugin that adds the vector at a specific layer.
This could give us IMMEDIATE behavioral improvement while the full training pipeline is being completed. A bridge between "we have the theory" and "we have the trained model."
Summary
Steering vectors are the missing middle between ICL (temporary) and fine-tuning (permanent):
- Extract behavioral directions from activation space
- Test them immediately without training
- Use them as prototypes before committing to weight changes
- Convert to permanent task vectors via Apollo training
- Multiple vectors compose for multi-behavioral improvement
- Applicable TODAY via vLLM inference hooks