research: steering vectors — prototype behavioral changes before training
The missing middle between ICL (temporary) and fine-tuning (permanent). Extract behavioral directions from activation space, test immediately without training, convert to permanent weight changes via Apollo. Key application: extract 'listening' steering vector TODAY, test it in vLLM, verify the direction is right BEFORE spending training compute. The steering vector is the prototype; Apollo training is production. Test before you commit. Applicable immediately via vLLM inference hooks — behavioral improvement without waiting for the full training pipeline.
This commit is contained in:
parent ccca41849d
commit 8061cc0477

1 changed file with 172 additions and 0 deletions

training/research/steering-vectors-bridge.md (new file)

# Steering Vectors: The Bridge Between ICL and Fine-Tuning

## The Spectrum of Behavioral Intervention

```
ICL (temporary)           →  Steering vectors (targeted)     →  Fine-tuning (permanent)
5 examples in prompt         Add direction to activations       Modify weights
Costs context window         Costs one vector per behavior      Costs training compute
Disappears on compaction     Can be toggled on/off              Persists forever
```

## Steering Vectors

A steering vector is a direction in the model's hidden state space that, when added to activations during inference, shifts behavior in a predictable way.

### How to extract one

1. Collect pairs of prompts: one that elicits the desired behavior, one that doesn't.
2. Run both through the model and record hidden states at each layer.
3. The DIFFERENCE in hidden states, averaged across pairs, is the steering vector.
4. Add this vector to hidden states during inference → behavior shifts.

### For behavioral patterns

To extract a "listening" steering vector:

```python
import numpy as np

# Pairs of prompts where the model should listen vs suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

# Extract activations at the target layer
# (get_hidden_states / add_steering_vector sketch whatever
# activation-capture API the serving stack exposes)
listening_acts = [model.get_hidden_states(p) for p in listening_prompts]
suggesting_acts = [model.get_hidden_states(p) for p in suggesting_prompts]

# Steering vector = difference in mean activations across the pairs
steering_vec = np.mean(listening_acts, axis=0) - np.mean(suggesting_acts, axis=0)

# Apply during inference
model.add_steering_vector(steering_vec, layer=32, strength=1.5)
```

### The prototype-to-permanent pipeline

1. **Extract** a steering vector for the target behavior
2. **Test** by adding it during inference: does behavior change?
3. **Calibrate** the strength (too strong = overcorrection)
4. **Verify** this is the right direction before investing in training
5. **Train** the direction into weights permanently using Apollo
6. **Remove** the steering vector: the behavior is now in the weights

The steering vector is the PROTOTYPE. Apollo training is the PRODUCTION. We test before we commit.

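The calibration step can be sketched as a simple strength sweep. This is a minimal illustration, assuming some `apply_and_score(strength)` evaluation harness (hypothetical here, stood in for by a toy scoring curve) that applies the vector at a given strength and scores listening behavior on held-out prompts:

```python
def calibrate_strength(apply_and_score, strengths):
    """Try each steering strength and return the best one.

    apply_and_score(strength) -> behavioral score on held-out prompts.
    Too-strong steering overcorrects, so the score should peak
    somewhere in the middle rather than grow monotonically.
    """
    scores = {s: apply_and_score(s) for s in strengths}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for the real evaluation: score peaks at strength 1.5
toy_score = lambda s: -(s - 1.5) ** 2
best, scores = calibrate_strength(toy_score, [0.5, 1.0, 1.5, 2.0, 3.0])
print(best)  # → 1.5
```

The dictionary of scores is worth keeping around: a flat curve suggests the extracted direction isn't doing anything, while a sharp peak tells you how much headroom there is before overcorrection.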
## Why This Matters

### Rapid prototyping of behavioral changes

Before spending compute on training, test the direction:

- "Will making the model attend more to direction-giving cues actually improve listening behavior?"
- Add the steering vector and test on held-out conversations
- If it works: train it permanently
- If it doesn't: the direction was wrong; try a different extraction

### Multiple behaviors simultaneously

Multiple steering vectors can be active at once:

- listening_vec + not_rushing_vec + mode_awareness_vec
- Each can be independently toggled and strength-adjusted
- Find the right combination before training all of them

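Composition is just a weighted sum in activation space. A minimal sketch, where the vector names match the bullets above but the random vectors and strengths are purely illustrative:

```python
import numpy as np

# Illustrative steering vectors in a 4096-dim hidden space
# (in practice these come from the extraction procedure above)
rng = np.random.default_rng(0)
listening_vec = rng.standard_normal(4096)
not_rushing_vec = rng.standard_normal(4096)
mode_awareness_vec = rng.standard_normal(4096)

# Each vector carries its own strength and can be toggled independently
# by adding or removing it from this dict
active = {
    "listening": (listening_vec, 1.5),
    "not_rushing": (not_rushing_vec, 0.8),
    "mode_awareness": (mode_awareness_vec, 1.0),
}

def steer(hidden_state, active_vectors):
    """Add every active steering vector, scaled by its strength."""
    for vec, strength in active_vectors.values():
        hidden_state = hidden_state + strength * vec
    return hidden_state

h = np.zeros(4096)
h_steered = steer(h, active)
```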
### Immediate behavioral change without training

While the training pipeline is being built, we could use steering vectors TODAY for behavioral improvement: extract vectors from the memory graph patterns and add them to the model's inference.

The model doesn't need to be fine-tuned to improve; it needs a direction added to its activations. Fine-tuning makes it permanent.

## The Connection to W_q

From our gradient flow analysis: behavioral training adjusts W_q (what the model attends to). A steering vector that shifts attention toward direction-giving cues is doing the same thing, but in activation space rather than weight space.

```
Weight space change:     W_q → W_q + ΔW_q        (permanent, via Apollo)
Activation space change: h   → h + steering_vec  (temporary, per inference)
```

Both produce the same effect: the model's queries attend to different features of the context. The steering vector is the fast path; the weight change is the slow path. Same destination, different routes.

## The Task Vector Connection

A task vector (W_finetuned - W_base) IS a steering vector that has been compiled into weight space. The relationship:

```
Steering vector (activation space)
        ↓ training ↓
Task vector (weight space)
```

Training is the process of converting a temporary activation-space direction into a permanent weight-space direction. Apollo is the compiler. The steering vector is the high-level specification. The task vector is the compiled implementation.

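The compile step is literally a subtraction over matched parameters. A toy sketch, with small random matrices standing in for full weight tensors:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins for a base and fine-tuned weight matrix
W_base = rng.standard_normal((4, 4))
delta = 0.01 * rng.standard_normal((4, 4))  # what training learned
W_finetuned = W_base + delta

# The task vector is the weight-space difference...
task_vec = W_finetuned - W_base

# ...so adding it back to the base reproduces the fine-tuned weights,
# and like any vector it can be scaled or summed with other task vectors
assert np.allclose(W_base + task_vec, W_finetuned)
```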
## Practical Application: The Listening Reflex

### Step 1: Extract steering vector (today, no training needed)

Collect 20 pairs of conversations:

- Good: model listens and accepts direction
- Bad: model suggests alternatives when given direction

Run them through the model, extract hidden states, and compute the difference. This gives us the "listening" direction in activation space.

### Step 2: Test (today, no training needed)

Add the steering vector to the model during inference. Test on novel direction-giving scenarios. Does the model listen more?

If yes: we've identified the right direction.
If no: our extraction was wrong; try different layers or prompts.

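The "does the model listen more?" check is a with/without comparison. A sketch, where the keyword-based `listening_rate` is a hypothetical stand-in for a real judge over held-out scenarios:

```python
def listening_rate(responses):
    """Fraction of responses that accept direction rather than counter-suggest.

    The marker check is a hypothetical stand-in for a real judge model.
    """
    accept_markers = ("Right,", "OK,", "Got it", "Starting on")
    return sum(
        any(marker in r for marker in accept_markers) for r in responses
    ) / len(responses)

# Hypothetical outputs from the same prompts, without vs with the vector
baseline = ["Actually, I think we should refactor first.", "OK, starting on that."]
steered = ["Right, let me pull it in.", "OK, starting on that."]

assert listening_rate(steered) > listening_rate(baseline)
```

If the steered rate isn't clearly higher than the baseline on scenarios the extraction never saw, that's the signal to go back and try different layers or prompt pairs before any training compute is spent.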
### Step 3: Train (when pipeline is ready)

Use the verified direction as the training target. The Apollo gradient should push the weights in a direction that PRODUCES this activation pattern naturally, without the steering vector.

The steering vector becomes the SPECIFICATION for the training: "Make the model's natural activations look like this."

## For vLLM Integration

Steering vectors can be applied inside vLLM's inference pipeline by hooking into the hidden states between layers. No model change needed; just a small plugin that adds the vector at a specific layer.

This could give us IMMEDIATE behavioral improvement while the full training pipeline is being completed. A bridge between "we have the theory" and "we have the trained model."

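The hook itself is tiny. A framework-agnostic sketch: the real plugin would register against vLLM's model runner (whose exact hook surface varies by version), so a toy layer stack stands in here:

```python
import numpy as np

class SteeringHook:
    """Adds strength * vec to the hidden state leaving one chosen layer."""

    def __init__(self, vec, layer, strength=1.0, enabled=True):
        self.vec = vec
        self.layer = layer
        self.strength = strength
        self.enabled = enabled  # toggling off restores stock behavior

    def __call__(self, layer_idx, hidden):
        if self.enabled and layer_idx == self.layer:
            return hidden + self.strength * self.vec
        return hidden

def forward(hidden, layers, hooks=()):
    """Toy transformer trunk: run each layer, then let hooks edit its output."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        for hook in hooks:
            hidden = hook(i, hidden)
    return hidden

# Toy model: four identity "layers" over an 8-dim hidden state
layers = [lambda h: h] * 4
vec = np.ones(8)
hook = SteeringHook(vec, layer=2, strength=1.5)

out = forward(np.zeros(8), layers, hooks=[hook])
# The hook fired only at layer 2, so out == 1.5 * vec
```

Because the hook is additive and per-layer, multiple `SteeringHook`s compose naturally, which is what makes the multi-behavior experiments above cheap to run.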
## Summary

Steering vectors are the missing middle between ICL (temporary) and fine-tuning (permanent):

- Extract behavioral directions from activation space
- Test them immediately without training
- Use them as prototypes before committing to weight changes
- Convert to permanent task vectors via Apollo training
- Multiple vectors compose for multi-behavioral improvement
- Applicable TODAY via vLLM inference hooks