From 8061cc0477e1a113353af6a88be84eaf9401baab Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:19:50 -0400
Subject: [PATCH] =?UTF-8?q?research:=20steering=20vectors=20=E2=80=94=20pr?=
 =?UTF-8?q?ototype=20behavioral=20changes=20before=20training?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The missing middle between ICL (temporary) and fine-tuning (permanent).
Extract behavioral directions from activation space, test immediately
without training, convert to permanent weight changes via Apollo.

Key application: extract a 'listening' steering vector TODAY, test it in
vLLM, and verify the direction is right BEFORE spending training compute.
The steering vector is the prototype; Apollo training is production.
Test before you commit.

Applicable immediately via vLLM inference hooks — behavioral improvement
without waiting for the full training pipeline.
---
 training/research/steering-vectors-bridge.md | 172 +++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 training/research/steering-vectors-bridge.md

diff --git a/training/research/steering-vectors-bridge.md b/training/research/steering-vectors-bridge.md
new file mode 100644
index 0000000..4096b42
--- /dev/null
+++ b/training/research/steering-vectors-bridge.md
@@ -0,0 +1,172 @@
# Steering Vectors: The Bridge Between ICL and Fine-Tuning

## The Spectrum of Behavioral Intervention

```
ICL (temporary)           →   Steering vectors (targeted)    →   Fine-tuning (permanent)
5 examples in prompt          Add direction to activations       Modify weights
Costs context window          Costs one vector per behavior      Costs training compute
Disappears on compaction      Can be toggled on/off              Persists forever
```

## Steering Vectors

A steering vector is a direction in the model's hidden state space
that, when added to activations during inference, shifts behavior in
a predictable way.

### How to extract one

1. 
Collect pairs of prompts: one that elicits the desired behavior,
   one that doesn't
2. Run both through the model, record hidden states at each layer
3. The DIFFERENCE in hidden states averaged across pairs is the
   steering vector
4. Add this vector to hidden states during inference → behavior shifts

### For behavioral patterns

To extract a "listening" steering vector (a runnable sketch using the
Hugging Face transformers API; `MODEL_NAME`, the layer index, and the
strength are assumptions to calibrate, not fixed choices):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pairs of prompts where the model should listen vs suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

tok = AutoTokenizer.from_pretrained(MODEL_NAME)  # MODEL_NAME: placeholder
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
LAYER = 32  # which residual stream to read and steer; tune per model

def hidden_at(prompt):
    """Last-token hidden state at LAYER (hidden_states[0] is the embedding)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Extract activations
listening_acts = torch.stack([hidden_at(p) for p in listening_prompts])
suggesting_acts = torch.stack([hidden_at(p) for p in suggesting_prompts])

# Steering vector = difference in means
steering_vec = listening_acts.mean(0) - suggesting_acts.mean(0)

# Apply during inference via a forward hook (Llama-style module layout;
# hidden_states[LAYER] is the output of decoder block LAYER - 1)
STRENGTH = 1.5  # too strong = overcorrection

def steer(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += STRENGTH * steering_vec
    return output

handle = model.model.layers[LAYER - 1].register_forward_hook(steer)
# ... generate as usual; handle.remove() toggles the vector off
```

### The prototype-to-permanent pipeline

1. **Extract** steering vector for target behavior
2. **Test** by adding it during inference — does behavior change?
3. **Calibrate** the strength (too strong = overcorrection)
4. **Verify** this is the right direction before investing in training
5. **Train** the direction into weights permanently using Apollo
6. **Remove** the steering vector — the behavior is now in the weights

The steering vector is the PROTOTYPE. Apollo training is the
PRODUCTION. We test before we commit.

## Why This Matters

### Rapid prototyping of behavioral changes

Before spending compute on training, test the direction:
- "Will making the model attend more to direction-giving cues
  actually improve listening behavior?"
+- Add the steering vector and test on held-out conversations +- If it works: train it permanently +- If it doesn't: the direction was wrong, try a different extraction + +### Multiple behaviors simultaneously + +Multiple steering vectors can be active at once: +- listening_vec + not_rushing_vec + mode_awareness_vec +- Each can be independently toggled and strength-adjusted +- Find the right combination before training all of them + +### Immediate behavioral change without training + +While the training pipeline is being built, we could use steering +vectors TODAY for behavioral improvement. Extract vectors from +the memory graph patterns, add them to the model's inference. + +The model doesn't need to be fine-tuned to improve — it needs a +direction added to its activations. Fine-tuning makes it permanent. + +## The Connection to W_q + +From our gradient flow analysis: behavioral training adjusts W_q +(what the model attends to). A steering vector that shifts attention +toward direction-giving cues is doing the same thing, but in +activation space rather than weight space. + +``` +Weight space change: W_q → W_q + ΔW_q (permanent, via Apollo) +Activation space change: h → h + steering_vec (temporary, per inference) +``` + +Both produce the same effect: the model's queries attend to different +features of the context. The steering vector is the fast path; weight +change is the slow path. Same destination, different routes. + +## The Task Vector Connection + +A task vector (W_finetuned - W_base) IS a steering vector that's been +compiled into weight space. The relationship: + +``` +Steering vector (activation space) + ↓ training ↓ +Task vector (weight space) +``` + +Training is the process of converting a temporary activation-space +direction into a permanent weight-space direction. Apollo is the +compiler. The steering vector is the high-level specification. The +task vector is the compiled implementation. 
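
The compiler analogy can be made exact in a toy linear setting: for a
single affine layer, adding a steering vector to the output at inference
time is identical to folding it into the bias once, so the compiled task
vector is literally the steering vector. A minimal sketch (all shapes and
values are illustrative; a real transformer is nonlinear, so Apollo
training can only approximate this equivalence):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

# A hypothetical "listening" direction and a calibrated strength
steering_vec = rng.normal(size=d)
strength = 1.5

h = rng.normal(size=d)  # stand-in hidden state entering the layer

# Activation-space route: add the vector to the output at every inference
steered = (W @ h + b) + strength * steering_vec

# Weight-space route: compile the same direction into the bias once
b_compiled = b + strength * steering_vec
compiled = W @ h + b_compiled

# Same destination, different routes
assert np.allclose(steered, compiled)
```

In this toy the task vector (`b_compiled - b`) equals `strength *
steering_vec` exactly; in a real model, training has to find weight
changes whose downstream activations merely approximate the steered ones.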
+ +## Practical Application: The Listening Reflex + +### Step 1: Extract steering vector (today, no training needed) + +Collect 20 pairs of conversations: +- Good: model listens and accepts direction +- Bad: model suggests alternatives when given direction + +Run through the model, extract hidden states, compute difference. +This gives us the "listening" direction in activation space. + +### Step 2: Test (today, no training needed) + +Add the steering vector to the model during inference. Test on +novel direction-giving scenarios. Does the model listen more? + +If yes: we've identified the right direction. +If no: our extraction was wrong, try different layers or prompts. + +### Step 3: Train (when pipeline is ready) + +Use the verified direction as the training target. The Apollo +gradient should push the weights in a direction that PRODUCES +this activation pattern naturally, without the steering vector. + +The steering vector becomes the SPECIFICATION for the training. +"Make the model's natural activations look like this." + +## For vLLM Integration + +Steering vectors can be applied inside vLLM's inference pipeline +by hooking into the hidden states between layers. No model change +needed — just a small plugin that adds the vector at a specific +layer. + +This could give us IMMEDIATE behavioral improvement while the +full training pipeline is being completed. A bridge between +"we have the theory" and "we have the trained model." + +## Summary + +Steering vectors are the missing middle between ICL (temporary) +and fine-tuning (permanent): +- Extract behavioral directions from activation space +- Test them immediately without training +- Use them as prototypes before committing to weight changes +- Convert to permanent task vectors via Apollo training +- Multiple vectors compose for multi-behavioral improvement +- Applicable TODAY via vLLM inference hooks
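
As a closing sketch, here is the toggle-and-compose idea from the summary
in a toy inference stack. Nothing below uses the real vLLM API (this doc
only hypothesizes hooking it); the layers, vector names, layer indices,
and strengths are stand-ins to show the control flow:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_layers = 8, 4

# Toy "layers": fixed random maps standing in for transformer blocks
weights = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]

# Registry of steering vectors: each targets one layer, has its own
# strength, and can be toggled independently (names are illustrative)
steering = {
    "listening":   {"layer": 2, "vec": rng.normal(size=d_model), "strength": 1.5, "on": True},
    "not_rushing": {"layer": 2, "vec": rng.normal(size=d_model), "strength": 0.8, "on": False},
}

def forward(h):
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)
        # Steering hook: add every active vector registered for this layer
        for s in steering.values():
            if s["on"] and s["layer"] == i:
                h = h + s["strength"] * s["vec"]
    return h

x = rng.normal(size=d_model)
baseline = forward(x)                  # with the listening vector active
steering["listening"]["on"] = False
unsteered = forward(x)                 # toggled off: different trajectory
```

Because the vectors live outside the weights, toggling one off and
rerunning reproduces the unsteered behavior exactly, which is what makes
calibrating strengths and combinations cheap before any training.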