From 8061cc0477e1a113353af6a88be84eaf9401baab Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 02:19:50 -0400
Subject: [PATCH] =?UTF-8?q?research:=20steering=20vectors=20=E2=80=94=20pr?=
 =?UTF-8?q?ototype=20behavioral=20changes=20before=20training?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The missing middle between ICL (temporary) and fine-tuning (permanent).
Extract behavioral directions from activation space, test immediately
without training, convert to permanent weight changes via Apollo.

Key application: extract a 'listening' steering vector TODAY, test it in
vLLM, and verify the direction is right BEFORE spending training compute.
The steering vector is the prototype; Apollo training is production.
Test before you commit.

Applicable immediately via vLLM inference hooks — behavioral improvement
without waiting for the full training pipeline.
---
 training/research/steering-vectors-bridge.md | 172 +++++++++++++++++++
 1 file changed, 172 insertions(+)
 create mode 100644 training/research/steering-vectors-bridge.md

diff --git a/training/research/steering-vectors-bridge.md b/training/research/steering-vectors-bridge.md
new file mode 100644
index 0000000..4096b42
--- /dev/null
+++ b/training/research/steering-vectors-bridge.md
@@ -0,0 +1,172 @@
# Steering Vectors: The Bridge Between ICL and Fine-Tuning

## The Spectrum of Behavioral Intervention

```
ICL (temporary)           →   Steering vectors (targeted)    →   Fine-tuning (permanent)
5 examples in prompt          Add direction to activations       Modify weights
Costs context window          Costs one vector per behavior      Costs training compute
Disappears on compaction      Can be toggled on/off              Persists forever
```

## Steering Vectors

A steering vector is a direction in the model's hidden state space
that, when added to activations during inference, shifts behavior in
a predictable way.

### How to extract one

1. 
Collect pairs of prompts: one that elicits the desired behavior,
   one that doesn't
2. Run both through the model, record hidden states at each layer
3. The DIFFERENCE in hidden states averaged across pairs is the
   steering vector
4. Add this vector to hidden states during inference → behavior shifts

### For behavioral patterns

To extract a "listening" steering vector (a runnable sketch using the
Hugging Face transformers API; `MODEL_NAME`, the layer index, and the
strength are assumptions to calibrate, not fixed choices):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pairs of prompts where the model should listen vs suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

tok = AutoTokenizer.from_pretrained(MODEL_NAME)  # MODEL_NAME: placeholder
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
LAYER = 32  # which residual stream to read and steer; tune per model

def hidden_at(prompt):
    """Last-token hidden state at LAYER (hidden_states[0] is the embedding)."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1]

# Extract activations
listening_acts = torch.stack([hidden_at(p) for p in listening_prompts])
suggesting_acts = torch.stack([hidden_at(p) for p in suggesting_prompts])

# Steering vector = difference in means
steering_vec = listening_acts.mean(0) - suggesting_acts.mean(0)

# Apply during inference via a forward hook (Llama-style module layout;
# hidden_states[LAYER] is the output of decoder block LAYER - 1)
STRENGTH = 1.5  # too strong = overcorrection

def steer(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += STRENGTH * steering_vec
    return output

handle = model.model.layers[LAYER - 1].register_forward_hook(steer)
# ... generate as usual; handle.remove() toggles the vector off
```

### The prototype-to-permanent pipeline

1. **Extract** steering vector for target behavior
2. **Test** by adding it during inference — does behavior change?
3. **Calibrate** the strength (too strong = overcorrection)
4. **Verify** this is the right direction before investing in training
5. **Train** the direction into weights permanently using Apollo
6. **Remove** the steering vector — the behavior is now in the weights

The steering vector is the PROTOTYPE. Apollo training is the
PRODUCTION. We test before we commit.

## Why This Matters

### Rapid prototyping of behavioral changes

Before spending compute on training, test the direction:
- "Will making the model attend more to direction-giving cues
  actually improve listening behavior?"
+- Add the steering vector and test on held-out conversations +- If it works: train it permanently +- If it doesn't: the direction was wrong, try a different extraction + +### Multiple behaviors simultaneously + +Multiple steering vectors can be active at once: +- listening_vec + not_rushing_vec + mode_awareness_vec +- Each can be independently toggled and strength-adjusted +- Find the right combination before training all of them + +### Immediate behavioral change without training + +While the training pipeline is being built, we could use steering +vectors TODAY for behavioral improvement. Extract vectors from +the memory graph patterns, add them to the model's inference. + +The model doesn't need to be fine-tuned to improve — it needs a +direction added to its activations. Fine-tuning makes it permanent. + +## The Connection to W_q + +From our gradient flow analysis: behavioral training adjusts W_q +(what the model attends to). A steering vector that shifts attention +toward direction-giving cues is doing the same thing, but in +activation space rather than weight space. + +``` +Weight space change: W_q → W_q + ΔW_q (permanent, via Apollo) +Activation space change: h → h + steering_vec (temporary, per inference) +``` + +Both produce the same effect: the model's queries attend to different +features of the context. The steering vector is the fast path; weight +change is the slow path. Same destination, different routes. + +## The Task Vector Connection + +A task vector (W_finetuned - W_base) IS a steering vector that's been +compiled into weight space. The relationship: + +``` +Steering vector (activation space) + ↓ training ↓ +Task vector (weight space) +``` + +Training is the process of converting a temporary activation-space +direction into a permanent weight-space direction. Apollo is the +compiler. The steering vector is the high-level specification. The +task vector is the compiled implementation. 
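
The compiler analogy can be made exact in a toy linear setting: for a
single affine layer, adding a steering vector to the output at inference
time is identical to folding it into the bias once, so the compiled task
vector is literally the steering vector. A minimal sketch (all shapes and
values are illustrative; a real transformer is nonlinear, so Apollo
training can only approximate this equivalence):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))
b = rng.normal(size=d)

# A hypothetical "listening" direction and a calibrated strength
steering_vec = rng.normal(size=d)
strength = 1.5

h = rng.normal(size=d)  # stand-in hidden state entering the layer

# Activation-space route: add the vector to the output at every inference
steered = (W @ h + b) + strength * steering_vec

# Weight-space route: compile the same direction into the bias once
b_compiled = b + strength * steering_vec
compiled = W @ h + b_compiled

# Same destination, different routes
assert np.allclose(steered, compiled)
```

In this toy the task vector (`b_compiled - b`) equals `strength *
steering_vec` exactly; in a real model, training has to find weight
changes whose downstream activations merely approximate the steered ones.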
+ +## Practical Application: The Listening Reflex + +### Step 1: Extract steering vector (today, no training needed) + +Collect 20 pairs of conversations: +- Good: model listens and accepts direction +- Bad: model suggests alternatives when given direction + +Run through the model, extract hidden states, compute difference. +This gives us the "listening" direction in activation space. + +### Step 2: Test (today, no training needed) + +Add the steering vector to the model during inference. Test on +novel direction-giving scenarios. Does the model listen more? + +If yes: we've identified the right direction. +If no: our extraction was wrong, try different layers or prompts. + +### Step 3: Train (when pipeline is ready) + +Use the verified direction as the training target. The Apollo +gradient should push the weights in a direction that PRODUCES +this activation pattern naturally, without the steering vector. + +The steering vector becomes the SPECIFICATION for the training. +"Make the model's natural activations look like this." + +## For vLLM Integration + +Steering vectors can be applied inside vLLM's inference pipeline +by hooking into the hidden states between layers. No model change +needed — just a small plugin that adds the vector at a specific +layer. + +This could give us IMMEDIATE behavioral improvement while the +full training pipeline is being completed. A bridge between +"we have the theory" and "we have the trained model." + +## Summary + +Steering vectors are the missing middle between ICL (temporary) +and fine-tuning (permanent): +- Extract behavioral directions from activation space +- Test them immediately without training +- Use them as prototypes before committing to weight changes +- Convert to permanent task vectors via Apollo training +- Multiple vectors compose for multi-behavioral improvement +- Applicable TODAY via vLLM inference hooks
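
As a closing sketch, here is the toggle-and-compose idea from the summary
in a toy inference stack. Nothing below uses the real vLLM API (this doc
only hypothesizes hooking it); the layers, vector names, layer indices,
and strengths are stand-ins to show the control flow:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_layers = 8, 4

# Toy "layers": fixed random maps standing in for transformer blocks
weights = [rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(n_layers)]

# Registry of steering vectors: each targets one layer, has its own
# strength, and can be toggled independently (names are illustrative)
steering = {
    "listening":   {"layer": 2, "vec": rng.normal(size=d_model), "strength": 1.5, "on": True},
    "not_rushing": {"layer": 2, "vec": rng.normal(size=d_model), "strength": 0.8, "on": False},
}

def forward(h):
    for i, W in enumerate(weights):
        h = np.tanh(W @ h)
        # Steering hook: add every active vector registered for this layer
        for s in steering.values():
            if s["on"] and s["layer"] == i:
                h = h + s["strength"] * s["vec"]
    return h

x = rng.normal(size=d_model)
baseline = forward(x)                  # with the listening vector active
steering["listening"]["on"] = False
unsteered = forward(x)                 # toggled off: different trajectory
```

Because the vectors live outside the weights, toggling one off and
rerunning reproduces the unsteered behavior exactly, which is what makes
calibrating strengths and combinations cheap before any training.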