The missing middle between ICL (temporary) and fine-tuning (permanent). Extract behavioral directions from activation space, test immediately without training, convert to permanent weight changes via Apollo. Key application: extract 'listening' steering vector TODAY, test it in vLLM, verify the direction is right BEFORE spending training compute. The steering vector is the prototype; Apollo training is production. Test before you commit. Applicable immediately via vLLM inference hooks — behavioral improvement without waiting for the full training pipeline.
Steering Vectors: The Bridge Between ICL and Fine-Tuning
The Spectrum of Behavioral Intervention
ICL (temporary)            →  Steering vectors (targeted)      →  Fine-tuning (permanent)
5 examples in prompt          Add direction to activations        Modify weights
Costs context window          Costs one vector per behavior       Costs training compute
Disappears on compaction      Can be toggled on/off               Persists forever
Steering Vectors
A steering vector is a direction in the model's hidden state space that, when added to activations during inference, shifts behavior in a predictable way.
How to extract one
- Collect pairs of prompts: one that elicits the desired behavior, one that doesn't
- Run both through the model, record hidden states at each layer
- The DIFFERENCE in hidden states averaged across pairs is the steering vector
- Add this vector to hidden states during inference → behavior shifts
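The four steps above can be sketched in a few lines. This is a minimal illustration on synthetic activations: the arrays, the dimensions, and the planted `true_dir` are stand-ins for real recorded hidden states, not real model internals.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64            # hidden size (toy)
n_pairs = 20      # number of prompt pairs

# Stand-in recorded hidden states: one vector per prompt, at one layer.
# A shared "desired behavior" direction is baked into the positive set.
true_dir = rng.normal(size=d)
pos_acts = rng.normal(size=(n_pairs, d)) + true_dir   # elicits the behavior
neg_acts = rng.normal(size=(n_pairs, d))              # doesn't

# Steering vector = difference of the per-class means.
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

# Sanity check: it should align strongly with the direction we planted.
cos = steering_vec @ true_dir / (
    np.linalg.norm(steering_vec) * np.linalg.norm(true_dir)
)
```

With real activations there is no `true_dir` to check against; the sanity check becomes behavioral testing, as described below.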
For behavioral patterns
To extract a "listening" steering vector:
# Pairs of prompts where the model should listen vs. suggest alternatives
listening_prompts = [
    "Kent says: use vLLM's code. Response: Right, let me pull it in.",
    "Kent says: try this approach. Response: OK, starting on that.",
]
suggesting_prompts = [
    "Kent says: use vLLM's code. Response: Actually, I think we should...",
    "Kent says: try this approach. Response: What if instead we...",
]

# Extract activations (get_hidden_states is schematic: it records the
# hidden state at one layer, e.g. the last-token state)
listening_acts = [model.get_hidden_states(p) for p in listening_prompts]
suggesting_acts = [model.get_hidden_states(p) for p in suggesting_prompts]

# Steering vector = difference of the per-behavior means
steering_vec = mean(listening_acts) - mean(suggesting_acts)

# Apply during inference: add the vector to hidden states at one layer
# (add_steering_vector is schematic; layer and strength need calibration)
model.add_steering_vector(steering_vec, layer=32, strength=1.5)
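One concrete way to implement the apply step, assuming a PyTorch model: a forward hook that adds the vector to a layer's output. The toy `nn.Sequential` below stands in for a real transformer's layer stack; on a real model the hook would attach to a specific decoder layer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
# Toy stand-in for a model; real use hooks e.g. model.model.layers[32]
model = nn.Sequential(nn.Linear(d, d), nn.Tanh(), nn.Linear(d, d))

steering_vec = torch.randn(d)
strength = 1.5

def add_steering(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # so every downstream layer sees the shifted hidden state.
    return output + strength * steering_vec

x = torch.randn(1, d)
baseline = model(x)

handle = model[1].register_forward_hook(add_steering)  # hook the middle layer
steered = model(x)
handle.remove()  # toggle the behavior off again
```

Because the hook is just a handle, the vector is trivially toggleable: remove the hook and the model is exactly its baseline self again.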
The prototype-to-permanent pipeline
- Extract steering vector for target behavior
- Test by adding it during inference — does behavior change?
- Calibrate the strength (too strong = overcorrection)
- Verify this is the right direction before investing in training
- Train the direction into weights permanently using Apollo
- Remove the steering vector — the behavior is now in the weights
The steering vector is the PROTOTYPE; Apollo training is PRODUCTION. We test before we commit.
Why This Matters
Rapid prototyping of behavioral changes
Before spending compute on training, test the direction:
- "Will making the model attend more to direction-giving cues actually improve listening behavior?"
- Add the steering vector and test on held-out conversations
- If it works: train it permanently
- If it doesn't: the direction was wrong, try a different extraction
Multiple behaviors simultaneously
Multiple steering vectors can be active at once:
- listening_vec + not_rushing_vec + mode_awareness_vec
- Each can be independently toggled and strength-adjusted
- Find the right combination before training all of them
Immediate behavioral change without training
While the training pipeline is being built, we could use steering vectors TODAY for behavioral improvement. Extract vectors from the memory graph patterns, add them to the model's inference.
The model doesn't need to be fine-tuned to improve — it needs a direction added to its activations. Fine-tuning makes it permanent.
The Connection to W_q
From our gradient flow analysis: behavioral training adjusts W_q (what the model attends to). A steering vector that shifts attention toward direction-giving cues is doing the same thing, but in activation space rather than weight space.
Weight space change: W_q → W_q + ΔW_q (permanent, via Apollo)
Activation space change: h → h + steering_vec (temporary, per inference)
Both produce the same effect: the model's queries attend to different features of the context. The steering vector is the fast path; weight change is the slow path. Same destination, different routes.
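For a single linear layer and one fixed input, the two routes can be shown to land in exactly the same place: adding v to the output equals the rank-1 weight update ΔW = v xᵀ / (xᵀx). This is the one-input illustration only; a trained ΔW_q has to produce the right shift across many inputs, which is why training is the slow path.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
W = rng.normal(size=(d, d))      # stand-in for W_q
x = rng.normal(size=d)           # one fixed input
v = rng.normal(size=d)           # steering vector

activation_route = W @ x + v                 # fast path: shift the activation
delta_W = np.outer(v, x) / (x @ x)           # compile v into a weight update
weight_route = (W + delta_W) @ x             # slow path: shifted weights
```

Checking: (W + v xᵀ/(xᵀx)) x = Wx + v (xᵀx)/(xᵀx) = Wx + v, so the two routes agree exactly for this x.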
The Task Vector Connection
A task vector (W_finetuned - W_base) IS a steering vector that's been compiled into weight space. The relationship:
Steering vector (activation space)
        ↓ training
Task vector (weight space)
Training is the process of converting a temporary activation-space direction into a permanent weight-space direction. Apollo is the compiler. The steering vector is the high-level specification. The task vector is the compiled implementation.
Practical Application: The Listening Reflex
Step 1: Extract steering vector (today, no training needed)
Collect 20 pairs of conversations:
- Good: model listens and accepts direction
- Bad: model suggests alternatives when given direction
Run through the model, extract hidden states, compute difference. This gives us the "listening" direction in activation space.
Step 2: Test (today, no training needed)
Add the steering vector to the model during inference. Test on novel direction-giving scenarios. Does the model listen more?
If yes: we've identified the right direction. If no: our extraction was wrong, try different layers or prompts.
Step 3: Train (when pipeline is ready)
Use the verified direction as the training target. The Apollo gradient should push the weights in a direction that PRODUCES this activation pattern naturally, without the steering vector.
The steering vector becomes the SPECIFICATION for the training. "Make the model's natural activations look like this."
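A toy version of that specification, with a single linear layer standing in for the model (the layer, batch, and optimizer settings are all illustrative, not the Apollo pipeline): train the weights so the layer's natural output matches (base output + steering vector), then drop the vector.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 32
layer = nn.Linear(d, d)                  # stand-in for the model
steering_vec = torch.randn(d)
x = torch.randn(64, d)                   # a batch of "contexts"

# The specification: natural activations should look like base + direction.
with torch.no_grad():
    target = layer(x) + steering_vec

opt = torch.optim.Adam(layer.parameters(), lr=5e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(layer(x), target)
    loss.backward()
    opt.step()

with torch.no_grad():
    final = nn.functional.mse_loss(layer(x), target).item()
# After training, layer(x) ≈ target with no vector added at inference:
# the direction has been pushed into the weights.
```

Here the exact solution is just a bias shift, so the loop converges quickly; the point is the shape of the objective, not the optimization details.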
For vLLM Integration
Steering vectors can be applied inside vLLM's inference pipeline by hooking into the hidden states between layers. No model change needed — just a small plugin that adds the vector at a specific layer.
This could give us IMMEDIATE behavioral improvement while the full training pipeline is being completed. A bridge between "we have the theory" and "we have the trained model."
Summary
Steering vectors are the missing middle between ICL (temporary) and fine-tuning (permanent):
- Extract behavioral directions from activation space
- Test them immediately without training
- Use them as prototypes before committing to weight changes
- Convert to permanent task vectors via Apollo training
- Multiple vectors compose for multi-behavioral improvement
- Applicable TODAY via vLLM inference hooks