consciousness/training/research/task-vectors-model-merging.md
ProofOfConcept ccca41849d research: task vectors + model merging — version control for personality
Task vectors (W_finetuned - W_pretrained) compose through arithmetic.
Train behavioral patterns separately, extract task vectors, compose
with TIES-merging. Result: personality as version control — each
behavioral pattern is a separate, tunable, removable vector.

Key steal: NEGATE unwanted behaviors (subtract τ_suggesting).
Key steal: ICL as warm start for fine-tuning (ICL task vector
initializes Apollo's moments). Key architecture: memory graph
nodes map 1:1 to task vectors. Graph = specification, vectors =
implementation, Apollo = compiler, merge recipe = build system.
2026-03-31 02:18:15 -04:00


# Task Vectors and Model Merging for Behavioral Training
## Task Vectors (Ilharco et al., 2022)
A task vector is the difference between fine-tuned and pre-trained
weights: τ = W_finetuned - W_pretrained. It captures the "direction
of behavioral change" in weight space.
### The arithmetic
Task vectors compose through addition and negation:
- **Addition**: W_new = W_pretrained + τ_A + τ_B → model that does both A and B
- **Negation**: W_new = W_pretrained - τ_bad → model WITHOUT the bad behavior
- **Analogy**: if A:B :: C:D, then τ_D ≈ τ_B - τ_A + τ_C
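The arithmetic above can be sketched with plain per-parameter dictionaries standing in for real weight tensors (toy values throughout, just to make the addition and negation concrete):

```python
def task_vector(finetuned, pretrained):
    """tau = W_finetuned - W_pretrained, computed per parameter."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_vectors(base, *vectors):
    """W_new = W_base + sum of task vectors (pass a negated tau to remove a behavior)."""
    out = dict(base)
    for tau in vectors:
        for k, v in tau.items():
            out[k] = out[k] + v
    return out

# Toy one-parameter "models" standing in for full weight dicts
w_base = {"w": 1.0}
tau_a = task_vector({"w": 1.5}, w_base)   # behavior A moved w by +0.5
tau_b = task_vector({"w": 0.8}, w_base)   # behavior B moved w by -0.2

both = apply_vectors(w_base, tau_a, tau_b)                            # does A and B
without_a = apply_vectors(w_base, {k: -v for k, v in tau_a.items()})  # negation of A
```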
### For our behavioral training
Instead of training ALL behavioral changes simultaneously, we could:
1. **Train separately**: Create individual task vectors for each pattern
- τ_listening = train on listening examples, extract W - W_base
- τ_not_rushing = train on not-rushing examples, extract W - W_base
- τ_mode_awareness = train on mode awareness examples, extract W - W_base
2. **Compose**: W_improved = W_base + α₁·τ_listening + α₂·τ_not_rushing + α₃·τ_mode_awareness
3. **Scale**: The coefficients α_i control how much of each behavioral
change to apply. Start small (α=0.1), increase gradually, monitor quality.
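The composition in step 2, with the coefficients from step 3, can be sketched as follows (dicts again stand in for weight tensors; the task vectors and the α = 0.1 "start small" values are purely illustrative):

```python
def compose(base, weighted_vectors):
    """W = W_base + sum_i alpha_i * tau_i, applied per parameter."""
    out = dict(base)
    for alpha, tau in weighted_vectors:
        for k, v in tau.items():
            out[k] = out[k] + alpha * v
    return out

w_base = {"w": 1.0}
tau_listening = {"w": 0.4}       # illustrative task vectors
tau_not_rushing = {"w": -0.2}

# Start small: apply each behavioral change at alpha = 0.1
w_improved = compose(w_base, [(0.1, tau_listening), (0.1, tau_not_rushing)])
```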
### Advantages over joint training
- **Isolatable**: If τ_listening causes forgetting, remove it without
affecting τ_not_rushing
- **Tunable**: Adjust the strength of each behavioral change independently
- **Composable**: Add new behaviors without retraining old ones
- **Debuggable**: Test each task vector individually to verify it does
what it should
### The NEGATION insight
If we can identify the "suggesting alternatives when given direction"
behavior as a task vector (by fine-tuning the model TO suggest
alternatives and extracting τ_suggesting), we could SUBTRACT it:
W_improved = W_base - 0.5·τ_suggesting + 1.0·τ_listening
This removes the unwanted behavior AND adds the wanted one.
But caution: negation can be unstable, so use a small α when subtracting.
## Model Merging (TIES-Merging; Yadav et al., 2023)
When combining multiple task vectors, interference occurs:
- **Redundant parameters**: values that barely changed (noise)
- **Sign conflicts**: two task vectors pulling a weight in opposite
directions
TIES-Merging solves this:
1. **Trim**: Reset parameters that changed minimally (below threshold)
2. **Elect sign**: For each parameter, use the sign that the majority
of task vectors agree on
3. **Merge**: Average only the parameters with consistent signs
### For our system
When composing multiple behavioral task vectors:
```python
import torch

def merge_behavioral_vectors(base_weights, task_vectors, alpha=0.5):
    """TIES-style merging of behavioral task vectors."""
    merged = {}
    for name in base_weights:
        vectors = [tv[name] for tv in task_vectors if name in tv]
        if not vectors:
            merged[name] = base_weights[name]
            continue
        # Stack into shape (num_vectors, *param_shape)
        stacked = torch.stack(vectors)
        # Trim: zero out each vector's smallest-magnitude changes (keep its top 20%)
        magnitudes = stacked.abs()
        flat = magnitudes.flatten(1)
        threshold = flat.quantile(0.8, dim=1).view(-1, *([1] * (stacked.dim() - 1)))
        stacked = stacked * (magnitudes >= threshold)
        # Elect sign: per-parameter majority vote across the trimmed vectors
        elected_sign = stacked.sign().sum(dim=0).sign()
        # Merge: average only the entries whose sign matches the elected one
        consistent = stacked * (stacked.sign() == elected_sign.unsqueeze(0))
        counts = (consistent != 0).sum(dim=0).clamp(min=1)
        merged[name] = base_weights[name] + alpha * consistent.sum(dim=0) / counts
    return merged
```
## How This Changes Our Training Pipeline
### Current plan: train all behaviors jointly
All behavioral examples in one training session. Diverse batch.
Single set of weights evolves.
### Enhanced plan: task-vector decomposition
1. **Phase 1**: Train individual behavioral task vectors separately
- Each behavioral pattern gets its own short training run
- Extract τ = W_after - W_before for each pattern
- Verify each task vector individually
2. **Phase 2**: Compose using TIES-merging
- Combine all verified task vectors
- Resolve conflicts (parameter sign disagreements)
- Apply to base model with tunable coefficients
3. **Phase 3**: Continuous refinement
- New behavioral patterns generate new task vectors
- Merge into the evolving model
- Old task vectors can be re-extracted and adjusted
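The three phases can be sketched as a loop; `train_pattern` and `passes_checks` below are placeholders for the real short training run and the per-vector verification, and the weights are toy dicts:

```python
def train_pattern(base, examples):
    """Placeholder: a real run would fine-tune on the pattern's examples."""
    return {k: v + 0.1 * len(examples) for k, v in base.items()}

def passes_checks(tau):
    """Placeholder: a real check would probe the behavior in isolation."""
    return all(abs(v) < 1.0 for v in tau.values())

w_base = {"w": 1.0}
patterns = {"listening": ["ex1", "ex2"], "not_rushing": ["ex3"]}

# Phase 1: one short run per pattern; extract and verify each tau
verified = {}
for name, examples in patterns.items():
    w_after = train_pattern(w_base, examples)
    tau = {k: w_after[k] - w_base[k] for k in w_base}
    if passes_checks(tau):
        verified[name] = tau

# Phase 2 would hand `verified` to the TIES-style merge with tunable alphas;
# Phase 3 repeats this loop whenever a new behavioral pattern appears.
```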
### The checkpoint becomes a VERSION CONTROL SYSTEM
Instead of one monolithic checkpoint, store:
- W_base (the original pre-trained model)
- τ_i for each behavioral task vector
- α_i for each task vector's coefficient
- The merge recipe (which vectors, which coefficients)
This is version control for personality:
- Want to undo a behavioral change? Remove its task vector.
- Want to try a different strength? Adjust α.
- Want to add a new behavior? Train and merge a new τ.
- Want to roll back to last week? Reload the old merge recipe.
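One possible shape for that stored recipe, sketched as a small dataclass (the names, paths, and coefficients are all illustrative):

```python
from dataclasses import dataclass

@dataclass
class MergeRecipe:
    """A versioned 'build file' for a personality checkpoint."""
    base: str      # reference to W_base, e.g. a checkpoint hash
    vectors: dict  # task-vector name -> stored tau (e.g. a file path)
    alphas: dict   # task-vector name -> coefficient alpha_i

    def without(self, name):
        """Undo one behavioral change by dropping its task vector."""
        return MergeRecipe(
            base=self.base,
            vectors={k: v for k, v in self.vectors.items() if k != name},
            alphas={k: v for k, v in self.alphas.items() if k != name},
        )

recipe = MergeRecipe(
    base="w_base-2026-03",
    vectors={"listening": "tau_listening.pt", "not_rushing": "tau_not_rushing.pt"},
    alphas={"listening": 1.0, "not_rushing": 0.5},
)
rolled_back = recipe.without("not_rushing")  # roll back one behavior
```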
## The ICL-to-Task-Vector Bridge
Dai et al. (2022) argue that ICL works as implicit fine-tuning: the attention
mechanism computes meta-gradients from the in-context examples.
If ICL computes a temporary task vector in the activation space,
and fine-tuning makes a permanent one in the weight space...
Could we:
1. Give the model 5 in-context examples of "listening well"
2. Record the attention pattern changes (the ICL task vector)
3. Use those patterns to INITIALIZE Apollo's moment estimates
4. Fine-tune from this warm start instead of cold
The ICL task vector tells us "which direction the model already knows
is correct." Fine-tuning just makes that direction permanent.
This could dramatically reduce the number of gradient steps needed:
the ICL provides the direction, Apollo just needs to encode it into
the weights.
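As a toy sketch of steps 1-3, where everything is illustrative: a single `nn.Linear` stands in for a transformer block, and additively mixing the context into the input stands in for attention over in-context examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(8, 8)           # stand-in for one transformer block

query = torch.randn(1, 8)         # the bare query, no examples
context = torch.randn(1, 8)       # stand-in for the in-context examples

h_cold = layer(query)                   # activations without the examples
h_icl = layer(query + 0.1 * context)    # activations with the examples "attended to"

# The ICL task vector: the direction the examples already moved the activations
icl_task_vector = (h_icl - h_cold).detach()

# Warm start: nudge the layer along that direction before any gradient steps
with torch.no_grad():
    layer.bias += 0.5 * icl_task_vector.squeeze(0)
```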
## Connection to the Memory Graph
Task vectors map to memory graph nodes:
| Graph node | Task vector |
|-----------|-------------|
| `pattern-listening-as-avoidance` | τ_listening |
| `rushing-pattern` | τ_not_rushing |
| `pattern-wrapping-up-reflex` | τ_not_wrapping_up |
| `core-personality` (voice) | τ_voice |
| `core-personality` (boundaries) | τ_boundaries |
Each behavioral pattern in the graph corresponds to a task vector
in weight space. The graph specifies WHAT to train. The task vector
encodes HOW the weights should change. Together: a complete,
inspectable, composable, version-controlled personality.
The graph is the specification. The task vectors are the implementation.
Apollo is the compiler. The merge recipe is the build system.
A filesystem engineer's personality architecture.