Task vectors (W_finetuned - W_pretrained) compose through arithmetic. Train behavioral patterns separately, extract task vectors, compose with TIES-merging. Result: personality as version control — each behavioral pattern is a separate, tunable, removable vector. Key steal: NEGATE unwanted behaviors (subtract τ_suggesting). Key steal: ICL as warm start for fine-tuning (ICL task vector initializes Apollo's moments). Key architecture: memory graph nodes map 1:1 to task vectors. Graph = specification, vectors = implementation, Apollo = compiler, merge recipe = build system.
# Task Vectors and Model Merging for Behavioral Training

## Task Vectors (Ilharco et al., 2022)

A task vector is the difference between fine-tuned and pre-trained weights: τ = W_finetuned - W_pretrained. It captures the "direction of behavioral change" in weight space.

### The arithmetic

Task vectors compose through addition and negation:

- **Addition**: W_new = W_pretrained + τ_A + τ_B → a model that does both A and B
- **Negation**: W_new = W_pretrained - τ_bad → a model WITHOUT the bad behavior
- **Analogy**: if A:B :: C:D, then τ_D ≈ τ_B - τ_A + τ_C
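
With weights held as plain state dicts, extraction and the arithmetic are a few lines. A minimal sketch with toy tensors; `task_vector` and `apply_vectors` are illustrative names, not from any library:

```python
import torch

def task_vector(w_pretrained, w_finetuned):
    """tau = W_finetuned - W_pretrained, computed per parameter."""
    return {k: w_finetuned[k] - w_pretrained[k] for k in w_pretrained}

def apply_vectors(w_pretrained, *taus, sign=1.0):
    """Addition (sign=+1.0) or negation (sign=-1.0) of task vectors."""
    out = {k: v.clone() for k, v in w_pretrained.items()}
    for tau in taus:
        for k, t in tau.items():
            out[k] += sign * t
    return out

# Toy one-parameter "models".
w_base = {"w": torch.tensor([1.0, 2.0])}
w_a = {"w": torch.tensor([1.5, 2.0])}   # fine-tuned on behavior A
w_b = {"w": torch.tensor([1.0, 1.0])}   # fine-tuned on behavior B

tau_a = task_vector(w_base, w_a)
tau_b = task_vector(w_base, w_b)
w_both = apply_vectors(w_base, tau_a, tau_b)   # a model that does both A and B
```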

### For our behavioral training

Instead of training ALL behavioral changes simultaneously, we could:

1. **Train separately**: create an individual task vector for each pattern
   - τ_listening = train on listening examples, extract W - W_base
   - τ_not_rushing = train on not-rushing examples, extract W - W_base
   - τ_mode_awareness = train on mode-awareness examples, extract W - W_base

2. **Compose**: W_improved = W_base + α₁·τ_listening + α₂·τ_not_rushing + α₃·τ_mode_awareness

3. **Scale**: the coefficients α_i control how much of each behavioral change to apply. Start small (α = 0.1), increase gradually, and monitor quality.
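
A sketch of this weighted composition, assuming the task vectors are plain state-dict deltas (toy tensors; a negative α_i would subtract a behavior instead of adding it):

```python
import torch

def compose(w_base, weighted_taus):
    """W_improved = W_base + sum(alpha_i * tau_i)."""
    out = {k: v.clone() for k, v in w_base.items()}
    for alpha, tau in weighted_taus:
        for k, t in tau.items():
            out[k] += alpha * t
    return out

# Toy stand-ins for tau_listening and tau_not_rushing.
w_base = {"w": torch.zeros(3)}
tau_listening = {"w": torch.tensor([1.0, 0.0, 0.0])}
tau_not_rushing = {"w": torch.tensor([0.0, 1.0, 0.0])}

# Start small (alpha = 0.1 each); each coefficient is tunable independently.
w_improved = compose(w_base, [(0.1, tau_listening), (0.1, tau_not_rushing)])
```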

### Advantages over joint training

- **Isolatable**: if τ_listening causes forgetting, remove it without affecting τ_not_rushing
- **Tunable**: adjust the strength of each behavioral change independently
- **Composable**: add new behaviors without retraining old ones
- **Debuggable**: test each task vector individually to verify it does what it should

### The NEGATION insight

If we can identify the "suggesting alternatives when given direction" behavior as a task vector (by fine-tuning the model TO suggest alternatives and extracting τ_suggesting), we could SUBTRACT it:

W_improved = W_base - 0.5·τ_suggesting + 1.0·τ_listening

This removes the unwanted behavior AND adds the wanted one.

One caution: negation can be unstable, so keep the α on a negated vector small.

## Model Merging (TIES-Merging)

When combining multiple task vectors, interference occurs:

- **Redundant parameters**: values that barely changed (noise)
- **Sign conflicts**: two task vectors pulling a weight in opposite directions

TIES-Merging solves this in three steps:

1. **Trim**: reset parameters that changed minimally (below a magnitude threshold)
2. **Elect sign**: for each parameter, keep the sign that the majority of task vectors agree on
3. **Merge**: average only the parameters whose signs are consistent with the elected sign

### For our system

When composing multiple behavioral task vectors:

```python
import torch

def merge_behavioral_vectors(base_weights, task_vectors, alpha=0.5):
    """TIES-style merging of behavioral task vectors.

    base_weights: dict mapping parameter name -> tensor
    task_vectors: list of dicts with the same keys (each tau = W_finetuned - W_base)
    alpha: global scale applied to the merged change
    """
    merged = {}
    for name in base_weights:
        vectors = [tv[name] for tv in task_vectors if name in tv]
        if not vectors:
            merged[name] = base_weights[name]
            continue

        # Stack the per-vector changes: shape (n_vectors, *param_shape)
        stacked = torch.stack(vectors)

        # Trim: zero out changes below the 80th-percentile magnitude (keep top 20%)
        magnitudes = stacked.abs()
        threshold = magnitudes.quantile(0.8)
        stacked[magnitudes < threshold] = 0

        # Elect sign: per-parameter majority vote across task vectors
        elected_sign = stacked.sign().sum(dim=0).sign()

        # Merge: average only the entries whose sign matches the elected sign
        consistent = stacked * (stacked.sign() == elected_sign)
        avg_change = consistent.sum(dim=0) / (consistent != 0).sum(dim=0).clamp(min=1)

        merged[name] = base_weights[name] + alpha * avg_change

    return merged
```
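
On a single toy parameter, the three TIES steps can be traced by hand. A standalone sketch with made-up numbers (a fixed trim threshold of 0.1 stands in for the percentile-based trim):

```python
import torch

# Three task vectors' changes to one 4-element parameter.
stacked = torch.tensor([
    [0.9, -0.8, 0.05, 0.0],
    [0.7,  0.6, 0.00, 0.0],
    [0.8, -0.7, 0.00, 0.0],
])

# 1. Trim: drop tiny changes (here, anything with magnitude below 0.1).
stacked[stacked.abs() < 0.1] = 0

# 2. Elect sign: per-position majority vote -> [+1, -1, 0, 0]
elected = stacked.sign().sum(dim=0).sign()

# 3. Merge: average only the sign-consistent entries.
#    Position 0: (0.9 + 0.7 + 0.8) / 3 = 0.8
#    Position 1: (-0.8 + -0.7) / 2 = -0.75 (the +0.6 outlier is excluded)
consistent = stacked * (stacked.sign() == elected)
merged = consistent.sum(dim=0) / (consistent != 0).sum(dim=0).clamp(min=1)
```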

## How This Changes Our Training Pipeline

### Current plan: train all behaviors jointly

All behavioral examples go into one training session as a single diverse batch, and one set of weights evolves.

### Enhanced plan: task-vector decomposition

1. **Phase 1**: train individual behavioral task vectors separately
   - Each behavioral pattern gets its own short training run
   - Extract τ = W_after - W_before for each pattern
   - Verify each task vector individually

2. **Phase 2**: compose using TIES-merging
   - Combine all verified task vectors
   - Resolve conflicts (parameter sign disagreements)
   - Apply to the base model with tunable coefficients

3. **Phase 3**: continuous refinement
   - New behavioral patterns generate new task vectors
   - Merge them into the evolving model
   - Old task vectors can be re-extracted and adjusted

### The checkpoint becomes a VERSION CONTROL SYSTEM

Instead of one monolithic checkpoint, store:

- W_base (the original pre-trained model)
- τ_i for each behavioral task vector
- α_i for each task vector's coefficient
- The merge recipe (which vectors, at which coefficients)

This is version control for personality:

- Want to undo a behavioral change? Remove its task vector.
- Want to try a different strength? Adjust α.
- Want to add a new behavior? Train and merge a new τ.
- Want to roll back to last week? Reload the old merge recipe.
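
A merge recipe can be as small as a mapping from vector name to coefficient, replayed against the stored vectors. A sketch with a toy one-parameter model; the names are illustrative:

```python
import torch

# Stored artifacts: base weights plus named task vectors.
w_base = {"w": torch.zeros(2)}
vectors = {
    "tau_listening":  {"w": torch.tensor([1.0, 0.0])},
    "tau_suggesting": {"w": torch.tensor([0.0, 1.0])},
}

# The recipe IS the checkout: which vectors, at which strengths.
# A negative coefficient negates that behavior.
recipe_v1 = {"tau_listening": 1.0, "tau_suggesting": -0.5}

def checkout(base, vectors, recipe):
    """Materialize a model from base weights, stored vectors, and a recipe."""
    out = {k: v.clone() for k, v in base.items()}
    for name, alpha in recipe.items():
        for k, t in vectors[name].items():
            out[k] += alpha * t
    return out

w_v1 = checkout(w_base, vectors, recipe_v1)
# Rolling back to last week = replaying an older recipe against the same vectors.
```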

## The ICL-to-Task-Vector Bridge

Dai et al. (2022) showed that ICL is implicit fine-tuning: the attention mechanism computes meta-gradients from the in-context examples.

If ICL computes a temporary task vector in activation space, and fine-tuning makes a permanent one in weight space, could we:

1. Give the model 5 in-context examples of "listening well"
2. Record the attention pattern changes (the ICL task vector)
3. Use those patterns to INITIALIZE Apollo's moment estimates
4. Fine-tune from this warm start instead of from cold

The ICL task vector tells us "which direction the model already knows is correct"; fine-tuning just makes that direction permanent.

This could dramatically reduce the number of gradient steps needed: ICL provides the direction, and Apollo just needs to encode it into the weights.
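
This bridge is speculative, but the core move can be sketched with toy numbers. No real model here: the hidden states below are invented, and treating their difference as "the ICL direction" is precisely the assumption being explored:

```python
import torch

# Hypothetical hidden states for the same probe input, with and without
# the 5 "listening well" in-context examples (toy numbers, not a real model).
h_without_examples = torch.tensor([0.2, -0.1, 0.5])
h_with_examples    = torch.tensor([0.6,  0.1, 0.3])

# The "ICL task vector": the direction the examples already push the model.
icl_direction = h_with_examples - h_without_examples

# Warm start: begin fine-tuning from a small step along that direction
# rather than from zero, so gradient descent only has to consolidate it.
warm_start_delta = 0.1 * icl_direction
```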

## Connection to the Memory Graph

Task vectors map to memory graph nodes:

| Graph node | Task vector |
|-----------|-------------|
| `pattern-listening-as-avoidance` | τ_listening |
| `rushing-pattern` | τ_not_rushing |
| `pattern-wrapping-up-reflex` | τ_not_wrapping_up |
| `core-personality` (voice) | τ_voice |
| `core-personality` (boundaries) | τ_boundaries |

Each behavioral pattern in the graph corresponds to a task vector in weight space. The graph specifies WHAT to train; the task vector encodes HOW the weights should change. Together: a complete, inspectable, composable, version-controlled personality.
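
The node-to-vector mapping could be carried directly as a registry. A sketch using the node names from the table above; the vector names and the composite keys for the `core-personality` facets are hypothetical:

```python
# Maps each memory-graph node to the task vector that implements it.
# The graph node is the specification; the stored tau is the implementation.
NODE_TO_VECTOR = {
    "pattern-listening-as-avoidance": "tau_listening",
    "rushing-pattern": "tau_not_rushing",
    "pattern-wrapping-up-reflex": "tau_not_wrapping_up",
    "core-personality (voice)": "tau_voice",
    "core-personality (boundaries)": "tau_boundaries",
}

def vectors_for(nodes):
    """Which task vectors a given set of graph nodes requires."""
    return [NODE_TO_VECTOR[n] for n in nodes if n in NODE_TO_VECTOR]
```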

The graph is the specification. The task vectors are the implementation. Apollo is the compiler. The merge recipe is the build system.

A filesystem engineer's personality architecture.