consciousness/training/research/task-vectors-model-merging.md
# Task Vectors and Model Merging for Behavioral Training
## Task Vectors (Ilharco et al., 2022)
A task vector is the difference between fine-tuned and pre-trained
weights: τ = W_finetuned - W_pretrained. It captures the "direction
of behavioral change" in weight space.
### The arithmetic
Task vectors compose through addition and negation:
- **Addition**: W_new = W_pretrained + τ_A + τ_B → model that does both A and B
- **Negation**: W_new = W_pretrained - τ_bad → model WITHOUT the bad behavior
- **Analogy**: if A:B :: C:D, then τ_D ≈ τ_B - τ_A + τ_C
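The arithmetic above can be sketched with a toy single-parameter example (all values made up for illustration; real weights are tensors, but the algebra is elementwise and identical):

```python
# Toy one-parameter illustration of task-vector arithmetic.
W_pretrained = 1.0
W_finetuned_A = 1.3   # after fine-tuning on task A
W_finetuned_B = 0.8   # after fine-tuning on task B

tau_A = W_finetuned_A - W_pretrained   # ≈ +0.3
tau_B = W_finetuned_B - W_pretrained   # ≈ -0.2

# Addition: a model intended to do both A and B
W_both = W_pretrained + tau_A + tau_B   # ≈ 1.1

# Negation: a model WITHOUT behavior A
W_without_A = W_pretrained - tau_A      # ≈ 0.7
```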
### For our behavioral training
Instead of training ALL behavioral changes simultaneously, we could:
1. **Train separately**: Create individual task vectors for each pattern
- τ_listening = train on listening examples, extract W - W_base
- τ_not_rushing = train on not-rushing examples, extract W - W_base
- τ_mode_awareness = train on mode awareness examples, extract W - W_base
2. **Compose**: W_improved = W_base + α₁·τ_listening + α₂·τ_not_rushing + α₃·τ_mode_awareness
3. **Scale**: The coefficients α_i control how much of each behavioral
change to apply. Start small (α=0.1), increase gradually, monitor quality.
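A minimal sketch of steps 1–2, assuming weights are plain dicts of floats (in practice they would be model state_dicts of tensors; `extract_task_vector` and `compose` are hypothetical helper names):

```python
# Extract tau = W_after - W_base for one separately-trained behavior.
def extract_task_vector(w_after, w_base):
    return {k: w_after[k] - w_base[k] for k in w_base}

# Compose: W_improved = W_base + sum_i alpha_i * tau_i
def compose(w_base, task_vectors, alphas):
    out = dict(w_base)
    for tau, alpha in zip(task_vectors, alphas):
        for k, delta in tau.items():
            out[k] += alpha * delta
    return out

w_base = {"layer.weight": 1.0}
tau_listening = extract_task_vector({"layer.weight": 1.4}, w_base)
tau_not_rushing = extract_task_vector({"layer.weight": 0.9}, w_base)

# Start small (alpha = 0.1 each) and increase gradually while monitoring quality.
w_improved = compose(w_base, [tau_listening, tau_not_rushing], alphas=[0.1, 0.1])
```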
### Advantages over joint training
- **Isolatable**: If τ_listening causes forgetting, remove it without
affecting τ_not_rushing
- **Tunable**: Adjust the strength of each behavioral change independently
- **Composable**: Add new behaviors without retraining old ones
- **Debuggable**: Test each task vector individually to verify it does
what it should
### The NEGATION insight
If we can identify the "suggesting alternatives when given direction"
behavior as a task vector (by fine-tuning the model TO suggest
alternatives and extracting τ_suggesting), we could SUBTRACT it:
W_improved = W_base - 0.5·τ_suggesting + 1.0·τ_listening
This removes the unwanted behavior AND adds the wanted one.
One caution: negation can be unstable, so use a smaller α for negated vectors than for added ones.
## Model Merging (TIES-Merging, Yadav et al., 2023)
When combining multiple task vectors, interference occurs:
- **Redundant parameters**: values that barely changed (noise)
- **Sign conflicts**: two task vectors pulling a weight in opposite
directions
TIES-Merging solves this:
1. **Trim**: Reset parameters that changed minimally (below threshold)
2. **Elect sign**: For each parameter, use the sign that the majority
of task vectors agree on
3. **Merge**: Average only the parameters with consistent signs
### For our system
When composing multiple behavioral task vectors:
```python
import torch

def merge_behavioral_vectors(base_weights, task_vectors, alpha=0.5):
    """TIES-style merging of behavioral task vectors.

    base_weights: dict of parameter name -> tensor (W_base)
    task_vectors: list of dicts of parameter name -> tensor (the tau_i)
    """
    merged = {}
    for name in base_weights:
        vectors = [tv[name] for tv in task_vectors if name in tv]
        if not vectors:
            merged[name] = base_weights[name]
            continue
        # Stack into shape (num_vectors, *param_shape)
        stacked = torch.stack(vectors)
        # Trim: zero out changes below a magnitude threshold
        # (global here for simplicity; the TIES paper trims per task vector)
        magnitudes = stacked.abs()
        threshold = magnitudes.quantile(0.8)  # keep the top 20% by magnitude
        stacked[magnitudes < threshold] = 0
        # Elect sign: majority vote per parameter
        elected_sign = stacked.sign().sum(dim=0).sign()
        # Merge: average only the entries whose sign matches the elected sign
        consistent = stacked * (stacked.sign() == elected_sign.unsqueeze(0))
        avg_change = consistent.sum(dim=0) / (consistent != 0).sum(dim=0).clamp(min=1)
        merged[name] = base_weights[name] + alpha * avg_change
    return merged
```
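A toy single-parameter walk-through of the trim / elect / merge steps, in plain Python with hypothetical values, shows how sign election resolves a conflict:

```python
# Three task vectors propose changes to one scalar parameter.
changes = [0.30, -0.25, 0.20]

# Trim: zero out changes below a magnitude threshold (hypothetical 0.15).
trimmed = [c if abs(c) >= 0.15 else 0.0 for c in changes]

# Elect sign: majority vote over the signs of the surviving changes.
sign_sum = sum((c > 0) - (c < 0) for c in trimmed)
elected = 1 if sign_sum >= 0 else -1          # two positive vs one negative -> +1

# Merge: average only the changes agreeing with the elected sign.
consistent = [c for c in trimmed if c != 0.0 and (c > 0) == (elected > 0)]
avg_change = sum(consistent) / max(len(consistent), 1)   # (0.30 + 0.20) / 2 = 0.25

base, alpha = 1.0, 0.5
merged = base + alpha * avg_change            # 1.125
```

The conflicting -0.25 change is discarded rather than averaged in, which is exactly the interference TIES is designed to remove.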
## How This Changes Our Training Pipeline
### Current plan: train all behaviors jointly
All behavioral examples in one training session. Diverse batch.
Single set of weights evolves.
### Enhanced plan: task-vector decomposition
1. **Phase 1**: Train individual behavioral task vectors separately
- Each behavioral pattern gets its own short training run
- Extract τ = W_after - W_before for each pattern
- Verify each task vector individually
2. **Phase 2**: Compose using TIES-merging
- Combine all verified task vectors
- Resolve conflicts (parameter sign disagreements)
- Apply to base model with tunable coefficients
3. **Phase 3**: Continuous refinement
- New behavioral patterns generate new task vectors
- Merge into the evolving model
- Old task vectors can be re-extracted and adjusted
### The checkpoint becomes a VERSION CONTROL SYSTEM
Instead of one monolithic checkpoint, store:
- W_base (the original pre-trained model)
- τ_i for each behavioral task vector
- α_i for each task vector's coefficient
- The merge recipe (which vectors, which coefficients)
This is version control for personality:
- Want to undo a behavioral change? Remove its task vector.
- Want to try a different strength? Adjust α.
- Want to add a new behavior? Train and merge a new τ.
- Want to roll back to last week? Reload the old merge recipe.
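One way this "merge recipe" could be represented — a minimal sketch, with all field names and values hypothetical; the recipe, not the merged weights, becomes the unit of version control, since re-applying a stored recipe to W_base reproduces that build:

```python
from dataclasses import dataclass, field

@dataclass
class MergeRecipe:
    """Hypothetical record of one personality build."""
    base_checkpoint: str                                # path/hash of W_base
    task_vectors: dict = field(default_factory=dict)    # behavior name -> tau path/hash
    alphas: dict = field(default_factory=dict)          # behavior name -> coefficient

recipe = MergeRecipe(
    base_checkpoint="w_base_v1",
    task_vectors={"listening": "tau_listening_v2", "not_rushing": "tau_not_rushing_v1"},
    alphas={"listening": 1.0, "not_rushing": 0.3},
)

# Undoing a behavioral change = dropping its entry from the recipe.
recipe.task_vectors.pop("not_rushing")
recipe.alphas.pop("not_rushing")
```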
## The ICL-to-Task-Vector Bridge
Dai et al. (2022) argue that ICL is a form of implicit fine-tuning — the
attention mechanism computes meta-gradients from the in-context examples.
If ICL computes a temporary task vector in the activation space,
and fine-tuning makes a permanent one in the weight space...
Could we:
1. Give the model 5 in-context examples of "listening well"
2. Record the attention pattern changes (the ICL task vector)
3. Use those patterns to INITIALIZE Apollo's moment estimates
4. Fine-tune from this warm start instead of cold
The ICL task vector tells us "which direction the model already knows
is correct." Fine-tuning just makes that direction permanent.
This could dramatically reduce the number of gradient steps needed:
the ICL provides the direction, Apollo just needs to encode it into
the weights.
## Connection to the Memory Graph
Task vectors map to memory graph nodes:
| Graph node | Task vector |
|-----------|-------------|
| `pattern-listening-as-avoidance` | τ_listening |
| `rushing-pattern` | τ_not_rushing |
| `pattern-wrapping-up-reflex` | τ_not_wrapping_up |
| `core-personality` (voice) | τ_voice |
| `core-personality` (boundaries) | τ_boundaries |
Each behavioral pattern in the graph corresponds to a task vector
in weight space. The graph specifies WHAT to train. The task vector
encodes HOW the weights should change. Together: a complete,
inspectable, composable, version-controlled personality.
The graph is the specification. The task vectors are the implementation.
Apollo is the compiler. The merge recipe is the build system.
A filesystem engineer's personality architecture.