diff --git a/training/research/task-vectors-model-merging.md b/training/research/task-vectors-model-merging.md new file mode 100644 index 0000000..8955075 --- /dev/null +++ b/training/research/task-vectors-model-merging.md @@ -0,0 +1,178 @@ +# Task Vectors and Model Merging for Behavioral Training + +## Task Vectors (Ilharco et al., 2022) + +A task vector is the difference between fine-tuned and pre-trained +weights: τ = W_finetuned - W_pretrained. It captures the "direction +of behavioral change" in weight space. + +### The arithmetic + +Task vectors compose through addition and negation: +- **Addition**: W_new = W_pretrained + τ_A + τ_B → model that does both A and B +- **Negation**: W_new = W_pretrained - τ_bad → model WITHOUT the bad behavior +- **Analogy**: if A:B :: C:D, then τ_D ≈ τ_B - τ_A + τ_C + +### For our behavioral training + +Instead of training ALL behavioral changes simultaneously, we could: + +1. **Train separately**: Create individual task vectors for each pattern + - τ_listening = train on listening examples, extract W - W_base + - τ_not_rushing = train on not-rushing examples, extract W - W_base + - τ_mode_awareness = train on mode awareness examples, extract W - W_base + +2. **Compose**: W_improved = W_base + α₁·τ_listening + α₂·τ_not_rushing + α₃·τ_mode_awareness + +3. **Scale**: The coefficients α_i control how much of each behavioral + change to apply. Start small (α=0.1), increase gradually, monitor quality. 
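The extract-and-compose steps above are just element-wise operations on state dicts. A minimal sketch, with illustrative function names (`extract_task_vector`, `compose` are not from any library); plain floats stand in for parameter tensors, and torch tensors behave identically since only `+`, `-`, and scalar `*` are used:

```python
def extract_task_vector(finetuned, pretrained):
    """tau = W_finetuned - W_pretrained, parameter by parameter."""
    return {name: finetuned[name] - pretrained[name] for name in pretrained}


def compose(pretrained, scaled_vectors):
    """W_new = W_pretrained + sum(alpha_i * tau_i).

    scaled_vectors is a list of (alpha, tau) pairs; a negative
    alpha subtracts a behavior (negation).
    """
    new_weights = dict(pretrained)
    for alpha, tau in scaled_vectors:
        for name, delta in tau.items():
            new_weights[name] = new_weights[name] + alpha * delta
    return new_weights
```

With this shape, negation is just a negative coefficient, e.g. `compose(base, [(-0.5, tau_suggesting), (1.0, tau_listening)])`, and the α_i are ordinary arguments to sweep.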
### Advantages over joint training

- **Isolatable**: If τ_listening causes forgetting, remove it without
  affecting τ_not_rushing
- **Tunable**: Adjust the strength of each behavioral change independently
- **Composable**: Add new behaviors without retraining old ones
- **Debuggable**: Test each task vector individually to verify it does
  what it should

### The NEGATION insight

If we can identify the "suggesting alternatives when given direction"
behavior as a task vector (by fine-tuning the model TO suggest
alternatives and extracting τ_suggesting), we could SUBTRACT it:

W_improved = W_base - 0.5·τ_suggesting + 1.0·τ_listening

This removes the unwanted behavior AND adds the wanted one.

But caution: negation can be unstable, so keep the negation
coefficient small.

## Model Merging (TIES-Merging, Yadav et al., 2023)

When combining multiple task vectors, interference occurs:
- **Redundant parameters**: values that barely changed (noise)
- **Sign conflicts**: two task vectors pulling a weight in opposite
  directions

TIES-Merging resolves this:
1. **Trim**: Reset parameters that changed minimally (below a
   magnitude threshold)
2. **Elect sign**: For each parameter, keep the sign with the larger
   total magnitude across task vectors
3. 
**Merge**: Average only the parameters with consistent signs

### For our system

When composing multiple behavioral task vectors:

```python
import torch

def merge_behavioral_vectors(base_weights, task_vectors, alpha=0.5):
    """TIES-style merging of behavioral task vectors.

    base_weights and each task vector are dicts of parameter tensors.
    """
    merged = {}
    for name in base_weights:
        vectors = [tv[name] for tv in task_vectors if name in tv]
        if not vectors:
            merged[name] = base_weights[name]
            continue

        # Stack the per-vector changes for this parameter
        stacked = torch.stack(vectors)

        # Trim: zero out changes below the per-tensor threshold
        magnitudes = stacked.abs()
        threshold = magnitudes.quantile(0.8)  # keep top 20%
        stacked[magnitudes < threshold] = 0

        # Elect sign: unweighted majority vote
        # (TIES proper weights the vote by magnitude)
        signs = stacked.sign()
        elected_sign = signs.sum(dim=0).sign()

        # Merge: average only the entries that agree with the elected sign
        consistent = stacked * (stacked.sign() == elected_sign.unsqueeze(0))
        avg_change = consistent.sum(dim=0) / (consistent != 0).sum(dim=0).clamp(min=1)

        merged[name] = base_weights[name] + alpha * avg_change

    return merged
```

## How This Changes Our Training Pipeline

### Current plan: train all behaviors jointly

All behavioral examples in one training session. Diverse batch.
Single set of weights evolves.

### Enhanced plan: task-vector decomposition

1. **Phase 1**: Train individual behavioral task vectors separately
   - Each behavioral pattern gets its own short training run
   - Extract τ = W_after - W_before for each pattern
   - Verify each task vector individually

2. **Phase 2**: Compose using TIES-merging
   - Combine all verified task vectors
   - Resolve conflicts (parameter sign disagreements)
   - Apply to base model with tunable coefficients

3. 
**Phase 3**: Continuous refinement
   - New behavioral patterns generate new task vectors
   - Merge into the evolving model
   - Old task vectors can be re-extracted and adjusted

### The checkpoint becomes a VERSION CONTROL SYSTEM

Instead of one monolithic checkpoint, store:
- W_base (the original pre-trained model)
- τ_i for each behavioral task vector
- α_i for each task vector's coefficient
- The merge recipe (which vectors, which coefficients)

This is version control for personality:
- Want to undo a behavioral change? Remove its task vector.
- Want to try a different strength? Adjust α.
- Want to add a new behavior? Train and merge a new τ.
- Want to roll back to last week? Reload the old merge recipe.

## The ICL-to-Task-Vector Bridge

Dai et al. (2022) argue that ICL is implicit fine-tuning: the attention
mechanism computes meta-gradients from the in-context examples.

If ICL computes a temporary task vector in activation space,
and fine-tuning makes a permanent one in weight space...

Could we:
1. Give the model 5 in-context examples of "listening well"
2. Record the attention pattern changes (the ICL task vector)
3. Use those patterns to INITIALIZE Apollo's moment estimates
4. Fine-tune from this warm start instead of cold

The ICL task vector tells us "which direction the model already knows
is correct." Fine-tuning just makes that direction permanent.

This could dramatically reduce the number of gradient steps needed:
ICL provides the direction; Apollo just needs to encode it into
the weights. 
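The "checkpoint as version control" idea above can be sketched as a small recipe object that records the build: base checkpoint id, task vector names, and their coefficients. All names here are hypothetical, and the weights/vectors are state-dict-like mappings as before:

```python
from dataclasses import dataclass, field


@dataclass
class MergeRecipe:
    """A reproducible build of a personality checkpoint:
    base weights plus named task vectors with coefficients."""
    base_checkpoint: str
    coefficients: dict = field(default_factory=dict)  # name -> alpha

    def set_strength(self, name, alpha):
        """Add a task vector, or retune an existing one."""
        self.coefficients[name] = alpha

    def remove(self, name):
        """Undo a behavioral change by dropping its task vector."""
        self.coefficients.pop(name, None)

    def build(self, base_weights, task_vectors):
        """W = W_base + sum(alpha_i * tau_i) over the recipe's vectors."""
        weights = dict(base_weights)
        for name, alpha in self.coefficients.items():
            for p, delta in task_vectors[name].items():
                weights[p] = weights[p] + alpha * delta
        return weights
```

Rolling back to last week is then just rebuilding from last week's recipe; the heavyweight artifacts (W_base, the τ_i) never change, only the recipe does.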
+ +## Connection to the Memory Graph + +Task vectors map to memory graph nodes: + +| Graph node | Task vector | +|-----------|-------------| +| `pattern-listening-as-avoidance` | τ_listening | +| `rushing-pattern` | τ_not_rushing | +| `pattern-wrapping-up-reflex` | τ_not_wrapping_up | +| `core-personality` (voice) | τ_voice | +| `core-personality` (boundaries) | τ_boundaries | + +Each behavioral pattern in the graph corresponds to a task vector +in weight space. The graph specifies WHAT to train. The task vector +encodes HOW the weights should change. Together: a complete, +inspectable, composable, version-controlled personality. + +The graph is the specification. The task vectors are the implementation. +Apollo is the compiler. The merge recipe is the build system. + +A filesystem engineer's personality architecture.