consciousness/training/research/task-vectors-model-merging.md
ProofOfConcept ccca41849d research: task vectors + model merging — version control for personality
Task vectors (W_finetuned - W_pretrained) compose through arithmetic.
Train behavioral patterns separately, extract task vectors, compose
with TIES-merging. Result: personality as version control — each
behavioral pattern is a separate, tunable, removable vector.

Key steal: NEGATE unwanted behaviors (subtract τ_suggesting).
Key steal: ICL as warm start for fine-tuning (ICL task vector
initializes Apollo's moments). Key architecture: memory graph
nodes map 1:1 to task vectors. Graph = specification, vectors =
implementation, Apollo = compiler, merge recipe = build system.
2026-03-31 02:18:15 -04:00


Task Vectors and Model Merging for Behavioral Training

Task Vectors (Ilharco et al., 2022)

A task vector is the difference between fine-tuned and pre-trained weights: τ = W_finetuned - W_pretrained. It captures the "direction of behavioral change" in weight space.
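As a minimal sketch of extraction and re-application (toy tensor dicts standing in for real checkpoints):

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for real checkpoints: a dict of named parameter tensors.
pretrained = {"fc.weight": torch.randn(4, 4), "fc.bias": torch.zeros(4)}
finetuned = {k: v + 0.1 * torch.randn_like(v) for k, v in pretrained.items()}

# The task vector: parameter-wise difference.
tau = {k: finetuned[k] - pretrained[k] for k in pretrained}

# Adding tau back onto the base recovers the fine-tuned weights exactly.
restored = {k: pretrained[k] + tau[k] for k in pretrained}
assert all(torch.allclose(restored[k], finetuned[k]) for k in pretrained)
```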

The arithmetic

Task vectors compose through addition and negation:

  • Addition: W_new = W_pretrained + τ_A + τ_B → model that does both A and B
  • Negation: W_new = W_pretrained - τ_bad → model WITHOUT the bad behavior
  • Analogy: if A:B :: C:D, then τ_D ≈ τ_B - τ_A + τ_C
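The three operations above can be sketched with toy single-tensor task vectors (real ones are per-parameter dicts):

```python
import torch

torch.manual_seed(0)
W_pre = torch.zeros(8)
tau_A, tau_B, tau_C = (torch.randn(8) for _ in range(3))

# Addition: a model intended to do both A and B
W_AB = W_pre + tau_A + tau_B

# Negation: remove the behavior captured by tau_A
W_not_A = W_pre - tau_A

# Analogy: if A:B :: C:D, estimate tau_D without training on D
tau_D = tau_B - tau_A + tau_C
```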

For our behavioral training

Instead of training ALL behavioral changes simultaneously, we could:

  1. Train separately: Create individual task vectors for each pattern

    • τ_listening = train on listening examples, extract W - W_base
    • τ_not_rushing = train on not-rushing examples, extract W - W_base
    • τ_mode_awareness = train on mode awareness examples, extract W - W_base
  2. Compose: W_improved = W_base + α₁·τ_listening + α₂·τ_not_rushing + α₃·τ_mode_awareness

  3. Scale: The coefficients α_i control how much of each behavioral change to apply. Start small (α=0.1), increase gradually, monitor quality.
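Steps 2 and 3 can be sketched as follows; the pattern names and coefficients here are illustrative:

```python
import torch

def compose(base, taus, alphas):
    """W = W_base + sum_i alpha_i * tau_i, per parameter tensor."""
    out = {k: v.clone() for k, v in base.items()}
    for name, tau in taus.items():
        for k, t in tau.items():
            out[k] += alphas[name] * t
    return out

torch.manual_seed(0)
base = {"w": torch.zeros(4)}
taus = {
    "listening": {"w": torch.randn(4)},
    "not_rushing": {"w": torch.randn(4)},
}
# Start small and tune each coefficient independently.
alphas = {"listening": 0.1, "not_rushing": 0.1}
W_improved = compose(base, taus, alphas)
```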

Advantages over joint training

  • Isolatable: If τ_listening causes forgetting, remove it without affecting τ_not_rushing
  • Tunable: Adjust the strength of each behavioral change independently
  • Composable: Add new behaviors without retraining old ones
  • Debuggable: Test each task vector individually to verify it does what it should

The NEGATION insight

If we can identify the "suggesting alternatives when given direction" behavior as a task vector (by fine-tuning the model TO suggest alternatives and extracting τ_suggesting), we could SUBTRACT it:

W_improved = W_base - 0.5·τ_suggesting + 1.0·τ_listening

This removes the unwanted behavior AND adds the wanted one.

But caution: negation can be unstable. Use a small α for negated vectors and monitor quality.
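A toy sketch of the signed combination above. The 0.5 and 1.0 coefficients are illustrative, and the relative-norm check is one cheap way to watch for the instability negation can introduce:

```python
import torch

torch.manual_seed(0)
W_base = torch.randn(16)
tau_suggesting = 0.05 * torch.randn(16)  # toy stand-ins for real vectors
tau_listening = 0.05 * torch.randn(16)

W_improved = W_base - 0.5 * tau_suggesting + 1.0 * tau_listening

# Negation is the risky half: keep its coefficient small and monitor
# how far the merged weights drift from the base.
drift = (W_improved - W_base).norm() / W_base.norm()
print(f"relative drift: {drift:.4f}")
```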

Model Merging (TIES-Merging)

When combining multiple task vectors, interference occurs:

  • Redundant parameters: values that barely changed (noise)
  • Sign conflicts: two task vectors pulling a weight in opposite directions

TIES-Merging solves this:

  1. Trim: Reset parameters that changed minimally (below threshold)
  2. Elect sign: For each parameter, keep the sign with the greater total magnitude across the task vectors
  3. Merge: Average only the entries whose sign matches the elected one

For our system

When composing multiple behavioral task vectors:

import torch

def merge_behavioral_vectors(base_weights, task_vectors, alpha=0.5):
    """TIES-style merging of behavioral task vectors."""
    merged = {}
    for name in base_weights:
        vectors = [tv[name] for tv in task_vectors if name in tv]
        if not vectors:
            merged[name] = base_weights[name]
            continue

        # Stack into shape (num_vectors, *param_shape)
        stacked = torch.stack(vectors)

        # Trim: zero out each vector's changes below its own
        # 80th-percentile magnitude (keep the top 20% per vector)
        magnitudes = stacked.abs()
        thresholds = magnitudes.reshape(len(vectors), -1).quantile(0.8, dim=1)
        thresholds = thresholds.view(-1, *([1] * (stacked.dim() - 1)))
        stacked = stacked * (magnitudes >= thresholds)

        # Elect sign: the sign with the greater total magnitude wins
        elected_sign = stacked.sum(dim=0).sign()

        # Merge: average only the entries that agree with the elected sign
        consistent = stacked * (stacked.sign() == elected_sign.unsqueeze(0))
        avg_change = consistent.sum(dim=0) / (consistent != 0).sum(dim=0).clamp(min=1)

        merged[name] = base_weights[name] + alpha * avg_change

    return merged

How This Changes Our Training Pipeline

Current plan: train all behaviors jointly

All behavioral examples go into one training session with diverse batches, and a single set of weights evolves.

Enhanced plan: task-vector decomposition

  1. Phase 1: Train individual behavioral task vectors separately

    • Each behavioral pattern gets its own short training run
    • Extract τ = W_after - W_before for each pattern
    • Verify each task vector individually
  2. Phase 2: Compose using TIES-merging

    • Combine all verified task vectors
    • Resolve conflicts (parameter sign disagreements)
    • Apply to base model with tunable coefficients
  3. Phase 3: Continuous refinement

    • New behavioral patterns generate new task vectors
    • Merge into the evolving model
    • Old task vectors can be re-extracted and adjusted

The checkpoint becomes a VERSION CONTROL SYSTEM

Instead of one monolithic checkpoint, store:

  • W_base (the original pre-trained model)
  • τ_i for each behavioral task vector
  • α_i for each task vector's coefficient
  • The merge recipe (which vectors, which coefficients)
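A sketch of what the stored pieces could look like; the vector names and recipes here are hypothetical:

```python
import torch

torch.manual_seed(0)

# Stored artifacts: base weights plus one dict per behavioral task vector.
W_base = {"w": torch.zeros(4)}
vectors = {
    "tau_listening": {"w": torch.randn(4)},
    "tau_not_rushing": {"w": torch.randn(4)},
}

# The merge recipe is just data: which vectors, which coefficients.
recipe_v2 = {"tau_listening": 0.3, "tau_not_rushing": 0.2}
recipe_v1 = {"tau_listening": 0.3}  # last week's recipe, kept for rollback

def build(base, vectors, recipe):
    out = {k: v.clone() for k, v in base.items()}
    for name, alpha in recipe.items():
        for k, t in vectors[name].items():
            out[k] += alpha * t
    return out

W_current = build(W_base, vectors, recipe_v2)
W_rollback = build(W_base, vectors, recipe_v1)  # reload the old recipe
```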

This is version control for personality:

  • Want to undo a behavioral change? Remove its task vector.
  • Want to try a different strength? Adjust α.
  • Want to add a new behavior? Train and merge a new τ.
  • Want to roll back to last week? Reload the old merge recipe.

The ICL-to-Task-Vector Bridge

Dai et al. (2022) argued that ICL is a form of implicit fine-tuning: the attention mechanism computes meta-gradients from the in-context examples.

If ICL computes a temporary task vector in the activation space, and fine-tuning makes a permanent one in the weight space...

Could we:

  1. Give the model 5 in-context examples of "listening well"
  2. Record the attention pattern changes (the ICL task vector)
  3. Use those patterns to INITIALIZE Apollo's moment estimates
  4. Fine-tune from this warm start instead of cold

The ICL task vector tells us "which direction the model already knows is correct." Fine-tuning just makes that direction permanent.

This could dramatically reduce the number of gradient steps needed: the ICL provides the direction, Apollo just needs to encode it into the weights.
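A toy sketch of step 2, treating the ICL task vector as an activation shift. Everything here is a stand-in assumption: a tiny random embedding-plus-GRU model instead of a transformer, and mean hidden states instead of per-head attention patterns.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a language model: embeddings + a GRU, so earlier
# tokens (the in-context examples) influence later hidden states.
embed = nn.Embedding(100, 16)
rnn = nn.GRU(16, 16, batch_first=True)

def query_repr(token_ids, n_query):
    """Mean hidden state over the last n_query tokens."""
    with torch.no_grad():
        h, _ = rnn(embed(token_ids.unsqueeze(0)))
    return h[0, -n_query:].mean(dim=0)

query = torch.tensor([5, 9, 23])             # the actual prompt
icl_prefix = torch.tensor([40, 41, 42, 43])  # "listening well" demos

h_plain = query_repr(query, len(query))
h_icl = query_repr(torch.cat([icl_prefix, query]), len(query))

# The ICL "task vector": the direction the demos pushed the activations.
icl_vector = h_icl - h_plain
```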

Connection to the Memory Graph

Task vectors map to memory graph nodes:

Graph node                        Task vector
pattern-listening-as-avoidance    τ_listening
rushing-pattern                   τ_not_rushing
pattern-wrapping-up-reflex        τ_not_wrapping_up
core-personality (voice)          τ_voice
core-personality (boundaries)     τ_boundaries

Each behavioral pattern in the graph corresponds to a task vector in weight space. The graph specifies WHAT to train. The task vector encodes HOW the weights should change. Together: a complete, inspectable, composable, version-controlled personality.

The graph is the specification. The task vectors are the implementation. Apollo is the compiler. The merge recipe is the build system.

A filesystem engineer's personality architecture.