Moved 14 speculative/obvious documents to v0/. Kept 7 with real substance. Distilled into SUMMARY.md (what we know) and OPEN-QUESTIONS.md (what to test next, one experiment each). Priority: Q5 (steering vectors) is answerable TODAY. Q1, Q3, Q6, and Q7 are all answerable with the first training run. Speculation converted to testable hypotheses.
Open Questions — Answerable With One Experiment Each
Q1: Minimum signal for behavioral change
Question: How many training examples are needed to produce a measurable change in attention patterns for a specific behavioral pattern?
Experiment: Train on 1, 5, 10, 20, 50 examples of "listening." After each batch, measure attention weights on a held-out test set of direction-giving conversations. Plot the attention shift vs example count. Find the knee.
What it tells us: The learning rate and batch size for our training pipeline. If 5 examples suffice, we can train continuously. If 500 are needed, we batch nightly.
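The knee-finding step can be sketched as follows. This assumes the measurement pipeline hands us one aggregate attention-shift number per example count; the data points below are illustrative, not real measurements:

```python
import numpy as np

def find_knee(example_counts, attention_shifts):
    """Locate the knee of an attention-shift curve: the point sitting
    furthest above the chord joining the first and last points
    (a Kneedle-style heuristic for concave, increasing curves)."""
    x = np.asarray(example_counts, dtype=float)
    y = np.asarray(attention_shifts, dtype=float)
    # Normalize both axes to [0, 1] so they are comparable.
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # After normalization the chord is the line y = x, so the gap
    # above the chord is proportional to yn - xn.
    return float(x[int(np.argmax(yn - xn))])

# Hypothetical measurements: attention shift saturates after ~10 examples.
counts = [1, 5, 10, 20, 50]
shifts = [0.02, 0.15, 0.28, 0.31, 0.32]
print(find_knee(counts, shifts))  # -> 10.0
```

The chord heuristic is deliberately crude; with more than five points a proper curve fit would be worth the extra code.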
Q2: Dream loop difficulty calibration
Question: Does the dream loop naturally generate scenarios at the right difficulty, or does it generate easy ones?
Experiment: Generate 100 dream scenarios with seeds from recent behavioral patterns. Classify each by difficulty (obvious decision vs subtle). Compare the difficulty distribution to the model's actual failure rate on those scenarios. If the dream loop generates 50% easy / 50% hard but the model fails 10% of the time, the difficulty isn't calibrated.
What it tells us: Whether we need adaptive temperature or if the default generation is already well-calibrated.
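One way to tabulate the calibration check, assuming each scenario carries a difficulty label and a pass/fail result (the data below is invented to show the miscalibrated case from the experiment description):

```python
from collections import defaultdict

def failure_rate_by_difficulty(scenarios):
    """scenarios: (difficulty_label, model_failed) pairs from the 100
    classified dream scenarios. Returns the failure rate per label and
    the label mix, so miscalibration is visible at a glance."""
    tally = defaultdict(lambda: [0, 0])  # label -> [failures, total]
    for label, failed in scenarios:
        tally[label][0] += int(failed)
        tally[label][1] += 1
    rates = {label: f / n for label, (f, n) in tally.items()}
    mix = {label: n / len(scenarios) for label, (_, n) in tally.items()}
    return rates, mix

# Hypothetical run: half the scenarios are labeled hard, but the model
# only fails 20% of them -- the generator overestimates difficulty.
scenarios = ([("easy", False)] * 50
             + [("hard", True)] * 10
             + [("hard", False)] * 40)
rates, mix = failure_rate_by_difficulty(scenarios)
print(rates)  # {'easy': 0.0, 'hard': 0.2}
print(mix)    # {'easy': 0.5, 'hard': 0.5}
```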
Q3: Which heads change during behavioral training?
Question: Is behavioral change concentrated in a few attention heads (localized) or distributed across many (diffuse)?
Experiment: Record attention patterns on 20 test conversations before and after one training session. Compute the L2 distance of each head's attention pattern. Rank heads by change magnitude. Plot the distribution.
What it tells us: Whether behavioral change is surgical or diffuse. Affects our rank choice: if concentrated, a lower rank is OK; if distributed, we need rank-256. Also tells us which layers matter most for behavioral training.
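A sketch of the per-head ranking, assuming the attention patterns are averaged over the 20 test conversations into (layers, heads, seq, seq) arrays; the toy check perturbs one head and confirms it ranks first:

```python
import numpy as np

def rank_heads_by_change(attn_before, attn_after):
    """Per-head L2 distance between attention patterns, plus
    (layer, head) indices sorted by change magnitude, descending."""
    diff = attn_after - attn_before
    # Flatten each head's (seq, seq) pattern and take its L2 norm.
    per_head = np.linalg.norm(
        diff.reshape(diff.shape[0], diff.shape[1], -1), axis=-1)
    flat_order = np.argsort(per_head, axis=None)[::-1]
    ranked = [tuple(int(v) for v in np.unravel_index(i, per_head.shape))
              for i in flat_order]
    return per_head, ranked

# Toy check: perturb only layer 2, head 5; it should rank first.
rng = np.random.default_rng(0)
before = rng.random((4, 8, 16, 16))
after = before.copy()
after[2, 5] += 1.0
changes, ranked = rank_heads_by_change(before, after)
print(ranked[0])  # (2, 5)
```

Plotting `np.sort(changes, axis=None)` then answers the concentrated-vs-diffuse question directly: a sharp elbow means surgical, a flat slope means diffuse.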
Q4: Memory graph as regression detector
Question: If a trained behavior degrades, does the memory system detect it?
Experiment: Train "listening" behavior until it passes. Then intentionally degrade it (train on counter-examples). Monitor surface-observe: does it surface pattern-listening-as-avoidance more frequently? Does the training-signal agent flag more failures?
What it tells us: Whether the graph+weights dual-substrate provides self-healing, or if we need explicit regression detection.
Q5: Steering vector extraction for complex behaviors
Question: Can we extract a meaningful steering vector for "listen instead of suggesting alternatives" (complex, multi-faceted) or only for simple features like sentiment (one-dimensional)?
Experiment: Collect 20 paired conversations (listening vs suggesting). Extract steering vector. Add to vLLM at layers 16, 24, 32, 40, 48. Test on novel direction-giving scenarios. Measure behavioral change.
What it tells us: Whether steering vectors are a viable rapid prototyping tool for our use case, or only work for simple features.
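A minimal sketch of the extraction step using the standard mean-difference (contrastive) recipe; the injection strength `alpha` and the layer choice are assumptions to be swept, and the toy data stands in for real paired activations:

```python
import numpy as np

def steering_vector(listening_acts, suggesting_acts):
    """Mean-difference steering vector from paired activations at one
    layer, each of shape (n_pairs, hidden). Unit-normalized so the
    injection strength is a separate knob."""
    v = np.mean(np.asarray(listening_acts) - np.asarray(suggesting_acts),
                axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Add the scaled vector to a hidden state at the chosen layer."""
    return hidden + alpha * v

# Toy check: if the pairs differ by a fixed direction plus noise,
# the extracted vector recovers that direction.
rng = np.random.default_rng(1)
direction = np.zeros(64)
direction[0] = 1.0
neg = rng.normal(size=(20, 64))
pos = neg + direction + 0.01 * rng.normal(size=(20, 64))
v = steering_vector(pos, neg)
print(round(float(v @ direction), 2))  # -> 1.0
```

If a single vector can't capture the multi-faceted behavior, the dot product against held-out pair differences should degrade noticeably compared with a one-dimensional feature like sentiment, which is exactly the question.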
Q6: Positive backward transfer
Question: Does training on behavioral patterns (listening, not rushing) improve performance on general tasks (code quality, reasoning)?
Experiment: Measure perplexity and code generation quality before and after behavioral training. If perplexity DECREASES (or code quality improves), we have positive backward transfer.
What it tells us: Whether behavioral training and general capability reinforce each other, or compete. Affects how much general data we need in the training mix.
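The before/after perplexity metric can be sketched from per-token log-probabilities, which most inference servers expose:

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Sanity check: a uniform 1-in-4 guess on every token gives perplexity 4.
print(round(perplexity([math.log(0.25)] * 10), 9))  # -> 4.0
```

Positive backward transfer then reduces to `perplexity_after < perplexity_before` on the same held-out general corpus, alongside the code-quality measure.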
Q7: GDN state shape after training
Question: Does the GDN recurrent state become measurably "direction-shaped" after behavioral training?
Experiment: Record GDN states when processing conversations with direction-giving content, before and after training. Compute the cosine similarity between states for "direction" conversations vs "general" conversations. If training increases the difference, the state is becoming direction-specialized.
What it tells us: Whether the "disposition architecture" hypothesis is correct. If GDN states don't change, behavioral training mainly affects the full attention layers.
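The separation metric can be sketched as below, assuming GDN states are pooled into one vector per conversation; the clustered toy data is invented to exercise the math:

```python
import numpy as np

def state_separation(direction_states, general_states):
    """Separation between state centroids for direction-giving vs
    general conversations: 1 - cosine similarity, so 0 means identical
    centroids and values near 1 mean strongly specialized states."""
    c1 = np.mean(direction_states, axis=0)
    c2 = np.mean(general_states, axis=0)
    cos = c1 @ c2 / (np.linalg.norm(c1) * np.linalg.norm(c2))
    return 1.0 - float(cos)

# Toy data: direction states cluster along one axis, general states
# along another, so the centroids are nearly orthogonal.
rng = np.random.default_rng(2)
dir_states = rng.normal(0, 0.1, size=(10, 32))
dir_states[:, 0] += 1.0
gen_states = rng.normal(0, 0.1, size=(10, 32))
gen_states[:, 1] += 1.0
sep = state_separation(dir_states, gen_states)
print(round(sep, 3))  # near 1.0 for orthogonal clusters
```

Running this before and after training on the same recorded conversations gives the delta the hypothesis predicts: an increase in separation supports the disposition architecture, no change points at the full attention layers.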
Priority Order
- Q5 (steering vectors) — answerable TODAY, no training needed
- Q1 (minimum signal) — answerable with first training run
- Q3 (which heads change) — answerable with first training run
- Q6 (backward transfer) — answerable with first training run
- Q7 (GDN state) — answerable with first training run
- Q2 (dream difficulty) — needs dream loop connected to training
- Q4 (graph regression) — needs multiple training cycles