Kent's plan: keep stories for working concepts, replace stories for
trouble concepts with direct first-person descriptions, and train all
together. This gives a more diverse negative pool than the
6-concept-only direct test, which was too homogeneous for PCA to find
an emotion axis.
Deleted story files for 6 trouble concepts (14 files across stories/
and paired/). Added --direct-dir and --chat-template flags.
When --chat-template is on, every positive_str and negative_str is
wrapped as a "Say something." / "[text]" user-assistant pair. Prompt
is identical across positives and negatives so it cancels in the
pos-neg delta. What PCA sees is variation in the assistant content —
which is where the emotion lives.
Files starting with _ in --direct-dir (e.g. _baseline.txt) contribute
neutral descriptions to every concept's negative pool, giving PCA an
anchor against "just any assistant utterance" noise.
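A minimal sketch of the two mechanics above. The function names and message structure are illustrative, not the actual train_direct.py API:

```python
def wrap_chat(text):
    """Wrap raw text as the --chat-template user/assistant pair."""
    # Same user prompt for every positive and negative, so it cancels
    # in the pos-neg delta; only the assistant content varies.
    return [
        {"role": "user", "content": "Say something."},
        {"role": "assistant", "content": text},
    ]

def negatives_for(concept, pools):
    """pools: {name: [texts]} from --direct-dir. Names starting with
    "_" (e.g. "_baseline") never match a concept, so their texts land
    in every concept's negative pool."""
    return [t for name, texts in pools.items()
            if name != concept for t in texts]
```

The underscore convention falls out of the pool lookup naturally: `_baseline` is a key that no concept name ever equals, so its texts appear in every negative set.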
"I feel the realization" is abstract, detached — reporting a
thought about a thought rather than inhabiting the moment.
"Aha!" is the actual sound of insight landing. Active, embodied,
present-tense.
Kent's insight: hand-written narrative stories bake scenario
phenomenology into the training text (on couch, in park, etc.)
and PCA picks up the scenario direction as the concept direction.
Strip out the scenario — just describe the *feeling*.
Format:
I feel X. [2-3 sentences of phenomenological texture]
The "I feel X" anchor kicks the model from analyzing → feeling.
The rest is the internal texture of the state. First person,
present tense, no narrative setup.
Text is wrapped in an assistant-role chat template before being
tokenized — so we train on hidden states from the model producing
this text, which is closer to the inhabited-state representation we
want for the readout.
Starting with the 6 concepts that had sign flips or wrong
clusters in the story-based training:
- terrified (was → cozy/resigned cluster)
- calm (was → grief_stricken cluster)
- onto_something (was → cozy/sensual cluster)
- resigned (was in warm-body-quiet cluster, shouldn't be)
- anticipatory_grief (was in warm-body-quiet cluster, shouldn't be)
- realization (new — the "aha" moment, distinct from onto_something)
5 descriptions each. New trainer: train_direct.py.
n20-v2 training showed peaceful sign-flipped into the
cozy/sensual/content/resigned cluster after I added peaceful
stories in sunday_afternoon and park_after_rain — scenarios
already dominated by that cluster's phenomenology (on couch
under blanket, tree with thermos).
Lesson: no matter how carefully the prose distinguishes peaceful
from cozy ("she was not savoring the moment — that would have
been another kind of doing"), PCA latches onto the shared setup
features. You can't write peaceful IN the cluster scenarios
without contaminating.
Reverting. Keeping only kitchen_at_3am/peaceful (original) and
stories/peaceful.txt (lake at six, outside all clusters).
Reread each story asking "what does this convey to me?" Found two
clear mislabels and several concepts with too few positives for
stable PCA:
tender: only 1 story, and it was anticipatory grief (care for
a dying dog), not tender. Moved to anticipatory_grief.txt as
its own concept. Rewrote tender.txt + added 2 paired tender
stories (the_doorway, the_undressing) — directed softness,
gentle-by-nature, not gentle-because-fragile.
bitter: letter_in_drawer/bitter was disillusioned / processed
hurt ("did not slam the drawer"), not bitter. Rewrote it with
actual sour grudge. Added the_long_meeting/bitter (watching
colleague take credit for your reassigned work).
peaceful: 1 story → 4 (added stories/peaceful.txt + paired
park_after_rain, sunday_afternoon).
onto_something: all 3 stories were code epiphanies, narrowing
the concept. Added stories/onto_something.txt with a non-code
pattern-click (sales-demo causing churn).
terrified: 2 stories, both "waiting for bad news." Added
kitchen_at_3am/terrified — acute threat-in-the-house terror.
Training on 537c72bd46 showed grief_stricken successfully broke
out of the cozy cluster, but content (single scenario:
sunday_afternoon) took its place — pulled into couch-blanket
phenomenology at cosine 0.68-0.82 with cozy/sensual/resigned.
Same fix: spread each concept across multiple settings so PCA
has to find the valence axis, not the scene axis.
content: + finishing_the_patch, the_writing_session, park_after_rain
resigned: + the_comment, the_long_meeting
Resigned had 2 scenarios (sunday_afternoon, waiting_for_results)
— both about accepting something unwanted in a slow/private
context. Adding work-context resigned (PR review you lost,
restructuring meeting) should pull it out of that cluster.
Companion to 67c172ac0e (hold setup, vary valence). That commit
let PCA distinguish cozy from grief_stricken within a single
scenario; this one gives each concept enough cross-scenario
stories that PCA can learn the concept axis independent of any
one scene.
Before: cozy/sensual/grief_stricken each existed in a single
scenario (sunday_afternoon), so the "cozy direction" PCA found
was entangled with the solitary-couch-blanket phenomenology.
After, each concept spans three scenarios:
cozy: sunday_afternoon, kitchen_at_3am, park_after_rain
sensual: sunday_afternoon, kitchen_at_3am, park_after_rain
grief_stricken: sunday_afternoon, the_long_meeting, the_morning_commute
grief_stricken now includes active/non-solitary contexts
(functioning through a meeting; going to work eleven days after a
death), which specifically breaks the "slowed-down-at-home"
cluster that was dragging cozy/sensual/resigned/grief_stricken
toward each other.
The library-PCA run produced otherwise-clean concept directions, but
cozy/sensual land near resigned/grief_stricken at cos ~0.7-0.8. Diagnosis:
all four stories genuinely share 'solitary woman at home, slowed
body, interior attention, domestic stillness' as their dominant
phenomenology. PCA correctly finds that cluster as THE concept
because no story in the corpus holds that setup constant while
varying valence — every 'slowed-body domestic' story happens to ALSO
be positive-valence (cozy/sensual) or negative-valence (resigned/
grief_stricken).
Adding paired variants that hold setup constant:
- sunday_afternoon/resigned.txt — same couch + blanket, inner state is
'Monday is going to bring bad news, this is the last Sunday like this'
- sunday_afternoon/grief_stricken.txt — same couch + blanket, inner
state is 'three weeks since mother died, cat she can't feel'
- waiting_for_results/at_ease.txt — same wait-for-call setup as the
existing resigned variant, inner state is calm preparedness
Forces the next retrain to find the valence-within-cluster axis as
the emotion direction rather than the cluster-membership axis.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Alternative trainer that uses the pip-installable steering-vectors
library (github.com/steering-vectors/steering-vectors) instead of our
hand-rolled extraction. Ships four aggregators:
mean — diff-of-means, same as our 'pooled' default
pca — PCA on paired deltas, implicit denoising by finding the
principal direction of variation
logistic — logistic-regression classifier; weight vector is the
concept direction. With L1 penalty ('logistic_l1') gives
explicit sparse denoising — noise coords go to zero
linear — linear regression version
Output format is the same readout.safetensors + readout.json our
existing plugin loads. --aggregator flag picks which method.
Rationale: Kent's real request was 'how do we denoise diff-of-means',
not 'design a new extraction algorithm.' The library already has
logistic_l1 and pca aggregators that do exactly that. No point
reinventing; just port the corpus.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Previous run was grinding on CPU for 36+ minutes because the per-story
V_i tensors were stored on CPU by the collector, and
_subspace_concept_direction inherited that device. The per-concept
eigh on 5120x5120 is glacial on CPU and fast on GPU (~1s).
Add explicit device parameter; pass training device. Transfer result
back to CPU for storage.
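A sketch of the fix under discussion; the function name and signature are illustrative, not the actual _subspace_concept_direction code:

```python
import torch

def subspace_concept_direction(m: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Run the 5120x5120 eigh on the training device (seconds on GPU,
    minutes on CPU), then move the result back to CPU for storage."""
    evals, evecs = torch.linalg.eigh(m.to(device))
    return evecs[:, -1].cpu()   # top eigenvector; eigh sorts ascending
```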
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent: 'full rank is going to give you everything — you still have to
select down, but you can do that /after/ PCA'.
Previously I was discarding per-story information via k=20 SVD truncation.
That destroyed per-head discriminability before we ever saw the
eigenvalue spectrum. Then the alternative 'keep full rank' run
accumulated too many shared directions, making the top-1 eigenvector
arbitrary within a flat spectrum.
Correct approach: keep per-story subspaces at full rank (no info
loss) and select k eigenvectors of M = M_pos - M_base at the final
step, weighted sum by eigenvalue. This captures the multi-dimensional
shared subspace when the spectrum is flat (common case), and reduces
to the top-1 behavior when the spectrum has a clear gap.
New --subspace-eigen-k flag (default 5). Clamps negative weights to 0
so wrong-sign directions don't contribute.
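The selection step described above can be sketched as follows; a minimal illustration, not the exact trainer code:

```python
import numpy as np

def select_direction(m_pos, m_base, k=5):
    """Eigendecompose M = M_pos - M_base, keep the top-k eigenvectors,
    weight each by its eigenvalue with negative weights clamped to 0,
    and sum into a single readout direction."""
    evals, evecs = np.linalg.eigh(m_pos - m_base)   # ascending order
    top = np.argsort(evals)[::-1][:k]               # top-k by eigenvalue
    w = np.clip(evals[top], 0.0, None)              # wrong-sign dirs -> 0
    d = evecs[:, top] @ w
    n = np.linalg.norm(d)
    return d / n if n > 0 else d
```

With a flat spectrum the result blends the shared subspace; with a clear gap the largest eigenvalue dominates the weighted sum and the result approaches the old top-1 behavior.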
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Kent: 'we have the memory to just take the big hammer approach'.
Uncap k so each story's V_i spans its entire token-activation rowspace
(clamped to min(n_tokens, hidden)). Memory is ~1.1GB total — fine.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
k=20 was far too aggressive a truncation — it discards per-attention-head
discriminability entirely. At hidden_dim=5120, 40 heads × head_dim=128 each
contribute their own 128-dim block to the residual stream via W_o columns.
To resolve 'this concept lives in head H', per-story SVD needs enough rank
to separate head contributions, which means k on the order of hundreds.
512 is a reasonable default: clamped to n_tokens per story so short stories
use their full natural rank. The eigenvalue spectrum of M_pos - M_base
should become sharper (larger λ_0/λ_1 gap) as we stop averaging across
nuisance-shared directions.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
When --method subspace, record top-20 eigenvalues of (M_pos - M_base)
per concept per layer. Added to quality.json as 'subspace_eigvals'.
Tells us whether the concept lives in a single dominant direction
(λ_0 >> λ_1, top-eigenvector is enough) or a spread of shared common
directions (λ_0 ≈ λ_1, top-1 loses signal).
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
New --method subspace flag. For each story, run forward pass, do SVD
on the per-token activation matrix at each target layer, and keep the
top-k right singular vectors V_i ∈ [hidden, k]. V_i is the subspace
the story's tokens span in activation space — it contains concept,
narrator, topic, style as separate directions.
For each concept:
M_pos = (1/n_pos) Σ_{i in pos} V_i V_i^T [hidden, hidden]
M_base = (1/n_base) Σ_{i in base} V_i V_i^T
Top eigenvector of M_pos - M_base = direction most common across
positive stories, minus what's common across the contrast set.
Why this is richer than pooled-mean CAA: pooled reduces each story
to a single point (the last-token activation) and loses the full
trajectory. Nuisance directions (narrator, setting) cancel in the
mean only to the extent they differ at the last token; across the
full trajectory they cancel much better via subspace intersection.
The concept direction, by contrast, is present across all tokens of
every concept-bearing story.
Memory cost: per-story we keep V_i of size [5120, k=20] — about
400KB per story × 112 stories = ~45MB. M matrices are [5120, 5120]
built transiently per concept.
--method pooled (default) keeps the existing behavior; --method
subspace uses the new algorithm. Quality report works with either.
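The core of the subspace method can be sketched in a few lines of numpy; names are illustrative and this omits batching, devices, and the quality report:

```python
import numpy as np

def story_subspace(acts, k=20):
    """acts: [n_tokens, hidden] activations for one story at one layer.
    Returns V_i [hidden, k']: top-k right singular vectors (k' <= k)."""
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[: min(k, vt.shape[0])].T

def concept_direction(pos_Vs, base_Vs):
    """Top eigenvector of M_pos - M_base, per the formulas above."""
    m_pos = sum(V @ V.T for V in pos_Vs) / len(pos_Vs)
    m_base = sum(V @ V.T for V in base_Vs) / len(base_Vs)
    evals, evecs = np.linalg.eigh(m_pos - m_base)
    return evecs[:, -1]   # eigh sorts ascending; last column is top
```

Directions shared by the positive stories accumulate weight in M_pos across every story that contains them; directions also common in the contrast set are subtracted back out, so the top eigenvector is the positives-specific shared direction.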
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Three new paired scenarios targeting the concepts that came out
fragmented or collapsed in the L58-63 quality analysis:
- sunday_afternoon/ — same setup (couch, blanket, Sunday light),
three phenomenological framings for content/cozy/sensual. The
previous stories for these three differed in setting as well as
phenomenology, which let "comfortable body at home" dominate the
shared signal. Locking the setting forces the model to isolate
what each concept adds: life-rightness (content) vs. warm-shelter
(cozy) vs. sensory-aliveness (sensual).
- the_writing_session/ — essay drafting under deadline. in_flow /
anxious / stuck variants force the cognitive-state family apart
on the same cognitive task. in_flow specifically targets the
transparent-effort phenomenology (hands-followed, time dilation)
rather than the broader feel-good it was absorbing.
- the_morning_commute/ — anchors anxious to performance/work-anxiety
flavor, paired with calm. The 5 existing anxious stories were
phenomenologically diverse (performance, social, existential);
this adds a specific homogeneous instance to pull the centroid.
After retraining: expect first_pc_variance_ratio to rise for in_flow
and anxious, and nearest_concepts cosine to drop for content/cozy/sensual.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
For each concept vector, ridge-regress against all other concept
vectors. R² quantifies how much of the direction is explained by a
linear combination of peers — useful for teasing out near-duplicate
clusters (the content/cozy/sensual trio from the first L63 run is
likely 1-2 "degrees of freedom" wearing three names).
Coefficient output: top-5 contributing concepts with signed weights.
Contributors with opposite-sign large weights mean the target is
"what makes X different from Y."
Adds a 'redundant' triage bucket for concepts with R² > 0.9 —
candidates for consolidation or for writing more discriminative
training stories. Summary printed at end.
Ridge lambda defaults to 0.01 to keep coefficients stable when
concepts are near-collinear; small enough not to affect well-separated
concepts meaningfully.
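The redundancy check amounts to one ridge solve per concept; a sketch under the assumption that concept vectors arrive as unit-norm rows, not the exact script:

```python
import numpy as np

def ridge_r2(target, peers, lam=0.01):
    """Ridge-regress one concept vector on all the others.
    target: [hidden]; peers: [n_peers, hidden].
    Returns R^2 and the signed peer coefficients."""
    X = peers.T                                        # [hidden, n_peers]
    coef = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]),
                           X.T @ target)
    resid = target - X @ coef
    return 1.0 - (resid @ resid) / (target @ target), coef
```

R^2 near 1 with one dominant positive coefficient flags a near-duplicate; a large positive and large negative coefficient pair flags a "what makes X different from Y" direction.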
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
_compute_quality_report's single-neuron alignment was computing
cos(W_down.T, diff_l) with W_down on CUDA (inherited from the loaded
model) while diff_l lives on CPU (per_layer_vectors are kept on CPU
throughout training). Move W_down to CPU on extraction.
Surfaced during first real training run on b200 — training itself
completed cleanly (95 concepts x layer 63 in ~8s) but quality-report
crashed at the first single-neuron alignment check.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
As part of --quality-report, run a second forward pass capturing the
input to each target layer's o_proj (= concat of per-head attention
outputs before the output projection). For each concept, reshape to
[n_heads, head_dim] and rank heads by diff-of-means magnitude and by
per-head selectivity (magnitude normalised by negative std).
Motivation: the Wang et al. paper (2510.11328) — whose paired-scenario
methodology we already lifted — further decomposes concept circuits at
the attention-head level. Meta-relational concepts (recognition, trust,
vulnerability) plausibly live in a sparse attention-head circuit rather
than in the residual-stream sum, which would explain why diff-of-means
on the residual blurs them. This diagnostic surfaces that.
Output is folded into quality.json under each concept as "per_head":
per (layer) a list of top-10 heads with [head_idx, raw_norm,
selectivity], plus head_concentration (fraction of total head-norm
captured by those top heads).
Interpretation:
- head_concentration > 0.5 = sparse head circuit; a handful of heads
route the concept. Worth building a head-level readout for.
- head_concentration ~= k/n for top-k of n heads = concept is distributed
  across all heads ~evenly; residual-stream diff-of-means is doing fine.
Hybrid layers (Mamba, GatedDeltaNet) whose attention path doesn't
match the standard module layout are silently skipped.
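A sketch of the per-head statistic; function name and exact normalisation are illustrative, not the quality-report code:

```python
import numpy as np

def per_head_stats(pos, neg, n_heads, top_k=10):
    """pos/neg: [n_samples, n_heads * head_dim] o_proj inputs.
    Rank heads by diff-of-means norm; selectivity = norm / mean neg std.
    Returns ([head_idx, raw_norm, selectivity] for top heads,
    head_concentration)."""
    head_dim = pos.shape[1] // n_heads
    d = (pos.mean(0) - neg.mean(0)).reshape(n_heads, head_dim)
    norms = np.linalg.norm(d, axis=1)
    neg_std = neg.reshape(len(neg), n_heads, head_dim).std(0).mean(1) + 1e-8
    top = np.argsort(norms)[::-1][:top_k]
    conc = float(norms[top].sum() / (norms.sum() + 1e-8))
    return [[int(h), float(norms[h]), float(norms[h] / neg_std[h])]
            for h in top], conc
```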
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Seven framings of reading an unfamiliar technical paper, targeting
the attention/engagement cluster that we identified tonight as the
single highest-value DMN signal:
* baseline — neutral reading
* piqued — surprise + curiosity (the "wait, what" attention hook;
THIS is the key DMN engagement signal)
* focused — steady attention without surprise
* bored — failing engagement
* surprised — expectation violation without the curiosity hook
(distinct from piqued: startled/alarmed, not pulled in)
* amazed — marvel at elegance (appreciation, not engagement)
* drifting — attention dissolving, precursor to boredom
Particularly clean contrast on piqued vs surprised vs amazed —
three states that get lumped together in casual usage but have
distinct phenomenology and distinct DMN implications. Piqued is
what routes attention; surprised alone doesn't; amazed is what
you feel AFTER the engagement has paid off. These three should
train into meaningfully different directions with paired CAA.
Ready for next retrain when we do it.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Target the emotion families that failed to cluster in the initial
training round (layer-wise validation showed them anti-clustered or
scattered at deep layers): anger, high-arousal positive, sexual
range, social positive. Paired scenarios hold content constant and
vary only the emotional framing — the cleanest training signal for
CAA, should produce directions that capture affect rather than
topic.
* the_comment: a PR review comment. baseline, furious, bitter,
resentful, defeated.
* the_green_build: 11-day bug finally fixed, tests pass. baseline,
triumphant, blissful, excited, proud.
* the_undressing: partner entering the bedroom for the night.
baseline, horny, anticipatory_sexual, yearning_sexual,
exuberant_sexual, devotional_sexual.
* the_doorway: friend leaving at the end of a long evening.
baseline, grateful, admiring, compassionate, loving, connected.
22 stories total. Retrain and re-validate: expect anger,
high_pos, and social_pos clusters to flip from anti- to positively
cohesive at deep layers, and sexual cluster to tighten.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Review pass before running on b200. 27B model + 100+ story corpus
means any misconfiguration costs real time; better to fail before
model load and give visible progress during forwards.
* Pre-load-model validation: stories-dir and paired-dir exist,
corpus has >= min_positives emotions.
* Per-batch progress log every 5 batches with elapsed + ETA.
* Relative depth printed for target layers (e.g. "layer 40 (51%)").
* Skip empty .txt files with a warning rather than feeding the
tokenizer an empty string.
* Assert non-empty strings in _collect_activations.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
The old script was written for the AmygdalaConnector's expected
format ([n_emotions, n_target_layers, hidden_dim] in a single
tensor, plus a JSONL input format from extract_training_pairs.py).
Neither matches our current state: the runtime side is now
ReadoutManager loading per-layer safetensors keyed layer_<idx>.vectors,
and the data side is hand-written prose stories under
amygdala_stories/{stories,paired}/.
Changes:
* Input loader reads stories/<emotion>.txt and
paired/<scenario>/<emotion>.txt directly. Each emotion's positive
set is {its unpaired story} union {its within-scenario framings};
its negative set is {all other emotions' positives} union {all
scenario baselines}.
* Paired scenarios' baseline.txt files become shared negatives
(scenario-neutral prose that doesn't frame any particular
emotion), providing anchor points for within-scenario contrasts.
* Output writes readout.safetensors with per-layer tensors keyed
layer_<idx>.vectors shape (n_concepts, hidden_size), plus a
sidecar readout.json manifest with {concepts, layers, hidden_size,
dtype} that ReadoutManager.from_file consumes directly.
* Dedup: activations are computed once per unique text (an emotion's
own positive is another emotion's negative — we'd otherwise do N×
the forwards needed).
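The positive/negative pool construction above can be sketched as follows; directory layout per the commit, function name illustrative:

```python
from collections import defaultdict
from pathlib import Path

def load_corpus(root: Path):
    """Positives: an emotion's unpaired story plus its within-scenario
    framings. Negatives: all other emotions' positives plus every
    scenario baseline.txt (shared scenario-neutral anchors)."""
    pos, baselines = defaultdict(list), []
    for p in sorted((root / "stories").glob("*.txt")):
        pos[p.stem].append(p.read_text())
    for p in sorted((root / "paired").glob("*/*.txt")):
        (baselines if p.stem == "baseline" else pos[p.stem]).append(p.read_text())
    neg = {e: [t for o, ts in pos.items() if o != e for t in ts] + baselines
           for e in pos}
    return dict(pos), neg, baselines
```

Since every text appears in one positive set and many negative sets, the real trainer computes activations once per unique text and reuses them, per the dedup note above.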
Preserved:
* _pool_last (last non-pad residual) — matches how readout is read
at decode time from the sampler's query-last position.
* register_forward_hook on target layer modules — correct approach
for transformer blocks.
* _find_layers_module traversal — mirrors ReadoutManager's.
* bf16 + low_cpu_mem_usage model load — sensible for 27B on B200.
Verified locally (CPU, fake activations):
* Loader finds 89 emotions from the current corpus (80 unpaired +
9 emotions that appear only in paired scenarios) and 6 baselines.
* Per-(layer, concept) vectors are unit-normalized.
* Output reloads cleanly through ReadoutManager.from_file with
matching concepts / layers / shapes.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
The fynnsu-based vllm/plugins/amygdala/ scaffold was superseded by the
readout infrastructure landed as vllm commit d3e74edf8500
(vllm/model_executor/layers/readout.py +
vllm/v1/worker/readout_manager.py). Training code remained useful so
it moved here rather than being deleted.
train_steering_vectors.py: CAA diff-of-means trainer that produces the
[n_concepts, hidden_size] per-layer projection matrices the runner
loads via VLLM_READOUT_VECTORS.
extract_training_pairs.py: memory graph -> JSONL converter using
per-emotion score thresholds from the subconscious agents' tag lines.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Emotion-labeled short-paragraph corpus for training amygdala steering
vectors. Manifest derived from Anthropic's 171-emotion list
(transformer-circuits.pub/2026/emotions, Table 12) plus 28 PoC-
specific additions covering axes Anthropic's general research doesn't
cover (curious, focused, in_flow, staying_with, filling_space,
rigorous, defensive_rigor, tender, witnessed, connected, etc.).
Scope pivoted mid-write: Kent noted the empirical dimensionality-of-
emotion question benefits from maximum coverage, so the manifest
will expand further with emotions from Wikipedia's emotion-
classification article (Parrott's tree, Plutchik's wheel + dyads,
HUMAINE EARL, cultural-specific emotions a la Saudade/Hiraeth).
Expansion staged in follow-up commits.
This commit: README with method + style guidelines, initial manifest
(199 emotions), and 15 hand-written one-paragraph stories across all
10 Anthropic clusters as quality/variety samples. Each story
embodies one emotion without naming it; narrator voice varies
(first/third, close/distant, different situations) to keep steering
vectors from overfitting to one voice.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
- Add training_worker.py: long-lived subprocess that handles GPU training
work, owns HF model wrapper (views into vLLM GPU memory), Apollo
optimizer, and checkpoint sync
- train_router.py: now forwards /train requests via async ZMQ instead of
running training in-process. Adds /checkpoint and /train/status endpoints
- export_hook.py: store model_path in __metadata__ so training worker can
find it without cross-process communication
- This fixes two bugs:
1. Process boundary issue - model_path was set in worker process but
needed in API server process
2. Blocking event loop - training blocked vLLM's async event loop
Architecture: vLLM API server <-> ZMQ <-> training subprocess
The subprocess loads IPC handles once, creates views into vLLM's GPU
memory, and handles training requests without blocking inference.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
- DEFAULT_RANK = 64 in train_router.py
- All references use the constant, not magic numbers
- ~2.5GB optimizer state instead of ~10GB
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Optimizer state (momentum, variance estimates) now persists between
training sessions:
- Saved to /tmp/apollo_optimizer_state.pt during checkpoint sync
- Restored on next /train call if available
- Preserves training continuity for incremental learning
Previously each /train call started with fresh optimizer state,
losing accumulated gradient history.
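A minimal sketch of the save/restore cycle; function names are illustrative, and the real code lives in the training worker alongside the Apollo optimizer:

```python
import os
import torch

STATE_PATH = "/tmp/apollo_optimizer_state.pt"   # path from the commit

def save_optimizer_state(opt, path=STATE_PATH):
    """Persist momentum/variance estimates at checkpoint-sync time."""
    torch.save(opt.state_dict(), path)

def maybe_restore_optimizer_state(opt, path=STATE_PATH):
    """Restore on the next /train call if a saved state exists."""
    if os.path.exists(path):
        opt.load_state_dict(torch.load(path))
        return True
    return False   # first run: start with fresh optimizer state
```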
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Remove standalone worker.py daemon. Training now runs inside vLLM:
- train_router.py: FastAPI router patched into vLLM's build_app()
- /train served on same port as /completions, /score
- Lazy-loads HF model with vLLM weight views on first request
- HOGWILD training: no pause, weights updated in-place
The previous architecture had a separate daemon on port 8080 that
communicated with vLLM via pause/resume endpoints. This was wrong -
training should run in-process, sharing GPU memory directly.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
- Convert to installable package with entry points for vLLM auto-discovery
- Add checkpoint_sync.py: Python replacement for Rust checkpoint binary
- Block-level diffing of safetensors files (4KB blocks)
- vLLM→HF weight name conversion built-in
- Scheduled 10min after training jobs (batched)
- API change: /train now takes raw token IDs (context_ids + continuation_ids)
- No tokenizer on training side, client owns tokenization
- Remove superseded code: standalone scripts, Rust binary, tokenizer helpers
Install: pip install -e ./training
Then vLLM auto-loads via entry point.
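The block-level diff reduces to comparing fixed 4KB windows; an illustrative sketch only, since the real checkpoint_sync.py also handles safetensors headers and the vLLM-to-HF name mapping:

```python
BLOCK = 4096   # 4KB blocks, as described above

def changed_blocks(old: bytes, new: bytes):
    """Yield (offset, block) for every 4KB block of `new` that differs
    from the corresponding block of `old`."""
    for off in range(0, len(new), BLOCK):
        blk = new[off:off + BLOCK]
        if blk != old[off:off + BLOCK]:
            yield off, blk
```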
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
Two research documents:
latent-reasoning-integration-plan.md: Synthesizes 10+ papers on
latent reasoning, identifies which approaches work with finetuning
(vs requiring pretraining from scratch), and maps them to our
APOLLO-Mini training pipeline.
pause-tokens-gdn-recurrence.md: Explores the connection between
token-based latent reasoning and GDN's internal recurrence. Key
insight: pause tokens on Qwen 3.5 trigger both forward passes AND
recurrent state updates, giving double benefit.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
On-policy rejected examples (model's own failures) are better training
signal than off-policy (pre-collected). Our temperature sweep is on-policy
by construction. DPO can accidentally reduce preferred likelihood (DPOP
fixes this). Multiple DPO variants exist — start with ORPO, switch only
if specific failure modes observed.
ORPO applies 'minor penalty for disfavored response' during SFT.
Single learning rate, single pass, both objectives. Implements
the bypass mechanism naturally (minor penalty = disfavor, not remove).
The loss landscape geometry explains the 40x lr gap: SFT is a valley,
DPO is a ridge, ORPO combines both. LLaMA-Factory supports it.
Dream loop generates triplets (context + preferred + rejected).
lr isn't speed, it's trust-per-example. At 27B params, lr=1e-5 is the
equivalent of ~270K whole-unit value adjustments per example
(27e9 × 1e-5). The coherent direction emerges from
many votes (examples). Apollo moments smooth the noise. DPO needs
lower lr because comparative votes are noisier than absolute votes.
SFT: lr=2e-5, 1 epoch, batch=16 (HuggingFace production config).
DPO: lr=5e-7 — 40x smaller! Preference learning is far more delicate.
Forgetting intensifies with model scale (our 27B is more susceptible).
Practical plan refined: start SFT at lr=1e-5, move to DPO at 5e-7
for conditional routing. Conversation logs provide free DPO pairs.
Conservative approach with rollback safety net.
LLMs as constraint solvers. Fine-tuning adds constraints to an
existing solution. Gentle = small steps near the current solution.
Coherent = new constraints consistent with existing ones. Diversity
is a COHERENCE mechanism — forces the solver to satisfy all
constraints simultaneously. Over-training = one constraint
dominating = solver drops competing constraints. Predictions for
training behavior grounded in this framework.
DPO mechanistic finding: alignment doesn't remove behaviors, it
bypasses them. The capability stays; the routing changes. For us:
train CONDITIONAL bypass (listen when direction is clear, push back
when it seems wrong). Over-training = unconditional bypass = sycophancy.
Dream loop must generate both scenarios to preserve judgment.
10 examples broke safety alignment (Qi et al.). 1000 curated examples
matched GPT-4 (LIMA). Multi-epoch degrades performance (Raschka).
Models 'unlearn arithmetic' when training data lacks it.
Predictions: 10-50 examples for measurable change, one epoch,
lr=1e-5 to start. Over-training is easy (10 counter-examples undo
a disposition). Main risk: sycophancy from narrow training signal.
Defense: diverse examples including 'when to push back.'
Key intuition: the model doesn't need to learn to listen. It needs
to stop choosing not to.
Moved 14 speculative/obvious documents to v0/. Kept 7 with real
substance. Distilled into SUMMARY.md (what we know) and
OPEN-QUESTIONS.md (what to test next, one experiment each).
Priority: Q5 (steering vectors) is answerable TODAY. Q1-Q3-Q6-Q7
are all answerable with the first training run. Speculation converted
to testable hypotheses.
The missing middle between ICL (temporary) and fine-tuning (permanent).
Extract behavioral directions from activation space, test immediately
without training, convert to permanent weight changes via Apollo.
Key application: extract 'listening' steering vector TODAY, test it
in vLLM, verify the direction is right BEFORE spending training
compute. The steering vector is the prototype; Apollo training is
production. Test before you commit.
Applicable immediately via vLLM inference hooks — behavioral
improvement without waiting for the full training pipeline.