Add an alternative trainer that uses the pip-installable steering-vectors
library (github.com/steering-vectors/steering-vectors) instead of our
hand-rolled extraction. The library ships four aggregators:

  mean:     diff-of-means, same as our 'pooled' default
  pca:      PCA on paired deltas; denoises implicitly by keeping only
            the principal direction of variation
  logistic: logistic-regression classifier; the weight vector is the
            concept direction. With an L1 penalty ('logistic_l1') it
            gives explicit sparse denoising: noise coordinates go to
            zero
  linear:   linear-regression counterpart of logistic

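To make the difference between the aggregators concrete, here is a minimal NumPy sketch of three of them on toy data. It is an illustration only, not the steering-vectors library's code: the function names, the toy data generator, the uncentered PCA, and the proximal-gradient solver for the L1 case are all assumptions made for this example.

```python
import numpy as np


def mean_direction(pos, neg):
    """Diff-of-means: mean positive activation minus mean negative activation."""
    return pos.mean(axis=0) - neg.mean(axis=0)


def pca_direction(pos, neg):
    """Top principal direction of the paired deltas pos[i] - neg[i].

    Assumption: the PCA is uncentered, so an offset shared by all deltas
    counts as signal instead of being subtracted out.
    """
    deltas = pos - neg
    # Rows of vt are the right singular vectors; the first one has the
    # largest (uncentered) second moment.
    _, _, vt = np.linalg.svd(deltas, full_matrices=False)
    direction = vt[0]
    # Singular vectors have arbitrary sign; orient towards the positive class.
    if np.mean(deltas @ direction) < 0:
        direction = -direction
    return direction


def logistic_l1_direction(pos, neg, l1=0.1, lr=0.5, steps=300):
    """L1-penalised logistic regression fitted by proximal gradient descent."""
    x = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w = w - lr * (x.T @ (p - y)) / len(y)
        b = b - lr * np.mean(p - y)
        # Soft-thresholding is the proximal step for the L1 penalty. Weights
        # whose gradient never exceeds l1 stay at exactly zero.
        w = np.sign(w) * np.maximum(np.abs(w) - lr * l1, 0.0)
    return w


def make_toy_pairs(n=500, dim=8, seed=0):
    """Toy paired activations: coordinate 0 carries the concept, the rest is noise."""
    rng = np.random.default_rng(seed)
    neg = rng.normal(size=(n, dim))
    pos = rng.normal(size=(n, dim))
    pos[:, 0] += 2.0
    return pos, neg
```

On data like this, the mean and PCA directions keep small nonzero entries on the noise coordinates, while the L1-penalised weights are exactly zero wherever the gradient stays below the penalty. That is the "explicit sparse denoising" referred to above.
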
The output format is unchanged: the same readout.safetensors +
readout.json pair that our existing plugin loads. The --aggregator flag
selects the method.
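
As a rough sketch of how such a flag could be wired up with argparse: the choice names come from the aggregator list in this message, but the parser layout, the help text, and the 'mean' default are assumptions for illustration, not the trainer's actual CLI.

```python
import argparse

# Names taken from the aggregator list in this message.
AGGREGATORS = ("mean", "pca", "logistic", "logistic_l1", "linear")


def build_parser():
    """Build a hypothetical trainer CLI that exposes only --aggregator."""
    parser = argparse.ArgumentParser(
        description="Train a readout with a selectable aggregator (sketch)."
    )
    parser.add_argument(
        "--aggregator",
        choices=AGGREGATORS,
        default="mean",  # assumed default; the real trainer may differ
        help="method used to turn paired activations into a direction",
    )
    return parser


def parse_aggregator(argv):
    """Return the aggregator name selected by the given argument list."""
    return build_parser().parse_args(argv).aggregator
```

Restricting the flag with `choices` makes argparse reject an unknown aggregator name at startup, before any training work is done.
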

Rationale: Kent's real request was 'how do we denoise diff-of-means',
not 'design a new extraction algorithm.' The library already has
logistic_l1 and pca aggregators that do exactly that. There is no point
reinventing them; just port the corpus.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
| Name |
|---|
| amygdala_stories |
| amygdala_training |
| apollo_plugin |
| research |
| DESIGN.md |
| pyproject.toml |