amygdala: bump subspace-k default to 512

k=20 was far too aggressive a truncation — it discards per-attention-head
discriminability entirely. At hidden_dim=5120, 40 heads × head_dim=128 each
contribute their own 128-dim block to the residual stream via W_o columns.
To resolve 'this concept lives in head H', per-story SVD needs enough rank
to separate head contributions, which means k on the order of hundreds.

512 is a reasonable default: clamped to n_tokens per story so short stories
use their full natural rank. The eigenvalue spectrum of M_pos - M_base
should become sharper (larger λ_0/λ_1 gap) as we stop averaging across
nuisance-shared directions.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
This commit is contained in:
Kent Overstreet 2026-04-18 21:41:00 -04:00
parent 974c6c7fd2
commit 389f1bbe03

View file

@ -850,8 +850,12 @@ def main() -> None:
ap.add_argument(
"--subspace-k",
type=int,
default=20,
help="Top-k right singular vectors per story for subspace method",
default=512,
help="Max top-k right singular vectors per story for subspace method "
"(clamped to n_tokens per story). Default 512 is enough to span "
"each story's full natural subspace including per-attention-head "
"contributions on a hidden_dim=5120 residual stream. Smaller "
"values (e.g. 20) discard per-head discriminability.",
)
ap.add_argument(
"--quality-report",