amygdala: bump subspace-k default to 512
k=20 was far too aggressive a truncation: it discards per-attention-head discriminability entirely. At hidden_dim=5120, each of the 40 heads (head_dim=128) contributes its own 128-dim block to the residual stream via W_o columns. To resolve 'this concept lives in head H', the per-story SVD needs enough rank to separate head contributions, which means k on the order of hundreds.

512 is a reasonable default. It is clamped to n_tokens per story, so short stories use their full natural rank. The eigenvalue spectrum of M_pos - M_base should become sharper (larger λ_0/λ_1 gap) as we stop averaging across nuisance-shared directions.

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
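The clamping described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function name `story_subspace`, the use of NumPy, the random stand-in activations, and the absence of any centering step are all assumptions.

```python
import numpy as np


def story_subspace(acts: np.ndarray, k: int = 512) -> np.ndarray:
    """Top-k right singular vectors of one story's activations (hypothetical helper).

    acts has shape (n_tokens, hidden_dim). The result has shape
    (k_eff, hidden_dim), where k_eff = min(k, n_tokens).
    """
    n_tokens, _ = acts.shape
    # Clamp k so a short story keeps its full natural rank instead of failing
    # or padding: a matrix with n_tokens rows has rank at most n_tokens.
    k_eff = min(k, n_tokens)
    # With full_matrices=False, vt has min(n_tokens, hidden_dim) rows,
    # ordered by descending singular value.
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:k_eff]


rng = np.random.default_rng(0)
hidden_dim = 40 * 128  # 40 heads x head_dim 128 = 5120, as in the commit message

# Random stand-ins for per-story residual-stream activations (assumption).
short_story = rng.standard_normal((30, hidden_dim))
long_story = rng.standard_normal((600, hidden_dim))

basis_short = story_subspace(short_story)  # clamped to 30 rows
basis_long = story_subspace(long_story)    # truncated to 512 rows
```

With k=20, both stories would be cut to 20 directions, fewer than a single head's 128-dim block; with k=512, only the long story is truncated at all.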
parent 974c6c7fd2
commit 389f1bbe03
1 changed file with 6 additions and 2 deletions
@@ -850,8 +850,12 @@ def main() -> None:
     ap.add_argument(
         "--subspace-k",
         type=int,
-        default=20,
-        help="Top-k right singular vectors per story for subspace method",
+        default=512,
+        help="Max top-k right singular vectors per story for subspace method "
+        "(clamped to n_tokens per story). Default 512 is enough to span "
+        "each story's full natural subspace including per-attention-head "
+        "contributions on a hidden_dim=5120 residual stream. Smaller "
+        "values (e.g. 20) discard per-head discriminability.",
     )
     ap.add_argument(
         "--quality-report",
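For reference, a small sketch of how the changed flag behaves when parsed. Only the `--subspace-k` name, type, and default come from the diff; the `build_parser` wrapper and the shortened help string are assumptions, and the real script defines many more arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical stand-in for the parser built in main()."""
    ap = argparse.ArgumentParser()
    ap.add_argument(
        "--subspace-k",
        type=int,
        default=512,
        help="Max top-k right singular vectors per story (shortened here)",
    )
    return ap


# No flag given: the new default applies.
default_args = build_parser().parse_args([])

# Passing the flag explicitly restores the previous k=20 behaviour.
legacy_args = build_parser().parse_args(["--subspace-k", "20"])

# argparse maps "--subspace-k" to the attribute name "subspace_k".
print(default_args.subspace_k, legacy_args.subspace_k)
```

Anyone who depended on the old truncation can keep it by passing `--subspace-k 20` explicitly.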