forked from kent/consciousness
Two deep dives following curiosity:

- Why context-frozen training works: gradient flows through W_q (the query projection) even when the context KVs are frozen. The model learns to LOOK AT the context differently, not to represent it differently. This is exactly what behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness = flatter minima = better generalization. The coarseness of channel/tensor-wise scaling prevents overfitting to specific training examples. For behavioral fine-tuning, this means learning an 'accept direction' rather than 'accept this specific phrasing.'
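The first claim can be checked numerically. A minimal sketch (hypothetical sizes and a toy squared-norm loss, not the actual training setup): freeze the context key/value matrices and verify by finite differences that the loss gradient still reaches W_q through attention.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 8, 4
W_q = rng.normal(size=(d, d))      # trainable query projection
x = rng.normal(size=(d,))          # embedding of the token being trained
K = rng.normal(size=(n_ctx, d))    # frozen context keys (never updated)
V = rng.normal(size=(n_ctx, d))    # frozen context values (never updated)

def attn_loss(Wq):
    """Single-head attention over the frozen context, toy squared-norm loss."""
    q = Wq @ x
    s = K @ q / np.sqrt(d)
    a = np.exp(s - s.max())
    a /= a.sum()
    out = a @ V
    return float(out @ out)

# Finite-difference gradient of the loss w.r.t. every entry of W_q.
eps = 1e-5
grad = np.zeros_like(W_q)
for i in range(d):
    for j in range(d):
        E = np.zeros_like(W_q)
        E[i, j] = eps
        grad[i, j] = (attn_loss(W_q + E) - attn_loss(W_q - E)) / (2 * eps)

# Nonzero gradient reaches W_q even though K and V are frozen:
print(np.abs(grad).max() > 1e-6)   # True
```

The frozen context only blocks gradients into its own representations; the attention scores still depend on q, so W_q keeps learning how to query that fixed context.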
Files:

- apollo-paper-analysis.md
- catastrophic-forgetting.md
- context-frozen-training.md
- directional-sharpness.md
- gradient-flow-frozen-context.md
- hogwild-convergence.md
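The scaling-granularity point above can also be illustrated. The toy sketch below (a made-up gradient matrix, not Apollo's actual low-rank machinery) contrasts AdamW-style per-element scaling with a single scale per channel, showing that only the coarse version preserves each channel's gradient direction:

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.normal(size=(4, 6))   # toy gradient; rows = output channels

# AdamW-style: each entry divided by its own second-moment estimate
# (one-step moment here, purely for illustration).
adamw_update = G / (np.sqrt(G**2) + 1e-8)

# Channel-wise (Apollo-flavored): one scale per row, so all weights
# in a channel move together.
row_scale = np.sqrt((G**2).mean(axis=1, keepdims=True)) + 1e-8
channel_update = G / row_scale

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_chan = cos(G[0], channel_update[0])   # scalar rescale: direction kept
cos_adamw = cos(G[0], adamw_update[0])    # per-element rescale: direction rotated
print(cos_chan > 0.999999)   # True
print(cos_adamw < 1.0)       # True
```

Per-element scaling flattens every entry toward unit magnitude, discarding the relative magnitudes within a channel; channel-wise scaling keeps that direction intact, which matches the 'accept direction, not specific phrasing' intuition.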