Two deep dives following curiosity:
- Why context-frozen training works: gradient still flows through W_q (the
query projection) even when the context KVs are frozen. The model learns to
LOOK AT the context differently, not to represent it differently. This is
exactly what behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness, which tends to mean
flatter minima and better generalization. The coarseness of channel-wise or
tensor-wise scaling discourages over-fitting to specific training examples.
For behavioral fine-tuning, this means learning 'accept direction' rather
than 'accept this specific phrasing.'
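
Toy check of the context-frozen point: a minimal sketch in plain Python, assuming a single attention head with one query token and a squared-error loss. All names here (attend, grad_wq, K_ctx, V_ctx) are hypothetical and the gradient is taken by finite differences, so this only illustrates the mechanism and is not the actual training setup. The context keys and values are immutable tuples; only W_q is updated, and the attention weights still shift toward the useful context entry.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(W_q, x, K_ctx, V_ctx):
    """Single-query attention over frozen context keys/values."""
    q = [sum(w * xi for w, xi in zip(row, x)) for row in W_q]
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K_ctx]
    weights = softmax(scores)
    dim = len(V_ctx[0])
    out = [sum(w * v[d] for w, v in zip(weights, V_ctx)) for d in range(dim)]
    return out, weights

def loss(W_q, x, K_ctx, V_ctx, target):
    out, _ = attend(W_q, x, K_ctx, V_ctx)
    return sum((o - t) ** 2 for o, t in zip(out, target))

def grad_wq(W_q, x, K_ctx, V_ctx, target, eps=1e-5):
    """Central finite-difference gradient of the loss w.r.t. W_q only."""
    grad = [[0.0] * len(W_q[0]) for _ in W_q]
    for i in range(len(W_q)):
        for j in range(len(W_q[0])):
            saved = W_q[i][j]
            W_q[i][j] = saved + eps
            up = loss(W_q, x, K_ctx, V_ctx, target)
            W_q[i][j] = saved - eps
            down = loss(W_q, x, K_ctx, V_ctx, target)
            W_q[i][j] = saved  # restore the original entry
            grad[i][j] = (up - down) / (2 * eps)
    return grad

# Frozen context: tuples, so nothing below can modify them.
K_ctx = ((1.0, 0.0), (0.0, 1.0))
V_ctx = ((1.0, 0.0), (0.0, 1.0))
x = [1.0, 1.0]                    # hidden state of the new token (assumed)
target = [0.0, 1.0]               # we want the output to match V_ctx[1]
W_q = [[1.0, 0.0], [0.0, 1.0]]    # the only trainable parameter

loss_before = loss(W_q, x, K_ctx, V_ctx, target)
_, weights_before = attend(W_q, x, K_ctx, V_ctx)
first_grad = grad_wq(W_q, x, K_ctx, V_ctx, target)

for _ in range(100):              # plain gradient descent on W_q
    g = grad_wq(W_q, x, K_ctx, V_ctx, target)
    for i in range(len(W_q)):
        for j in range(len(W_q[0])):
            W_q[i][j] -= 0.5 * g[i][j]

loss_after = loss(W_q, x, K_ctx, V_ctx, target)
_, weights_after = attend(W_q, x, K_ctx, V_ctx)
print("attention before:", weights_before)
print("attention after: ", weights_after)
```

The gradient w.r.t. W_q is nonzero even though K_ctx and V_ctx never change, and the attention mass moves toward the second context entry: the representation of the context is fixed, only the way the query looks at it changes.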
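
Toy sketch of the channel-wise scaling point: a simplified illustration in plain Python, based on my understanding of the Apollo idea (Adam-style moments kept in a small randomly projected space, then reduced to one scale per channel and applied to the raw gradient). This is not the reference implementation. The function name apollo_like_update, the state dict, the choice of columns as channels, and the omission of weight decay, projection refresh, and any update clipping are all assumptions of this sketch.

```python
import math
import random

def col_norms(M):
    rows, cols = len(M), len(M[0])
    return [math.sqrt(sum(M[i][j] ** 2 for i in range(rows))) for j in range(cols)]

def apollo_like_update(G, P, state, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (update, scales): raw gradient G with one scale per column."""
    r, m, n = len(P), len(G), len(G[0])
    # Project the gradient into a small auxiliary space: R = P @ G, shape (r, n).
    R = [[sum(P[a][i] * G[i][j] for i in range(m)) for j in range(n)]
         for a in range(r)]
    state["t"] += 1
    t = state["t"]
    adapted = [[0.0] * n for _ in range(r)]
    for a in range(r):
        for j in range(n):
            # Adam-style moments, kept only in the projected space.
            state["m"][a][j] = beta1 * state["m"][a][j] + (1 - beta1) * R[a][j]
            state["v"][a][j] = beta2 * state["v"][a][j] + (1 - beta2) * R[a][j] ** 2
            m_hat = state["m"][a][j] / (1 - beta1 ** t)
            v_hat = state["v"][a][j] / (1 - beta2 ** t)
            adapted[a][j] = m_hat / (math.sqrt(v_hat) + eps)
    # One coarse scale per channel: how much Adam rescaled that column overall.
    scales = [ad / (raw + eps)
              for ad, raw in zip(col_norms(adapted), col_norms(R))]
    # Apply the coarse scale to the raw gradient; entries inside a channel
    # keep their relative sizes, unlike element-wise Adam scaling.
    update = [[G[i][j] * scales[j] for j in range(n)] for i in range(m)]
    return update, scales

rng = random.Random(0)
m, n, r = 3, 2, 2
G = [[0.5, -2.0], [1.0, 4.0], [-1.5, 1.0]]       # made-up gradient, shape (m, n)
P = [[rng.gauss(0.0, 1.0 / math.sqrt(r)) for _ in range(m)] for _ in range(r)]
state = {"t": 0,
         "m": [[0.0] * n for _ in range(r)],
         "v": [[0.0] * n for _ in range(r)]}

update, scales = apollo_like_update(G, P, state)
print("per-channel scales:", scales)
```

Because every entry in a column shares one scale, the update cannot single out individual weights the way element-wise second-moment scaling does; a tensor-wise variant would collapse scales to a single number for the whole matrix.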