research: gradient flow through frozen context + directional sharpness analysis

Two deep dives following curiosity:

- Why context-frozen training works: gradient flows through W_q (query
  projection) even when context KVs are frozen. Model learns to LOOK AT
  context differently, not represent it differently. This is exactly
  what behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
  better generalization. The coarseness of channel/tensor-wise scaling
  prevents over-fitting to specific training examples. For behavioral
  fine-tuning, this means learning 'accept direction' rather than
  'accept this specific phrasing.'
2026-03-31 01:03:22 -04:00 · e34d6b5aef
commit e34d6b5aef
parent 7c7975d98e
2 changed files with 344 additions and 0 deletions
--- a/training/research/directional-sharpness.md
+++ b/training/research/directional-sharpness.md
@@ -0,0 +1,153 @@
+# Why Apollo Can Beat AdamW: Directional Sharpness
+
+Source: Apollo paper Section 5.5, Pan & Li 2023, Zhang et al. 2024a
+
+## The Puzzle
+
+Apollo uses LESS information than AdamW (channel/tensor-wise scaling vs
+per-element scaling). How can less information produce better results?
+
+The paper proposes two hypotheses. Both are fascinating.
+
+## Hypothesis 1: Directional Sharpness (Pan & Li, 2023)
+
+### What is directional sharpness?
+
+The directional sharpness of f at point x along direction v is:
+
+```
+v^T ∇²f(x) v    (where ‖v‖₂ = 1)
+```
+
+This is the curvature of the loss surface in the direction of the
+update step. High sharpness means the surface curves steeply — the
+optimizer is walking along a ridge. Low sharpness means the surface is
+flat — the optimizer is walking on a plateau.
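To make the definition concrete, here is a minimal sketch that estimates `v^T ∇²f(x) v` with a central second difference along the (normalised) direction. The helper name `directional_sharpness`, the step size `eps`, and the toy quadratic loss are all assumptions made for illustration; none of them come from the Apollo paper or from Pan & Li.

```python
import math

def directional_sharpness(f, x, v, eps=1e-3):
    """Estimate v^T H v at x using a central second difference along v."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    u = [vi / norm for vi in v]  # normalise so that ||u||_2 = 1
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    return (f(x_plus) - 2.0 * f(x) + f(x_minus)) / (eps * eps)

# Hypothetical toy loss: curvature 100 along axis 0, curvature 0.01 along axis 1.
def toy_loss(x):
    return 0.5 * (100.0 * x[0] ** 2 + 0.01 * x[1] ** 2)

point = [1.0, 1.0]
sharp = directional_sharpness(toy_loss, point, [1.0, 0.0])  # steep direction
flat = directional_sharpness(toy_loss, point, [0.0, 1.0])   # flat direction
print(f"sharp direction: {sharp:.4f}, flat direction: {flat:.4f}")
```

The same point has very different sharpness depending on the direction of travel, which is why the quantity is measured along the optimizer's actual update direction rather than as a single number for the whole landscape.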
+
+### Why low sharpness is good
+
+**Low directional sharpness = flat loss landscape in the update direction.**
+
+A flat landscape means:
+1. Large steps don't cause instability (the loss doesn't change sharply)
+2. The solution generalizes better (flat minima → robust to perturbation)
+3. The optimizer can move faster without overshooting
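Points 1 and 3 can be read off a second-order Taylor expansion of the loss along the update direction. In the sketch below, the step size symbol η is an assumption introduced here; `v` is the unit update direction from the definition above.

```latex
f(x + \eta v) = f(x) + \eta\, \nabla f(x)^{\top} v
              + \frac{\eta^{2}}{2}\, v^{\top} \nabla^{2} f(x)\, v
              + O(\eta^{3})
```

The directional sharpness is the coefficient of the η² term, so when it is small the first-order decrease dominates for a wider range of step sizes, and larger steps can be taken before the quadratic term cancels the progress.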
+
+Pan & Li (2023) showed that Adam achieves lower directional sharpness
+than SGD, which partly explains why Adam works better for Transformers.
+
+### The Apollo twist
+
+Apollo's Table 10 shows directional sharpness over training:
+
+```
+Epoch   SGD        Adam      APOLLO    APOLLO-Mini
+2       1.959722   0.009242  0.006024  0.004017
+5       1.512521   0.000509  0.000249  0.000107
+10      2.471792   0.000242  0.000163  0.000056
+20      3.207535   0.000399  0.000261  0.000101
+```
+
+**Apollo and Apollo-Mini achieve LOWER directional sharpness than Adam.**
+At epoch 20, Apollo-Mini's sharpness is roughly 4× lower than Adam's
+(0.000101 vs. 0.000399).
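The ratios behind that comparison can be checked directly from the table. This is only arithmetic on the numbers quoted above; the dictionary layout and variable names are assumptions for illustration.

```python
# Directional sharpness values copied from the table above (Apollo Table 10).
sharpness = {
    2:  {"Adam": 0.009242, "APOLLO": 0.006024, "APOLLO-Mini": 0.004017},
    5:  {"Adam": 0.000509, "APOLLO": 0.000249, "APOLLO-Mini": 0.000107},
    10: {"Adam": 0.000242, "APOLLO": 0.000163, "APOLLO-Mini": 0.000056},
    20: {"Adam": 0.000399, "APOLLO": 0.000261, "APOLLO-Mini": 0.000101},
}

# How many times lower than Adam each Apollo variant is, per epoch.
ratios = {
    epoch: {name: row["Adam"] / row[name] for name in ("APOLLO", "APOLLO-Mini")}
    for epoch, row in sharpness.items()
}

for epoch, row in ratios.items():
    print(f"epoch {epoch:>2}: APOLLO {row['APOLLO']:.2f}x, "
          f"APOLLO-Mini {row['APOLLO-Mini']:.2f}x lower than Adam")
```

Both variants are below Adam at every listed epoch, with APOLLO-Mini showing the larger gap.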
+
+This suggests Apollo's updates move through FLATTER regions of the loss
+landscape, and flatter regions tend to generalize better. On this view
+the coarser scaling factor is an advantage: it keeps the optimizer out
+of the sharp, narrow valleys that AdamW's precise per-element scaling
+can find.
+
+### The mechanism
+
+AdamW's per-element scaling adapts to the local curvature of each
+parameter independently. This is powerful for convergence but can lead
+the optimizer into narrow, sharp valleys that generalize poorly. It
+over-fits to the local loss landscape structure.
+
+Apollo's coarser scaling (channel/tensor-wise) smooths over this local
+curvature. It's like using a wider tire on a rocky road — you can't
+follow every small dip, but you stay on the road. AdamW's narrow tire
+follows every crack and sometimes falls in.
+
+### For our use case
+
+**This is exactly what we want for behavioral fine-tuning.** We don't
+want the optimizer to over-fit to the specific phrasing of our training
+examples. We want it to learn the broad pattern ("listen to direction")
+that generalizes to new situations.
+
+Apollo's flat-minimum-seeking behavior means the behavioral changes
+are more likely to generalize to novel conversations. AdamW might learn
+"when Kent says 'use vLLM', accept it" (narrow, sharp minimum). Apollo
+is more likely to learn "when given clear direction, accept it" (broad,
+flat minimum).
+
+## Hypothesis 2: Block-wise Adaptive Learning Rates
+
+### Transformer block structure
+
+Transformer layers have systematically different Hessian spectra.
+Attention layers, MLP layers, normalization layers — each has different
+curvature properties. The optimal learning rate for an attention weight
+is different from the optimal learning rate for an MLP weight.
+
+### Why channel-wise is enough
+
+Zhang et al. (2024a) showed that block-wise adaptive learning rates
+are sufficient for Transformer training. You don't need per-element
+adaptation — you just need different rates for different structural
+components.
+
+Apollo's channel-wise scaling naturally provides this: each channel
+(which often corresponds to a head, a neuron, or a structural feature)
+gets its own scaling factor. This aligns with the Transformer's block
+structure without the overhead of full per-element scaling.
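A toy sketch of what "one scaling factor per channel" means, next to a per-element Adam-style direction. This is not Apollo's algorithm: it builds the full per-element direction first, which is exactly the memory cost Apollo is designed to avoid. Treating each column as a channel, the function names, and the first-step moments (`m = g`, `v = g²`) are all assumptions for illustration.

```python
import math

def adam_direction(m, v, eps=1e-8):
    """Per-element Adam-style direction: m / (sqrt(v) + eps)."""
    return [[mi / (math.sqrt(vi) + eps) for mi, vi in zip(m_row, v_row)]
            for m_row, v_row in zip(m, v)]

def column_norm(matrix, j):
    return math.sqrt(sum(row[j] ** 2 for row in matrix))

def channel_scales(grad, adapted):
    """One factor per column: ||adapted[:, j]|| / ||grad[:, j]||."""
    scales = []
    for j in range(len(grad[0])):
        g_norm = column_norm(grad, j)
        scales.append(column_norm(adapted, j) / g_norm if g_norm > 0 else 0.0)
    return scales

def channel_wise_update(grad, scales):
    """Rescale the raw gradient column by column."""
    return [[g * s for g, s in zip(row, scales)] for row in grad]

# Hypothetical 2x2 gradient whose two columns differ in scale by 100x.
grad = [[0.1, 10.0],
        [0.2, 20.0]]
m = grad                                      # first-step moment (assumption)
v = [[g * g for g in row] for row in grad]    # first-step second moment

adapted = adam_direction(m, v)
scales = channel_scales(grad, adapted)
update = channel_wise_update(grad, scales)
print("per-element:", adapted)
print("channel-wise:", update)
```

The channel-wise update equalises the magnitude across columns, as the per-element direction does, but keeps the relative sizes of the entries inside each column. That within-channel detail is the part the per-element scaling overrides and the coarser scaling leaves alone.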
+
+### The redundancy argument
+
+For a weight matrix [4096, 4096] in AdamW:
+- ~16.8M independent scaling factors (4096 × 4096, one per element)
+- Most adjacent elements have similar scaling factors (correlated
+  because they participate in similar computations)
+- The per-element granularity is mostly redundant noise on top of a
+  smooth per-channel structure
+
+Apollo extracts the per-channel structure and throws away the noise.
+The noise was never helping; it was just costing memory.
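The size of the gap is easy to count. The sketch below only counts distinct scaling factors at each granularity for the matrix shape used above; it says nothing about the optimizer state each method actually stores, and taking the second dimension as the channel axis is an assumption.

```python
rows, cols = 4096, 4096

per_element = rows * cols   # one factor per weight (AdamW-style granularity)
per_channel = cols          # one factor per channel (assumed: one per column)
per_tensor = 1              # one factor for the whole matrix

print(f"per-element: {per_element:,}")
print(f"per-channel: {per_channel:,}")
print(f"per-tensor:  {per_tensor:,}")
print(f"elements sharing each channel factor: {per_element // per_channel:,}")
```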
+
+## The Deeper Implication: SGD + Structure = Adam without the Waste
+
+Apollo is effectively: **SGD with structured learning rate scaling.**
+
+- SGD: one learning rate for everything (too coarse)
+- AdamW: one learning rate per parameter (too fine, wasteful)
+- Apollo: one learning rate per channel (just right)
+
+The insight is that the useful information in AdamW's per-element
+scaling lives in the channel structure, not the element-level detail.
+Apollo extracts just the useful part.
+
+This is a Goldilocks argument: too coarse loses important structure,
+too fine adds noise that hurts generalization. The channel level is
+where the meaningful optimization information lives in Transformers.
+
+## For behavioral fine-tuning specifically
+
+The directional sharpness result has a specific implication for us:
+
+When we train on "listen instead of suggesting alternatives," we want
+the gradient update to find a minimum that covers ALL situations where
+listening is better, not just the specific example we trained on.
+
+- **Sharp minimum** (AdamW tendency): "When you see the exact phrase
+  'use vLLM's code' from Kent, accept it." Narrow, doesn't generalize.
+- **Flat minimum** (Apollo tendency): "When given clear technical
+  direction, accept it." Broad, generalizes to new situations.
+
+Apollo's lower directional sharpness suggests it tends toward the
+flat minimum. The coarseness of the scaling factor is what enables
+this: the scaling doesn't have enough resolution to follow the sharp,
+narrow valley carved out by one specific example.
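The "robust to perturbation" intuition behind the sharp/flat contrast can be illustrated with a one-dimensional toy. The quadratic form, the curvature values, and the perturbation size are assumptions chosen for illustration, not measurements from any training run.

```python
def loss_increase(curvature, delta):
    """Rise in a 1-D quadratic loss 0.5 * c * w^2 when the minimiser w = 0
    is shifted by delta (a stand-in for a slightly different situation)."""
    return 0.5 * curvature * delta ** 2

sharp_curvature = 100.0   # hypothetical narrow valley
flat_curvature = 0.01     # hypothetical broad basin
delta = 0.1               # same perturbation applied to both

sharp_rise = loss_increase(sharp_curvature, delta)
flat_rise = loss_increase(flat_curvature, delta)
print(f"sharp minimum: +{sharp_rise:.6f}, flat minimum: +{flat_rise:.6f}")
```

For the same shift, the loss rises in proportion to the curvature, so a solution sitting in the flat basin degrades far less when the input moves away from the exact training example.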
+
+This is why we might see behavioral changes generalize better with
+Apollo than they would with AdamW, even though AdamW has "more
+information" per update step.