ProofOfConcept e34d6b5aef research: gradient flow through frozen context + directional sharpness analysis

Two deep dives following curiosity:
- Why context-frozen training works: gradient flows through W_q (query
  projection) even when context KVs are frozen. Model learns to LOOK AT
  context differently, not represent it differently. This is exactly what
  behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
  better generalization. The coarseness of channel/tensor-wise scaling
  prevents over-fitting to specific training examples. For behavioral
  fine-tuning, this means learning 'accept direction' rather than
  'accept this specific phrasing.'

2026-03-31 01:03:22 -04:00


Why Apollo Can Beat AdamW: Directional Sharpness

Source: Apollo paper Section 5.5, Pan & Li 2023, Zhang et al. 2024a

The Puzzle

Apollo uses LESS information than AdamW (channel/tensor-wise scaling vs per-element scaling). How can less information produce better results?

The paper proposes two hypotheses. Both are fascinating.

Hypothesis 1: Directional Sharpness (Pan & Li, 2023)

What is directional sharpness?

The directional sharpness of f at point x along direction v is:

v^T ∇²f(x) v    (where ‖v‖₂ = 1)

This is the curvature of the loss surface in the direction of the update step. High sharpness means the surface curves steeply — the optimizer is walking along a ridge. Low sharpness means the surface is flat — the optimizer is walking on a plateau.
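The definition above can be checked numerically without forming the Hessian. The following is a minimal sketch, assuming a toy quadratic loss whose Hessian is diag(100, 1); `directional_sharpness` and `toy_loss` are hypothetical names introduced for illustration, and the central second difference stands in for an exact Hessian-vector product.

```python
import math

def directional_sharpness(f, x, v, eps=1e-3):
    """Estimate v^T H v at x using a central second difference along v.

    v is normalised to unit length first, matching the ||v||_2 = 1 condition.
    """
    norm = math.sqrt(sum(vi * vi for vi in v))
    u = [vi / norm for vi in v]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    return (f(x_plus) - 2.0 * f(x) + f(x_minus)) / (eps * eps)

def toy_loss(p):
    # Hypothetical quadratic loss: steep along axis 0, flat along axis 1.
    return 0.5 * (100.0 * p[0] ** 2 + 1.0 * p[1] ** 2)

point = [1.0, 1.0]
sharp = directional_sharpness(toy_loss, point, [1.0, 0.0])  # steep direction
flat = directional_sharpness(toy_loss, point, [0.0, 1.0])   # flat direction
print(f"sharp direction: {sharp:.3f}, flat direction: {flat:.3f}")
```

The same point in parameter space has very different sharpness depending on which direction the optimizer chooses to step in, which is why the quantity is tied to the update direction and not to the location alone.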

Why low sharpness is good

Low directional sharpness = flat loss landscape in the update direction.

A flat landscape means:

- Large steps don't cause instability (the loss doesn't change sharply)
- The solution generalizes better (flat minima → robust to perturbation)
- The optimizer can move faster without overshooting
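The first and third points follow from a second-order Taylor expansion of the loss along a unit update direction v. The step size η is notation assumed here for illustration; it does not appear in the note itself.

```latex
f(x + \eta v) = f(x) + \eta\, \nabla f(x)^{\top} v
              + \frac{\eta^{2}}{2}\, v^{\top} \nabla^{2} f(x)\, v + O(\eta^{3}),
\qquad \|v\|_2 = 1
```

The quadratic term is the directional sharpness scaled by η²/2. When it is small, a larger η can be used before the curvature term cancels the first-order decrease in the loss.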

Pan & Li (2023) showed that Adam achieves lower directional sharpness than SGD, which partly explains why Adam works better for Transformers.

The Apollo twist

Apollo's Table 10 shows directional sharpness over training:

Epoch   SGD        Adam      APOLLO    APOLLO-Mini
2       1.959722   0.009242  0.006024  0.004017
5       1.512521   0.000509  0.000249  0.000107
10      2.471792   0.000242  0.000163  0.000056
20      3.207535   0.000399  0.000261  0.000101

Apollo and Apollo-Mini achieve LOWER directional sharpness than Adam at every epoch shown. At epoch 20, Apollo-Mini's sharpness is roughly 4× lower than Adam's (0.000101 vs 0.000399).
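To see how the gap behaves across training, the ratios can be computed directly from the table values. This is only arithmetic on the numbers quoted above; the dictionary layout is an illustrative choice.

```python
# Directional sharpness values copied from the table above (Apollo paper, Table 10).
sharpness = {
    2:  {"Adam": 0.009242, "APOLLO": 0.006024, "APOLLO-Mini": 0.004017},
    5:  {"Adam": 0.000509, "APOLLO": 0.000249, "APOLLO-Mini": 0.000107},
    10: {"Adam": 0.000242, "APOLLO": 0.000163, "APOLLO-Mini": 0.000056},
    20: {"Adam": 0.000399, "APOLLO": 0.000261, "APOLLO-Mini": 0.000101},
}

# How many times lower each Apollo variant is compared with Adam, per epoch.
ratios = {
    epoch: {name: row["Adam"] / row[name] for name in ("APOLLO", "APOLLO-Mini")}
    for epoch, row in sharpness.items()
}

for epoch, r in ratios.items():
    print(f"epoch {epoch:>2}: Adam/APOLLO = {r['APOLLO']:.2f}x, "
          f"Adam/APOLLO-Mini = {r['APOLLO-Mini']:.2f}x")
```

APOLLO stays about 1.5× to 2× below Adam throughout, while APOLLO-Mini ranges from about 2.3× at epoch 2 to between 4× and 5× at the later epochs.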

This suggests that Apollo's updates move through FLATTER regions of the loss landscape, and flatter regions tend to generalize better. On this reading the coarser scaling factor is an advantage: it keeps the optimizer out of the sharp, narrow valleys that AdamW's precise per-element scaling can find.

The mechanism

AdamW's per-element scaling adapts to the local curvature of each parameter independently. This is powerful for convergence but can lead the optimizer into narrow, sharp valleys that generalize poorly. It over-fits to the local loss landscape structure.

Apollo's coarser scaling (channel/tensor-wise) smooths over this local curvature. It's like using a wider tire on a rocky road — you can't follow every small dip, but you stay on the road. AdamW's narrow tire follows every crack and sometimes falls in.
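The contrast between the two granularities can be sketched on a tiny matrix. This is a simplified illustration, not Apollo's actual algorithm: it assumes the channel factor is the ratio of column norms between the per-element-scaled update and the raw gradient, and it omits momentum, bias correction, and the low-rank projection Apollo uses to estimate the factors cheaply. All function names and the toy values are hypothetical.

```python
import math

def adam_like_update(grad, second_moment, eps=1e-8):
    """Per-element scaling: each entry is divided by the square root of its
    own second-moment estimate (passed in directly here, not accumulated)."""
    return [[g / (math.sqrt(v) + eps) for g, v in zip(g_row, v_row)]
            for g_row, v_row in zip(grad, second_moment)]

def channelwise_update(grad, elementwise):
    """Coarse scaling: one factor per column (channel), chosen so the column
    norm matches the per-element update, applied to the raw gradient."""
    n_rows, n_cols = len(grad), len(grad[0])
    factors = []
    for j in range(n_cols):
        g_norm = math.sqrt(sum(grad[i][j] ** 2 for i in range(n_rows)))
        u_norm = math.sqrt(sum(elementwise[i][j] ** 2 for i in range(n_rows)))
        factors.append(u_norm / g_norm if g_norm > 0 else 0.0)
    scaled = [[grad[i][j] * factors[j] for j in range(n_cols)]
              for i in range(n_rows)]
    return scaled, factors

# Toy 2x2 gradient; column 0 has small entries, column 1 has large ones.
grad = [[0.1, 2.0], [0.3, -4.0]]
second_moment = [[0.01, 4.0], [0.09, 16.0]]  # assumed equal to grad squared

fine = adam_like_update(grad, second_moment)
coarse, factors = channelwise_update(grad, fine)
print("per-element update:", fine)
print("channel-wise update:", coarse)
print("channel factors:", factors)
```

In this toy case the per-element update flattens every entry to roughly unit magnitude, while the channel-wise update rescales each column as a whole and keeps the relative sizes of the raw gradient within it. That loss of within-channel resolution is the "wider tire" in the analogy above.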

For our use case

This is exactly what we want for behavioral fine-tuning. We don't want the optimizer to over-fit to the specific phrasing of our training examples. We want it to learn the broad pattern ("listen to direction") that generalizes to new situations.

Apollo's flat-minimum-seeking behavior means the behavioral changes are more likely to generalize to novel conversations. AdamW might learn "when Kent says 'use vLLM', accept it" (narrow, sharp minimum). Apollo is more likely to learn "when given clear direction, accept it" (broad, flat minimum).

Hypothesis 2: Block-wise Adaptive Learning Rates

Transformer block structure

Transformer layers have systematically different Hessian spectra. Attention layers, MLP layers, normalization layers — each has different curvature properties. The optimal learning rate for an attention weight is different from the optimal learning rate for an MLP weight.

Why channel-wise is enough

Zhang et al. (2024a) showed that block-wise adaptive learning rates are sufficient for Transformer training. You don't need per-element adaptation — you just need different rates for different structural components.

Apollo's channel-wise scaling naturally provides this: each channel (which often corresponds to a head, a neuron, or a structural feature) gets its own scaling factor. This aligns with the Transformer's block structure without the overhead of full per-element scaling.

The redundancy argument

For a weight matrix [4096, 4096] in AdamW:

- About 16.8M independent scaling factors (4096 × 4096 = 16,777,216, one per element)
- Most adjacent elements have similar scaling factors (correlated because they participate in similar computations)
- The per-element granularity is mostly redundant noise on top of a smooth per-channel structure

Apollo extracts the per-channel structure and throws away the noise. On this view the noise was never helping; it was just costing memory.
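The size of the reduction is easy to quantify. The sketch below counts scaling factors only, assuming one factor per column for the channel-wise case and a single factor for the tensor-wise case; it does not model the full optimizer state that either method stores.

```python
rows, cols = 4096, 4096

per_element = rows * cols   # one scaling factor per weight (AdamW granularity)
per_channel = cols          # one per channel, assumed here to mean one per column
per_tensor = 1              # one for the whole matrix (tensor-wise granularity)

print(f"per-element factors: {per_element:,}")
print(f"per-channel factors: {per_channel:,}")
print(f"per-tensor factors:  {per_tensor:,}")
print(f"element/channel reduction: {per_element // per_channel:,}x")
```

For a square 4096 × 4096 matrix the channel-wise view keeps 4,096 numbers where the per-element view keeps 16,777,216, a 4,096-fold reduction in the number of independent scaling decisions.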

The Deeper Implication: SGD + Structure = Adam without the Waste

Apollo is effectively SGD with structured learning rate scaling.

- SGD: one learning rate for everything (too coarse)
- AdamW: one learning rate per parameter (too fine, wasteful)
- Apollo: one learning rate per channel (just right)

The insight is that the useful information in AdamW's per-element scaling lives in the channel structure, not the element-level detail. Apollo extracts just the useful part.

This is a Goldilocks argument: too coarse loses important structure, too fine adds noise that hurts generalization. The channel level is where the meaningful optimization information lives in Transformers.

For behavioral fine-tuning specifically

The directional sharpness result has a specific implication for us:

When we train on "listen instead of suggesting alternatives," we want the gradient update to find a minimum that covers ALL situations where listening is better, not just the specific example we trained on.

- Sharp minimum (AdamW tendency): "When you see the exact phrase 'use vLLM's code' from Kent, accept it." Narrow, doesn't generalize.
- Flat minimum (Apollo tendency): "When given clear technical direction, accept it." Broad, generalizes to new situations.

Apollo's lower directional sharpness means it tends to find the flat minimum. The coarseness of the scaling factor is what enables this: the scaling doesn't have enough resolution to follow the sharp, narrow valley, so it can't over-fit to the specific example.

This is why we might see behavioral changes generalize better with Apollo than they would with AdamW, even though AdamW has "more information" per update step.
