research: gradient flow through frozen context + directional sharpness analysis
Two deep dives following curiosity:

- Why context-frozen training works: gradient flows through W_q (query
  projection) even when context KVs are frozen. Model learns to LOOK AT
  context differently, not represent it differently. This is exactly what
  behavioral fine-tuning needs.
- Why Apollo beats AdamW: lower directional sharpness = flatter minima =
  better generalization. The coarseness of channel/tensor-wise scaling
  prevents over-fitting to specific training examples. For behavioral
  fine-tuning, this means learning 'accept direction' rather than 'accept
  this specific phrasing.'

parent 7c7975d98e
commit e34d6b5aef
2 changed files with 344 additions and 0 deletions
153
training/research/directional-sharpness.md
Normal file
@@ -0,0 +1,153 @@
# Why Apollo Can Beat AdamW: Directional Sharpness

Source: Apollo paper Section 5.5, Pan & Li 2023, Zhang et al. 2024a

## The Puzzle

Apollo uses LESS information than AdamW (channel/tensor-wise scaling vs
per-element scaling). How can less information produce better results?

The paper proposes two hypotheses. Both are fascinating.

## Hypothesis 1: Directional Sharpness (Pan & Li, 2023)

### What is directional sharpness?

The directional sharpness of f at point x along direction v is:

```
v^T ∇²f(x) v    (where ‖v‖₂ = 1)
```

This is the curvature of the loss surface in the direction of the
update step. High sharpness means the surface curves steeply along the
step, as if the optimizer were stepping across a narrow ravine. Low
sharpness means the surface is flat along the step, as if the optimizer
were walking on a plateau.

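
To make the definition concrete, here is a minimal sketch on a toy quadratic whose Hessian is known exactly. The function name `directional_sharpness` and the 2×2 Hessian are illustrative assumptions, not taken from the paper; for a real model the Hessian would never be formed explicitly.

```python
import math

def directional_sharpness(hessian, direction):
    """Return v^T H v for the unit vector v = direction / ||direction||."""
    norm = math.sqrt(sum(d * d for d in direction))
    v = [d / norm for d in direction]
    # H v, computed row by row
    hv = [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in hessian]
    return sum(v_i * hv_i for v_i, hv_i in zip(v, hv))

# Toy quadratic f(x) = 0.5 * (100 * x0^2 + 1 * x1^2):
# steep along x0, nearly flat along x1. (Hypothetical example.)
H = [[100.0, 0.0],
     [0.0, 1.0]]

sharp = directional_sharpness(H, [1.0, 0.0])  # step along the steep axis
flat = directional_sharpness(H, [0.0, 3.0])   # step along the flat axis; length is normalized away
mixed = directional_sharpness(H, [1.0, 1.0])  # step halfway between the two axes

print(sharp, flat, mixed)
```

The same point x has very different sharpness depending on which way the optimizer chooses to step, which is why the quantity is a property of the optimizer's update direction and not of the loss surface alone.
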
### Why low sharpness is good

**Low directional sharpness = flat loss landscape in the update direction.**

A flat landscape means:
1. Large steps don't cause instability (the loss doesn't change sharply)
2. The solution tends to generalize better (flat minima are more robust to perturbation)
3. The optimizer can move faster without overshooting

Pan & Li (2023) showed that Adam achieves lower directional sharpness
than SGD, which partly explains why Adam works better for Transformers.

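
Points 1 and 3 can be read off a second-order Taylor expansion of the loss around the current point. The step size η and the unnormalized update direction u are notation introduced here for illustration, with v = u/‖u‖₂:

```latex
f(x - \eta u) \approx f(x) - \eta\, \nabla f(x)^{\top} u
  + \frac{\eta^{2} \lVert u \rVert_2^{2}}{2}\, v^{\top} \nabla^{2} f(x)\, v,
\qquad v = \frac{u}{\lVert u \rVert_2}
```

The last term is the directional sharpness scaled by the squared step length. When it is small, the first-order decrease dominates for a wider range of η, so larger steps remain safe.
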
### The Apollo twist

Apollo's Table 10 shows directional sharpness over training:

```
Epoch   SGD        Adam       APOLLO     APOLLO-Mini
2       1.959722   0.009242   0.006024   0.004017
5       1.512521   0.000509   0.000249   0.000107
10      2.471792   0.000242   0.000163   0.000056
20      3.207535   0.000399   0.000261   0.000101
```

**Apollo and Apollo-Mini achieve LOWER directional sharpness than Adam.**
At epoch 20, Apollo-Mini's sharpness is about 4× lower than Adam's
(0.000101 vs 0.000399).

This suggests Apollo finds FLATTER regions of the loss landscape. Flatter
regions tend to generalize better. The coarser scaling factor is actually
an advantage: it prevents the optimizer from navigating into sharp, narrow
valleys that AdamW's precise per-element scaling can find.

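
The ratios behind the "4× lower" claim can be checked directly from the table. This is a small sketch; the values are copied from the table above, and the helper name `ratio_vs_adam` is made up for illustration.

```python
# Directional sharpness values copied from the table above (Apollo paper, Table 10).
table = {
    2:  {"SGD": 1.959722, "Adam": 0.009242, "APOLLO": 0.006024, "APOLLO-Mini": 0.004017},
    5:  {"SGD": 1.512521, "Adam": 0.000509, "APOLLO": 0.000249, "APOLLO-Mini": 0.000107},
    10: {"SGD": 2.471792, "Adam": 0.000242, "APOLLO": 0.000163, "APOLLO-Mini": 0.000056},
    20: {"SGD": 3.207535, "Adam": 0.000399, "APOLLO": 0.000261, "APOLLO-Mini": 0.000101},
}

def ratio_vs_adam(epoch, optimizer):
    """How many times lower the optimizer's sharpness is than Adam's at this epoch."""
    row = table[epoch]
    return row["Adam"] / row[optimizer]

for epoch in sorted(table):
    print(epoch,
          round(ratio_vs_adam(epoch, "APOLLO"), 2),
          round(ratio_vs_adam(epoch, "APOLLO-Mini"), 2))
```

The gap varies by epoch, so the 4× figure describes epoch 20 specifically; what holds at every listed epoch is that both Apollo variants sit below Adam.
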
### The mechanism

AdamW's per-element scaling adapts to the local curvature of each
parameter independently. This is powerful for convergence but can lead
the optimizer into narrow, sharp valleys that generalize poorly. It
over-fits to the local loss landscape structure.

Apollo's coarser scaling (channel/tensor-wise) smooths over this local
curvature. It's like using a wider tire on a rocky road: you can't
follow every small dip, but you stay on the road. AdamW's narrow tire
follows every crack and sometimes falls in.

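
The difference in granularity can be sketched in a few lines. This is a deliberately simplified illustration, not Apollo's actual algorithm: it omits the low-rank random projection the paper uses to estimate the factors, treats matrix columns as "channels" (an assumption), and uses hypothetical gradient and second-moment values.

```python
import math

def per_element_update(grad, second_moment, eps=1e-8):
    """Adam-style: every entry is divided by the root of its own second moment."""
    return [[g / (math.sqrt(v) + eps) for g, v in zip(g_row, v_row)]
            for g_row, v_row in zip(grad, second_moment)]

def column_norm(matrix, j):
    """Euclidean norm of column j."""
    return math.sqrt(sum(row[j] ** 2 for row in matrix))

def channel_wise_update(grad, second_moment, eps=1e-8):
    """Coarse variant: one factor per column, the ratio of adapted to raw column norm."""
    adapted = per_element_update(grad, second_moment, eps)
    factors = [column_norm(adapted, j) / (column_norm(grad, j) + eps)
               for j in range(len(grad[0]))]
    update = [[g * factors[j] for j, g in enumerate(row)] for row in grad]
    return update, factors

grad = [[1.0, 2.0],
        [3.0, 4.0]]
second_moment = [[4.0, 1.0],
                 [0.25, 1.0]]  # hypothetical running averages of grad**2

fine = per_element_update(grad, second_moment)
coarse, factors = channel_wise_update(grad, second_moment)

print("per-element :", fine)
print("channel-wise:", coarse, "factors:", factors)
```

In the per-element update, the two entries of the first column are rescaled independently, so their ratio changes. In the channel-wise update the whole column shares one factor, so the raw gradient's direction within each channel is preserved. That is the "wider tire" in concrete terms.
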
### For our use case

**This is exactly what we want for behavioral fine-tuning.** We don't
want the optimizer to over-fit to the specific phrasing of our training
examples. We want it to learn the broad pattern ("listen to direction")
that generalizes to new situations.

Apollo's flat-minimum-seeking behavior means the behavioral changes
are more likely to generalize to novel conversations. AdamW might learn
"when Kent says 'use vLLM', accept it" (narrow, sharp minimum). Apollo
is more likely to learn "when given clear direction, accept it" (broad,
flat minimum).

## Hypothesis 2: Block-wise Adaptive Learning Rates

### Transformer block structure

Transformer layers have systematically different Hessian spectra.
Attention layers, MLP layers, normalization layers: each has different
curvature properties. The optimal learning rate for an attention weight
is different from the optimal learning rate for an MLP weight.

### Why channel-wise is enough

Zhang et al. (2024a) showed that block-wise adaptive learning rates
are sufficient for Transformer training. You don't need per-element
adaptation; you just need different rates for different structural
components.

Apollo's channel-wise scaling naturally provides this: each channel
(which often corresponds to a head, a neuron, or a structural feature)
gets its own scaling factor. This aligns with the Transformer's block
structure without the overhead of full per-element scaling.

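
As a rough sketch of what "different rates for different structural components" means, the snippet below assigns one effective learning rate per block. The block names, the gradient values, and the choice of statistic (root of the mean squared gradient over the block, from a single step instead of a running average) are all assumptions made for illustration; this is not the exact method of Zhang et al.

```python
import math

def blockwise_step_sizes(block_grads, base_lr=1e-3, eps=1e-8):
    """One effective learning rate per named block: base_lr / sqrt(mean(g^2))."""
    rates = {}
    for name, grads in block_grads.items():
        mean_sq = sum(g * g for g in grads) / len(grads)
        rates[name] = base_lr / (math.sqrt(mean_sq) + eps)
    return rates

# Hypothetical gradients for three blocks with very different scales.
block_grads = {
    "attn.q_proj": [0.02, -0.01, 0.03, -0.02],
    "mlp.up_proj": [0.5, -0.4, 0.6, -0.5],
    "norm.weight": [0.001, -0.002],
}

rates = blockwise_step_sizes(block_grads)
for name, rate in rates.items():
    print(f"{name}: {rate:.5f}")
```

Blocks with small gradients get a larger effective rate and blocks with large gradients get a smaller one, while every element inside a block shares the same rate.
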
### The redundancy argument

For a weight matrix [4096, 4096] in AdamW:
- ~16.8M independent scaling factors (4096 × 4096, one per element)
- Most adjacent elements have similar scaling factors (correlated
  because they participate in similar computations)
- The per-element granularity is mostly redundant noise on top of a
  smooth per-channel structure

Apollo extracts the per-channel structure and throws away the noise.
The noise was never helping; it was just costing memory.

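
The size of the reduction is simple arithmetic. The sketch below only counts scaling factors at each granularity; it is not a measurement of Apollo's total optimizer memory, which also includes its auxiliary low-rank state. The function name is made up for illustration.

```python
def scaling_factor_counts(rows, cols):
    """Number of independent scaling factors for one weight matrix at each granularity."""
    return {
        "per_element": rows * cols,  # AdamW-style
        "per_channel": cols,         # one per channel (columns assumed to be channels)
        "per_tensor": 1,             # one for the whole matrix
    }

counts = scaling_factor_counts(4096, 4096)
reduction = counts["per_element"] // counts["per_channel"]

print(counts)
print("per-element / per-channel =", reduction)
```
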
## The Deeper Implication: SGD + Structure = Adam without the Waste

Apollo is effectively: **SGD with structured learning-rate scaling.**

- SGD: one learning rate for everything (too coarse)
- AdamW: one learning rate per element (too fine, wasteful)
- Apollo: one learning rate per channel (just right)

The insight is that the useful information in AdamW's per-element
scaling lives in the channel structure, not the element-level detail.
Apollo extracts just the useful part.

This is a Goldilocks argument: too coarse loses important structure,
too fine adds noise that hurts generalization. The channel level is
where the meaningful optimization information lives in Transformers.

## For behavioral fine-tuning specifically

The directional sharpness result has a specific implication for us:

When we train on "listen instead of suggesting alternatives," we want
the gradient update to find a minimum that covers ALL situations where
listening is better, not just the specific example we trained on.

- **Sharp minimum** (AdamW tendency): "When you see the exact phrase
  'use vLLM's code' from Kent, accept it." Narrow, doesn't generalize.
- **Flat minimum** (Apollo tendency): "When given clear technical
  direction, accept it." Broad, generalizes to new situations.

Apollo's lower directional sharpness means it tends toward the
flat minimum. The coarseness of the scaling factor is what enables
this: it is less able to over-fit to the specific example because the
scaling doesn't have enough resolution to find the sharp, narrow valley.

This is why we might see behavioral changes generalize better with
Apollo than they would with AdamW, even though AdamW has "more
information" per update step.