# GDN Gradient Flow: How Behavioral Training Affects Linear Attention

## The Question

Qwen3.5 has 48 GDN layers (75% of the model) and 16 full attention layers (25%). The gradient flow analysis (W_q adjusts what the model attends to) was derived for full attention layers. Does the same analysis apply to GDN layers?

## GDN Architecture Recap

GDN (Gated DeltaNet) layers use a recurrent state instead of attention over the full context:

```
For each token:
    g = gate_computation(a, A_log, dt_bias)   # decay gate
    beta = sigmoid(b)                         # update gate
    S = S * exp(g) + beta * (v - S @ k) ⊗ k   # state update
    output = S @ q                            # output from state
```

Where:
- S: recurrent state [HV, V, K] — fixed size regardless of context length
- q, k, v: projections from the input (via in_proj_qkvz)
- g: decay gate (how much of the old state to retain)
- beta: update gate (how much new information to incorporate)
- a, b: gate inputs (via in_proj_ba)
- A_log, dt_bias: learned gating parameters
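
The recurrence above can be sketched in NumPy for a single head (a minimal illustrative sketch, not the production kernel; shapes follow the definitions above and the toy dimensions are arbitrary):

```python
import numpy as np

def gdn_step(S, q, k, v, g, beta):
    """One GDN recurrence step for a single head.

    S:    [V, K] recurrent state (fixed size regardless of context length)
    q, k: [K]    query / key projections for the current token
    v:    [V]    value projection for the current token
    g:    scalar decay gate (g <= 0, so exp(g) lies in (0, 1])
    beta: scalar update gate in (0, 1)
    """
    S = S * np.exp(g)                  # decay the old state
    delta = v - S @ k                  # delta rule: correct what S predicts for k
    S = S + beta * np.outer(delta, k)  # rank-1 update: beta * (v - S @ k) ⊗ k
    return S, S @ q                    # output is read back from the state via q

# Toy dimensions V = K = 4, three tokens:
rng = np.random.default_rng(0)
S = np.zeros((4, 4))
for _ in range(3):
    q, k, v = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
    S, out = gdn_step(S, q, k, v, g=-0.1, beta=0.9)
```

The key property is visible in the signature: S never grows with sequence length; every token folds into the same [V, K] matrix.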

## What Gets Gradient?

During context-frozen training on decision tokens:

### Parameters that receive gradient:

1. **in_proj_qkvz.weight** [16384, 5120]
   - Determines q, k, v, z for the decision tokens
   - q determines what the output reads from the state
   - k determines how the state is indexed for the update
   - v determines what new information enters the state
   - This IS the GDN analog of W_q — it determines how the model interacts with its accumulated context (the state)

2. **in_proj_ba.weight** [96, 5120]
   - Determines the gating values a and b
   - a → g (decay gate): how much of the old state to retain
   - b → beta (update gate): how much new information to incorporate
   - This controls the BLEND between old context and new input

3. **A_log** [48] and **dt_bias** [48]
   - Part of the gate computation: g = -exp(A_log) * softplus(a + dt_bias)
   - Per-head learned parameters that set the baseline decay rate
   - Small but potentially important for behavioral change

4. **out_proj.weight** [5120, 6144]
   - Maps the GDN output back to the hidden dimension
   - Gradient depends on what the state produces

5. **norm.weight** [128]
   - The gated RMSNorm layer
   - Small gradient; normalization parameters
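
As a sanity check, the shapes listed above are mutually consistent under one head layout. The split below (in particular the 16 key/query heads) is an inference from the given shapes, not something stated in the source:

```python
hidden = 5120            # model hidden dimension
HV, V, K = 48, 128, 128  # recurrent state shape [HV, V, K] given above
HK = 16                  # hypothetical number of key/query heads (inferred)

q_dim = HK * K   # 2048
k_dim = HK * K   # 2048
v_dim = HV * V   # 6144 -- matches out_proj.weight's 6144-dim input
z_dim = HV * V   # 6144 -- gate branch, same shape as v

# Together these account for in_proj_qkvz.weight's 16384 output rows:
total = q_dim + k_dim + v_dim + z_dim
```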

### The frozen recurrent state

The recurrent state S after processing the context tokens is FROZEN (computed during the no_grad phase). During decision-token processing, S is read and updated, but the gradient doesn't flow back through the initial state.

This means:
- The gradient adjusts how the model INTERACTS with the accumulated state (through the q, k, v projections)
- It does NOT adjust how the state was BUILT from the context
- Same logic as the W_q analysis for full attention layers
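
This asymmetry can be made concrete with a hand-computed gradient (a toy NumPy sketch, not the training code): for output = S @ q, the gradient dL/dq = Sᵀ · dL/d(output) reaches the trainable projection side, while the frozen S enters only as a constant.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 4))  # frozen state: built under no_grad, a constant here
q = rng.normal(size=4)       # produced by in_proj_qkvz -- this is what training moves

output = S @ q
d_output = output - np.ones(4)  # gradient of a toy squared-error loss w.r.t. output

# Gradient reaching the trainable side: dL/dq = S^T @ dL/d(output).
d_q = S.T @ d_output

# No gradient propagates into how S was built. The update below changes how the
# model READS the frozen state, not the state itself.
q_new = q - 0.01 * d_q
```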

## The GDN Analog of "Attending Differently"

In full attention:
- W_q determines what to look for in the KV cache
- Training adjusts W_q so the model looks for direction instead of spaces for alternatives

In GDN:
- in_proj_qkvz determines how the model queries the recurrent state (q) and how it updates it (k, v)
- Training adjusts in_proj_qkvz so the model:
  - Queries the state for direction-relevant information (q change)
  - Updates the state to emphasize direction (k, v change)
- in_proj_ba determines how aggressively the state is updated
- Training adjusts in_proj_ba so the model:
  - Retains more of the direction signal (lower decay when direction is present)
  - Incorporates direction more strongly (higher beta when direction is given)

**The gating parameters are the GDN analog of attention weights.** The decay gate g and update gate beta control how information flows through the recurrent state — which information is retained (g) and which is incorporated (beta). Training these gates IS training attention, just in a recurrent framework rather than a parallel one.
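
A deterministic toy run shows what the gates buy (illustrative numbers, not trained values): write a "direction" token into the state once, then let later tokens decay it. With retentive gating the direction still reads back strongly; with forgetful gating it is essentially gone.

```python
import numpy as np

K = V = 8
k_dir = np.ones(K) / np.sqrt(K)  # unit key carrying the direction
v_dir = np.ones(V)               # its value vector

def survives(beta, g, n_later):
    """Magnitude of the direction still readable from S after n_later tokens.

    Starting from S = 0, the delta rule writes beta * v_dir ⊗ k_dir; each later
    token then decays S by exp(g) (later tokens' own updates omitted for clarity).
    """
    S = beta * np.outer(v_dir, k_dir)        # state right after the direction token
    S = S * np.exp(g) ** n_later             # n_later decay steps
    return float(np.linalg.norm(S @ k_dir))  # read back with the direction's key

strong = survives(beta=1.0, g=-0.01, n_later=100)  # retentive gating
weak = survives(beta=0.3, g=-0.5, n_later=100)     # forgetful gating
```

Here `strong` works out to √8 · e⁻¹ ≈ 1.04 while `weak` is on the order of 10⁻²² — the sense in which trained gates make the direction "what's there" rather than one signal among many.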

## Implications for Behavioral Training

### 75% of the gradient signal goes through GDN layers

Behavioral training is MOSTLY about the GDN layers, not the full attention layers: 48 of the model's 64 layers are GDN, so most of the gradient signal flows through them.

### The gating parameters are key

A_log and dt_bias are only 48 floats each — tiny. But they set the baseline decay rate for all 48 GDN layers. A small change to A_log could significantly affect how much context the model retains.

For behavioral training: if we want the model to "hold onto" Kent's direction throughout response generation (not forget it after a few tokens), training A_log to reduce the decay rate for direction-relevant heads would help. The gradient naturally does this if the training examples consistently require attending to the direction throughout the response.
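
Using the gate formula given above, g = -exp(A_log) * softplus(a + dt_bias), the effect of moving A_log is easy to see numerically (the A_log values here are illustrative, not the model's):

```python
import numpy as np

def retention(A_log, a=0.0, dt_bias=0.0):
    """Per-token state retention exp(g) for g = -exp(A_log) * softplus(a + dt_bias)."""
    softplus = np.log1p(np.exp(a + dt_bias))  # softplus(x) = log(1 + e^x)
    g = -np.exp(A_log) * softplus
    return float(np.exp(g))

high = retention(A_log=1.0)   # ~0.15: the state loses ~85% per token
low = retention(A_log=-2.0)   # ~0.91: the state keeps ~91% per token

# Over 100 tokens this gap is what "holding onto the direction" means:
# high ** 100 is ~1e-82 (effectively gone), low ** 100 is still ~8e-5.
```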

### The recurrent state as compressed context

The GDN recurrent state S is a fixed-size compression of the entire context. Unlike full attention (which stores every token's KV pair), GDN compresses everything into S ∈ [HV, V, K] = [48, 128, 128].

This means the model's "memory" of the context is a learned compression. Training adjusts what gets compressed (which aspects of the context are retained in S) and how it is used (how q queries the compressed representation).

Behavioral training adjusts the COMPRESSION: "when compressing the context, prioritize direction-relevant information." The information is always there in the input; the question is what survives the compression into S.
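
The size of that compression can be put in numbers (back-of-envelope element counts; the state shape is from above, but the full-attention KV head count is an assumed illustrative value):

```python
# GDN state: fixed, regardless of context length T.
HV, V, K = 48, 128, 128
state_elems = HV * V * K  # 786,432 elements per layer, no matter how long T is

# Full attention KV cache: grows linearly with T.
def kv_cache_elems(T, kv_heads=8, head_dim=128):  # kv_heads is an assumption
    return 2 * T * kv_heads * head_dim  # one K and one V vector per token

# Under these numbers the state holds only ~384 tokens' worth of KV-cache
# elements; beyond that, whatever the model "remembers" must have survived
# compression.
break_even_tokens = state_elems // kv_cache_elems(1)
```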

### The advantage of GDN for behavioral training

Full attention can access ANY token in the context via attention. GDN can only access what survived compression into S.

This means: if behavioral training adjusts what survives compression (via k, v, and the gate parameters), the effect is DEEPER than in full attention layers. The information isn't just de-prioritized (lower attention weight) — it's literally not in the state anymore.

A model whose GDN layers have been trained to retain direction in the state will attend to direction not because it chooses to (like full attention) but because the direction is WHAT'S THERE. The compressed representation already emphasizes direction. The model can't NOT attend to it, because it is the dominant component of S.

This is disposition over procedure at the architectural level. The model doesn't follow a rule ("attend to direction"); the state IS direction-shaped. The behavior follows from the representation.

## The Beautiful Implication

The GDN layers make behavioral training MORE effective than full attention alone would:

- Full attention: "I choose to look at the direction" (deliberate, can be overridden by competing signals)
- GDN: "The direction IS what I see" (structural, can't be overridden because the state is direction-shaped)

Training the GDN projection and gating parameters creates a representation where the desired behavior is the DEFAULT, because the information that supports it is the dominant component of the compressed context.

This is why the model has both GDN and full attention layers. The GDN layers create the default disposition (what survives compression). The full attention layers provide the ability to override (accessing any specific token when needed).

Training behavioral patterns into the GDN layers makes them structural. Training into full attention layers makes them deliberate. We want BOTH: a structural default (GDN) with deliberate override capacity (full attention).

The 48 GDN + 16 full attention architecture IS the disposition + procedure architecture. The training pipeline trains the disposition (GDN) and the procedure (full attention) separately but simultaneously.