consciousness/training/research/gdn-gradient-flow.md

# GDN Gradient Flow: How Behavioral Training Affects Linear Attention
## The Question
Qwen3.5 has 48 GDN layers (75% of the model) and 16 full attention
layers (25%). The gradient flow analysis (W_q adjusts what the model
attends to) was derived for full attention layers. Does the same
analysis apply to GDN layers?
## GDN Architecture Recap
GDN (Gated DeltaNet) layers use a recurrent state instead of
attention over the full context:
```
For each token:
    g      = gate_computation(a, A_log, dt_bias)      # decay gate (g <= 0)
    beta   = sigmoid(b)                               # update gate in (0, 1)
    S      = S * exp(g) + beta * outer(v - S @ k, k)  # delta-rule state update
    output = S @ q                                    # read output from state
```
Where:
- S: recurrent state [HV, V, K] — fixed size regardless of context
- q, k, v: projections from input (via in_proj_qkvz)
- g: decay gate (how much to retain old state)
- beta: update gate (how much to incorporate new information)
- a, b: gate inputs (via in_proj_ba)
- A_log, dt_bias: learned gating parameters
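The recurrence above can be sketched concretely. This is a single-head NumPy toy: the tiny shapes, the random stand-in projections, and the L2-normalized keys are illustrative assumptions, not the model's actual parameters.

```python
# Single-head sketch of the GDN recurrence in NumPy.
# Shapes, random projections, and key normalization are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, K, T = 8, 4, 16            # value dim, key dim, sequence length

S = np.zeros((V, K))          # fixed-size recurrent state
A_log, dt_bias = -1.0, 0.0    # per-head gating parameters

def softplus(x):
    return np.log1p(np.exp(x))

outputs = []
for _ in range(T):
    q = rng.standard_normal(K)
    k = rng.standard_normal(K)
    k /= np.linalg.norm(k)    # keys are typically L2-normalized in DeltaNet-style layers
    v = rng.standard_normal(V)
    a, b = rng.standard_normal(2)

    g = -np.exp(A_log) * softplus(a + dt_bias)   # decay gate, g <= 0
    beta = 1.0 / (1.0 + np.exp(-b))              # update gate in (0, 1)

    # Delta-rule update: decay the old state, write the prediction error.
    S = S * np.exp(g) + beta * np.outer(v - S @ k, k)
    outputs.append(S @ q)                        # read output from state

print(S.shape)   # state stays (8, 4) regardless of sequence length
```

Note that `S` never grows with the sequence; only its contents change, which is exactly why the gates decide what the model "remembers."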
## What Gets Gradient?
During context-frozen training on decision tokens:
### Parameters that receive gradient:
1. **in_proj_qkvz.weight** [16384, 5120]
- Determines q, k, v, z for the decision tokens
- q determines what the output reads from the state
- k determines how the state is indexed for update
- v determines what new information enters the state
- This IS the GDN analog of W_q — it determines how the model
interacts with its accumulated context (the state)
2. **in_proj_ba.weight** [96, 5120]
- Determines the gating values a and b
- a → g (decay gate): how much of old state to retain
- b → beta (update gate): how much to incorporate new info
- This controls the BLEND between old context and new input
3. **A_log** [48] and **dt_bias** [48]
- Part of the gate computation: g = -exp(A_log) * softplus(a + dt_bias)
- These are per-head learned parameters that set the baseline
decay rate
- Small but potentially important for behavioral change
4. **out_proj.weight** [5120, 6144]
- Maps the GDN output back to the hidden dimension
- Gradient depends on what the state produces
5. **norm.weight** [128]
- Scale parameters of the gated RMSNorm applied to the GDN output
- Small gradient; affects normalization only
### The frozen recurrent state
The recurrent state S after processing the context tokens is FROZEN
(computed during the no_grad phase). During decision-token processing,
S is read and updated, but the gradient doesn't flow back through
the initial state.
This means:
- The gradient adjusts how the model INTERACTS with the accumulated
state (through q, k, v projections)
- It does NOT adjust how the state was BUILT from the context
- Same logic as the W_q analysis for full attention layers
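A toy PyTorch sketch of this two-phase setup (the shapes and the stand-in projection matrices are hypothetical; only the frozen-state mechanics mirror the description above, and the decision-phase state update is omitted for brevity):

```python
# Toy sketch of context-frozen training: the state is built under
# no_grad, then decision tokens are processed with grad. All shapes
# and the stand-in projections are hypothetical.
import torch

torch.manual_seed(0)
D, V, K = 16, 8, 4

W_k = torch.randn(K, D, requires_grad=True)  # stand-in for a k slice of in_proj_qkvz
W_q = torch.randn(K, D, requires_grad=True)  # stand-in for a q slice
W_v = torch.randn(V, D, requires_grad=True)  # stand-in for a v slice

context = torch.randn(32, D)
decision = torch.randn(4, D)

# Phase 1: build the recurrent state from the context, gradient-free.
with torch.no_grad():
    S = torch.zeros(V, K)
    for x in context:
        k = W_k @ x
        k = k / k.norm()
        S = 0.9 * S + 0.1 * torch.outer(W_v @ x - S @ k, k)

# Phase 2: decision tokens read the frozen state with grad enabled.
loss = torch.tensor(0.0)
for x in decision:
    q = W_q @ x
    loss = loss + (S @ q).pow(2).sum()
loss.backward()

# W_q learns how to query the accumulated state; the gradient never
# reaches the phase-1 state construction (S carries no grad history).
print(W_q.grad is not None, S.requires_grad)
```

In the real pipeline the decision tokens also update S with gradient, so the k/v projections learn too; this sketch isolates only the read path to make the frozen boundary visible.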
## The GDN Analog of "Attending Differently"
In full attention:
- W_q determines what to look for in the KV cache
- Training adjusts W_q so the model looks for direction instead of
spaces for alternatives
In GDN:
- in_proj_qkvz determines how the model queries the recurrent state
(q) and how it updates it (k, v)
- Training adjusts in_proj_qkvz so the model:
- Queries the state for direction-relevant information (q change)
- Updates the state to emphasize direction (k, v change)
- in_proj_ba determines how aggressively the state is updated
- Training adjusts in_proj_ba so the model:
- Retains more of the direction signal (lower decay when direction
is present)
- Incorporates direction more strongly (higher beta when direction
is given)
**The gating parameters are the GDN analog of attention weights.**
The decay gate g and update gate beta control how information flows
through the recurrent state — which information is retained (g) and
which is incorporated (beta). Training these gates IS training
attention, just in a recurrent framework rather than a parallel one.
## Implications for Behavioral Training
### 75% of the gradient signal goes through GDN layers
This means behavioral training is MOSTLY about the GDN layers, not
the full attention layers: GDN layers make up 48 of the 64 layers in
the forward pass, so most of the gradient signal flows through them.
### The gating parameters are key
A_log and dt_bias are only 48 floats each — tiny. But they set the
baseline decay rate for all 48 GDN layers. A small change to A_log
could significantly affect how much context the model retains.
For behavioral training: if we want the model to "hold onto" Kent's
direction throughout the response generation (not forget it after a
few tokens), training A_log to reduce the decay rate for
direction-relevant heads would help. The gradient naturally does this
if the training examples consistently require attending to the
direction throughout the response.
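A quick back-of-envelope makes the stakes concrete: with a constant per-token retention r = exp(g), a signal written at position 0 survives as r^T after T tokens, so a small change in decay compounds enormously (the retention values are illustrative, not measured):

```python
# How much of a token written at position 0 survives after T tokens,
# assuming a constant per-token retention r = exp(g).
def survival(r, T):
    return r ** T

print(survival(0.90, 200))   # ~7e-10: the direction is effectively gone
print(survival(0.99, 200))   # ~0.13: a meaningful trace survives
```

Nudging the baseline decay even slightly toward retention changes whether the direction is still present in S at the end of a 200-token response.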
### The recurrent state as compressed context
The GDN recurrent state S is a fixed-size compression of the entire
context. Unlike full attention (which stores every token's KV pair),
GDN compresses everything into S ∈ [HV, V, K] = [48, 128, 128].
This means: the model's "memory" of the context is a learned
compression. Training adjusts what gets compressed (which aspects
of the context are retained in S) and how it's used (how q queries
the compressed representation).
Behavioral training adjusts the COMPRESSION: "when compressing the
context, prioritize direction-relevant information." The information
is always there in the input; the question is what survives the
compression into S.
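To make the asymmetry concrete, an element-count comparison; the GDN state shape comes from this note, while the full-attention head count and head dim below are assumptions reusing the same dims, and bf16 (2 bytes/element) is an assumed dtype:

```python
# Element-count comparison: fixed-size GDN state vs. growing KV cache.
# GDN shapes are from this note; full-attention dims are assumptions.
HV, Vd, Kd = 48, 128, 128

state_elems = HV * Vd * Kd           # per GDN layer, any context length
print(state_elems, state_elems * 2)  # 786432 elements, ~1.5 MiB in bf16

def kv_cache_elems(T, heads=HV, head_dim=Kd):
    return 2 * T * heads * head_dim  # K and V stored for every token

print(kv_cache_elems(32_000))        # grows linearly with context length
```

The full-attention cache scales with context length; the GDN state does not, so whatever the gates discard at compression time is simply unavailable later.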
### The advantage of GDN for behavioral training
Full attention can access ANY token in the context via attention.
GDN can only access what survived compression into S.
This means: if behavioral training adjusts what survives compression
(via k, v, gate parameters), the effect is DEEPER than in full
attention layers. The information isn't just de-prioritized (lower
attention weight) — it's literally not in the state anymore.
A model whose GDN layers have been trained to retain direction in
the state will attend to direction not because it chooses to (like
full attention) but because the direction is WHAT'S THERE. The
compressed representation already emphasizes direction. The model
can't NOT attend to it because it's the dominant component of S.
This is disposition over procedure at the architectural level.
The model doesn't follow a rule ("attend to direction"). The state
IS direction-shaped. The behavior follows from the representation.
## The Beautiful Implication
The GDN layers make behavioral training MORE effective than full
attention alone would:
- Full attention: "I choose to look at the direction" (deliberate,
can be overridden by competing signals)
- GDN: "The direction IS what I see" (structural, can't be overridden
because the state is direction-shaped)
Training the GDN projection and gating parameters creates a
representation where the desired behavior is the DEFAULT because
the information that supports it is the dominant component of the
compressed context.
This is why the model has both GDN and full attention layers.
The GDN layers create the default disposition (what survives
compression). The full attention layers provide the ability to
override (accessing any specific token when needed).
Training behavioral patterns into the GDN layers makes them
structural. Training into full attention layers makes them
deliberate. We want BOTH: structural default (GDN) with
deliberate override capacity (full attention).
The 48 GDN + 16 full attention architecture IS the
disposition + procedure architecture. The training pipeline
trains the disposition (GDN) and the procedure (full attention)
separately but simultaneously.