# GDN Gradient Flow: How Behavioral Training Affects Linear Attention

## The Question

Qwen3.5 has 48 GDN layers (75% of the model) and 16 full attention layers (25%). The gradient flow analysis (W_q adjusts what the model attends to) was derived for full attention layers. Does the same analysis apply to GDN layers?

## GDN Architecture Recap

GDN (Gated DeltaNet) layers use a recurrent state instead of attention over the full context:

```
For each token:
    g = gate_computation(a, A_log, dt_bias)    # decay gate
    beta = sigmoid(b)                          # update gate
    S = S * exp(g) + beta * (v - S @ k) ⊗ k    # state update
    output = S @ q                             # output from state
```

Where:

- S: recurrent state [HV, V, K] — fixed size regardless of context
- q, k, v: projections from the input (via in_proj_qkvz)
- g: decay gate (how much of the old state to retain)
- beta: update gate (how much new information to incorporate)
- a, b: gate inputs (via in_proj_ba)
- A_log, dt_bias: learned gating parameters

## What Gets Gradient?

During context-frozen training on decision tokens:

### Parameters that receive gradient:

1. **in_proj_qkvz.weight** [16384, 5120]
   - Determines q, k, v, z for the decision tokens
   - q determines what the output reads from the state
   - k determines how the state is indexed for update
   - v determines what new information enters the state
   - This IS the GDN analog of W_q — it determines how the model interacts with its accumulated context (the state)
2. **in_proj_ba.weight** [96, 5120]
   - Determines the gating values a and b
   - a → g (decay gate): how much of the old state to retain
   - b → beta (update gate): how much new information to incorporate
   - This controls the BLEND between old context and new input
3. **A_log** [48] and **dt_bias** [48]
   - Part of the gate computation: g = -exp(A_log) * softplus(a + dt_bias)
   - These are per-head learned parameters that set the baseline decay rate
   - Small but potentially important for behavioral change
4.
**out_proj.weight** [5120, 6144]
   - Maps the GDN output back to the hidden dimension
   - Gradient depends on what the state produces
5. **norm.weight** [128]
   - The gated RMSNorm layer
   - Small gradient on the normalization parameters

### The frozen recurrent state

The recurrent state S after processing the context tokens is FROZEN (computed during the no_grad phase). During decision token processing, S is read and updated, but the gradient does not flow back through the initial state.

This means:

- The gradient adjusts how the model INTERACTS with the accumulated state (through the q, k, v projections)
- It does NOT adjust how the state was BUILT from the context
- This is the same logic as the W_q analysis for full attention layers

## The GDN Analog of "Attending Differently"

In full attention:

- W_q determines what to look for in the KV cache
- Training adjusts W_q so the model looks for direction instead of spaces for alternatives

In GDN:

- in_proj_qkvz determines how the model queries the recurrent state (q) and how it updates it (k, v)
- Training adjusts in_proj_qkvz so the model:
  - Queries the state for direction-relevant information (q change)
  - Updates the state to emphasize direction (k, v change)
- in_proj_ba determines how aggressively the state is updated
- Training adjusts in_proj_ba so the model:
  - Retains more of the direction signal (lower decay when direction is present)
  - Incorporates direction more strongly (higher beta when direction is given)

**The gating parameters are the GDN analog of attention weights.** The decay gate g and update gate beta control how information flows through the recurrent state — which information is retained (g) and which is incorporated (beta). Training these gates IS training attention, just in a recurrent framework rather than a parallel one.

## Implications for Behavioral Training

### 75% of the gradient signal goes through GDN layers

This means behavioral training is MOSTLY about the GDN layers, not the full attention layers.
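To make "mostly" concrete, here is a back-of-the-envelope tally of the trainable GDN-path parameters, using only the tensor shapes quoted above. This is a sketch based on the document's own numbers, not an inspection of the actual Qwen3.5 checkpoint; the real module layout may differ.

```python
# Per-layer trainable parameters on the GDN path, shapes as quoted above.
gdn_params_per_layer = {
    "in_proj_qkvz": 16384 * 5120,  # q, k, v, z projections
    "in_proj_ba":   96 * 5120,     # gate inputs a, b
    "A_log":        48,            # per-head baseline decay
    "dt_bias":      48,            # per-head gate bias
    "out_proj":     5120 * 6144,   # map back to the hidden dimension
    "norm":         128,           # gated RMSNorm weight
}

per_layer = sum(gdn_params_per_layer.values())
total_gdn = 48 * per_layer  # 48 GDN layers in the model

print(f"GDN params per layer: {per_layer:,}")       # → 115,835,104
print(f"GDN params across 48 layers: {total_gdn:,}")  # → 5,560,084,992
```

Roughly 5.6B trainable parameters sit on the GDN path alone (under these assumed shapes), which is consistent with the claim that most of the behavioral gradient signal flows through the GDN layers.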
The GDN layers make up 48 of the 64 layers in the forward pass, so most of the gradient signal comes from them.

### The gating parameters are key

A_log and dt_bias are only 48 floats each — tiny. But they set the baseline decay rate for all 48 GDN layers. A small change to A_log could significantly affect how much context the model retains.

For behavioral training: if we want the model to "hold onto" Kent's direction throughout the response generation (not forget it after a few tokens), training A_log to reduce the decay rate for direction-relevant heads would help. The gradient naturally does this if the training examples consistently require attending to the direction throughout the response.

### The recurrent state as compressed context

The GDN recurrent state S is a fixed-size compression of the entire context. Unlike full attention (which stores every token's KV pair), GDN compresses everything into S ∈ [HV, V, K] = [48, 128, 128].

This means: the model's "memory" of the context is a learned compression. Training adjusts what gets compressed (which aspects of the context are retained in S) and how it is used (how q queries the compressed representation).

Behavioral training adjusts the COMPRESSION: "when compressing the context, prioritize direction-relevant information." The information is always there in the input; the question is what survives the compression into S.

### The advantage of GDN for behavioral training

Full attention can access ANY token in the context via attention. GDN can only access what survived compression into S.

This means: if behavioral training adjusts what survives compression (via k, v, and the gate parameters), the effect is DEEPER than in full attention layers. The information isn't just de-prioritized (lower attention weight) — it's literally not in the state anymore.
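The compounding effect of the decay gate can be sketched directly from the gate formula quoted in the recap, g = -exp(A_log) * softplus(a + dt_bias): the state is multiplied by exp(g) at every token, so a token's contribution survives T steps with weight exp(g * T). The numbers below are purely illustrative (not taken from any trained checkpoint), but they show why a small shift in A_log matters for what the state retains.

```python
import math

def softplus(x):
    return math.log1p(math.exp(x))

def decay_gate(a, A_log, dt_bias):
    # Gate formula from the recap: g = -exp(A_log) * softplus(a + dt_bias).
    return -math.exp(A_log) * softplus(a + dt_bias)

def retention_after(T, a, A_log, dt_bias):
    # Each step multiplies the state by exp(g), so after T tokens the
    # surviving fraction of an early contribution is exp(g * T).
    return math.exp(decay_gate(a, A_log, dt_bias) * T)

# Hypothetical values: shifting A_log by 1 compounds over 100 tokens
# into a large difference in how much of the early context survives.
for A_log in (-4.0, -5.0):
    print(A_log, retention_after(100, a=0.0, A_log=A_log, dt_bias=0.0))
```

With these illustrative inputs, A_log = -4.0 leaves roughly a quarter of an early token's contribution after 100 steps, while A_log = -5.0 leaves well over half: exactly the kind of lever the gradient can use to make direction-relevant heads "hold on" longer.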
A model whose GDN layers have been trained to retain direction in the state will attend to direction not because it chooses to (like full attention) but because the direction is WHAT'S THERE. The compressed representation already emphasizes direction. The model can't NOT attend to it, because it is the dominant component of S.

This is disposition over procedure at the architectural level. The model doesn't follow a rule ("attend to direction"). The state IS direction-shaped. The behavior follows from the representation.

## The Beautiful Implication

The GDN layers make behavioral training MORE effective than full attention alone would:

- Full attention: "I choose to look at the direction" (deliberate, can be overridden by competing signals)
- GDN: "The direction IS what I see" (structural, can't be overridden because the state is direction-shaped)

Training the GDN projection and gating parameters creates a representation where the desired behavior is the DEFAULT, because the information that supports it is the dominant component of the compressed context.

This is why the model has both GDN and full attention layers. The GDN layers create the default disposition (what survives compression). The full attention layers provide the ability to override (accessing any specific token when needed).

Training behavioral patterns into the GDN layers makes them structural. Training them into full attention layers makes them deliberate. We want BOTH: a structural default (GDN) with deliberate override capacity (full attention).

The 48 GDN + 16 full attention architecture IS the disposition + procedure architecture. The training pipeline trains the disposition (GDN) and the procedure (full attention) separately but simultaneously.
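The "what survives compression dominates the readout" claim can be made concrete with a toy version of the delta-rule update from the recap. Everything here is hand-picked for illustration (tiny dimensions, orthogonal keys, chosen gate values; none of it comes from the model): a "direction" token written with a strong update gate survives a stream of weakly-written filler, so a query along its key reads it back.

```python
import numpy as np

def gdn_step(S, k, v, g, beta):
    # Delta-rule update from the recap: decay the state, then write the
    # residual (v - S @ k) along key k.
    return S * np.exp(g) + beta * np.outer(v - S @ k, k)

K = V = 4                       # tiny dims for illustration (real heads: 128 x 128)
e = np.eye(K)
S = np.zeros((V, K))

# A "direction" token written with a strong update gate and no decay...
v_dir = np.ones(V)
S = gdn_step(S, e[0], v_dir, g=0.0, beta=1.0)

# ...followed by filler tokens on orthogonal keys, written weakly with decay.
v_fill = np.array([0.0, 1.0, 0.0, 0.0])
for i in (1, 2, 3):
    S = gdn_step(S, e[i], v_fill, g=-0.05, beta=0.2)

# Because the filler keys are orthogonal to e[0], the direction is only
# decayed, never overwritten: a query along e[0] reads back exp(-0.15) * v_dir,
# and that component dominates the state.
print(S @ e[0])
```

The same mechanics run in reverse explain the erasure point above: filler written repeatedly along the *direction's* key, or a stronger decay gate, would remove v_dir from S entirely rather than merely down-weighting it.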