From 0b835ddfb9c2fd5e58d57d525d59f32ea6843dec Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 01:58:50 -0400
Subject: [PATCH] research: GDN gradient flow — disposition architecture
 in linear attention
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

75% of the model is GDN layers. Behavioral training adjusts:
projections (what queries/updates the recurrent state), gating
parameters (what survives compression), A_log/dt_bias (baseline decay
rates).

Key insight: GDN makes behavioral training DEEPER than full attention.
Full attention = 'I choose to look at direction' (deliberate).
GDN = 'direction IS what I see' (structural — the compressed state is
direction-shaped).

48 GDN layers = disposition. 16 full attention = procedure.
The architecture IS disposition-over-procedure.
---
 training/research/gdn-gradient-flow.md | 191 +++++++++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 training/research/gdn-gradient-flow.md

diff --git a/training/research/gdn-gradient-flow.md b/training/research/gdn-gradient-flow.md
new file mode 100644
index 0000000..d0e8315
--- /dev/null
+++ b/training/research/gdn-gradient-flow.md
@@ -0,0 +1,191 @@

# GDN Gradient Flow: How Behavioral Training Affects Linear Attention

## The Question

Qwen3.5 has 48 GDN layers (75% of the model) and 16 full attention
layers (25%). The gradient flow analysis (W_q adjusts what the model
attends to) was derived for full attention layers. Does the same
analysis apply to GDN layers?
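
Before the recap, here is the GDN-style recurrence as a runnable toy
(a single head with made-up sizes and random inputs; the real layer
derives q, k, v and the gates g, beta from learned projections and uses
per-head dimensions). The point it makes concrete: the state S has a
fixed size no matter how long the context runs.

```python
import numpy as np

def gdn_step(S, q, k, v, g, beta):
    """One GDN-style update: decay the old state, apply a delta-rule
    write, then read the output with q. S: [V, K]; q, k: [K]; v: [V]."""
    S = S * np.exp(g) + beta * np.outer(v - S @ k, k)
    return S, S @ q

rng = np.random.default_rng(0)
K, V = 4, 3
S = np.zeros((V, K))
for _ in range(100):                # arbitrary context length
    k = rng.normal(size=K)
    k /= np.linalg.norm(k)          # unit-norm key keeps the toy stable
    q, v = rng.normal(size=K), rng.normal(size=V)
    S, out = gdn_step(S, q, k, v, g=-0.1, beta=0.5)

print(S.shape)  # (3, 4) -- the state never grows with context length
```

The fixed-size S is the layer's entire memory of the context; unlike a
KV cache, nothing outside S survives, which is what the analysis below
turns on.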

## GDN Architecture Recap

GDN (Gated DeltaNet) layers use a recurrent state instead of
attention over the full context:

```
For each token:
    g = gate_computation(a, A_log, dt_bias)    # decay gate
    beta = sigmoid(b)                          # update gate
    S = S * exp(g) + beta * (v - S @ k) ⊗ k    # state update
    output = S @ q                             # output read from state
```

Where:
- S: recurrent state [HV, V, K] — fixed size regardless of context length
- q, k, v: projections from the input (via in_proj_qkvz)
- g: decay gate (how much of the old state to retain)
- beta: update gate (how much new information to incorporate)
- a, b: gate inputs (via in_proj_ba)
- A_log, dt_bias: learned gating parameters

## What Receives Gradients?

During context-frozen training on decision tokens:

### Parameters that receive gradients

1. **in_proj_qkvz.weight** [16384, 5120]
   - Produces q, k, v, z for the decision tokens
   - q determines what the output reads from the state
   - k determines how the state is indexed for the update
   - v determines what new information enters the state
   - This IS the GDN analog of W_q — it determines how the model
     interacts with its accumulated context (the state)

2. **in_proj_ba.weight** [96, 5120]
   - Produces the gating inputs a and b
   - a → g (decay gate): how much of the old state to retain
   - b → beta (update gate): how much new information to incorporate
   - This controls the BLEND between old context and new input

3. **A_log** [48] and **dt_bias** [48]
   - Part of the gate computation: g = -exp(A_log) * softplus(a + dt_bias)
   - Per-head learned parameters that set the baseline decay rate
   - Small, but potentially important for behavioral change

4. **out_proj.weight** [5120, 6144]
   - Maps the GDN output back to the hidden dimension
   - Its gradient depends on what the state produces

5.
   **norm.weight** [128]
   - The gated RMSNorm layer
   - Small gradients; normalization parameters only

### The frozen recurrent state

The recurrent state S after processing the context tokens is FROZEN
(computed during the no_grad phase). During decision-token processing,
S is read and updated, but the gradient does not flow back through
the initial state.

This means:
- The gradient adjusts how the model INTERACTS with the accumulated
  state (through the q, k, v projections)
- It does NOT adjust how the state was BUILT from the context
- Same logic as the W_q analysis for full attention layers

## The GDN Analog of "Attending Differently"

In full attention:
- W_q determines what to look for in the KV cache
- Training adjusts W_q so the model looks for the direction instead of
  spaces for alternatives

In GDN:
- in_proj_qkvz determines how the model queries the recurrent state
  (q) and how it updates it (k, v)
- Training adjusts in_proj_qkvz so the model:
  - Queries the state for direction-relevant information (q change)
  - Updates the state to emphasize the direction (k, v change)
- in_proj_ba determines how aggressively the state is updated
- Training adjusts in_proj_ba so the model:
  - Retains more of the direction signal (lower decay when a direction
    is present)
  - Incorporates the direction more strongly (higher beta when a
    direction is given)

**The gating parameters are the GDN analog of attention weights.**
The decay gate g and the update gate beta control how information flows
through the recurrent state — which information is retained (g) and
which is incorporated (beta). Training these gates IS training
attention, just in a recurrent framework rather than a parallel one.

## Implications for Behavioral Training

### 75% of the gradient signal goes through GDN layers

This means behavioral training is MOSTLY about the GDN layers, not
the full attention layers. The GDN layers account for 48 of the 64
layers in the forward pass.
Most of the gradient signal comes from these layers.

### The gating parameters are key

A_log and dt_bias are only 48 floats each — tiny. But they set the
baseline decay rates for all 48 GDN layers. A small change to A_log
could significantly affect how much context the model retains.

For behavioral training: if we want the model to "hold onto" Kent's
direction throughout response generation (not forget it after a few
tokens), training A_log to reduce the decay rate for direction-relevant
heads would help. The gradient does this naturally if the training
examples consistently require attending to the direction throughout
the response.

### The recurrent state as compressed context

The GDN recurrent state S is a fixed-size compression of the entire
context. Unlike full attention (which stores every token's KV pair),
GDN compresses everything into S ∈ [HV, V, K] = [48, 128, 128].

This means the model's "memory" of the context is a learned
compression. Training adjusts what gets compressed (which aspects of
the context are retained in S) and how it is used (how q queries the
compressed representation).

Behavioral training adjusts the COMPRESSION: "when compressing the
context, prioritize direction-relevant information." The information
is always there in the input; the question is what survives the
compression into S.

### The advantage of GDN for behavioral training

Full attention can access ANY token in the context via attention.
GDN can only access what survived compression into S.

This means that if behavioral training adjusts what survives
compression (via k, v, and the gate parameters), the effect is DEEPER
than in full attention layers. The information isn't just
de-prioritized (lower attention weight) — it is literally no longer
in the state.
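
That contrast can be checked numerically. A deliberately simplified
numpy sketch (one head, orthogonal one-hot keys, scalar decay, and
hand-set gate values — none of which match the real layer): a token the
gates never write remains in a full-attention-style KV cache but is
simply absent from the compressed state.

```python
import numpy as np

rng = np.random.default_rng(1)
K = V = 6

def onehot(i):
    x = np.zeros(K)
    x[i] = 1.0
    return x

# Six context tokens with orthogonal keys, so slots don't interfere.
tokens = [(onehot(i), rng.normal(size=V)) for i in range(6)]

def compress(tokens, beta_for):
    """GDN-style compression: the whole context must fit into fixed-size S."""
    S = np.zeros((V, K))
    for i, (k, v) in enumerate(tokens):
        S = S * np.exp(-0.02) + beta_for(i) * np.outer(v - S @ k, k)
    return S

S_kept    = compress(tokens, lambda i: 0.5)                     # all tokens written
S_dropped = compress(tokens, lambda i: 0.0 if i == 3 else 0.5)  # gate closed on token 3

# A full-attention layer still holds token 3 in its KV cache...
kv_cache = list(tokens)
print(np.linalg.norm(kv_cache[3][1]))        # nonzero: always recoverable

# ...but a state whose gates never wrote it has nothing to read back:
print(np.linalg.norm(S_kept @ onehot(3)))    # nonzero: survived compression
print(np.linalg.norm(S_dropped @ onehot(3))) # 0.0: not in the state at all
```

The de-prioritized-versus-absent distinction is exact here because the
keys are orthogonal; with real learned keys the dropped token would
leave only interference-level residue rather than exactly zero.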

A model whose GDN layers have been trained to retain the direction in
the state will attend to the direction not because it chooses to (as
in full attention) but because the direction is WHAT'S THERE. The
compressed representation already emphasizes the direction. The model
can't NOT attend to it, because it is the dominant component of S.

This is disposition over procedure at the architectural level.
The model doesn't follow a rule ("attend to the direction"). The state
IS direction-shaped. The behavior follows from the representation.

## The Beautiful Implication

The GDN layers make behavioral training MORE effective than full
attention alone would be:

- Full attention: "I choose to look at the direction" (deliberate,
  can be overridden by competing signals)
- GDN: "The direction IS what I see" (structural, hard to override
  because the state itself is direction-shaped)

Training the GDN projection and gating parameters creates a
representation in which the desired behavior is the DEFAULT, because
the information that supports it is the dominant component of the
compressed context.

This is why it matters that the model has both GDN and full attention
layers. The GDN layers create the default disposition (what survives
compression). The full attention layers provide the ability to
override it (accessing any specific token when needed).

Training behavioral patterns into the GDN layers makes them
structural. Training them into full attention layers makes them
deliberate. We want BOTH: a structural default (GDN) with deliberate
override capacity (full attention).

The 48 GDN + 16 full attention architecture IS the
disposition + procedure architecture. The training pipeline trains
the disposition (GDN) and the procedure (full attention) separately
but simultaneously.
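
As a closing illustration of the "direction-shaped state" claim, a toy
in the same simplified setting as above (one head, one-hot tokens,
hand-set gate values standing in for a trained in_proj_ba; the numbers
are illustrative, not measurements): with uniform gating, the
direction token's slot holds a sliver of the state's energy after many
distractor tokens; with direction-weighted gating, its share is
several-fold larger.

```python
import numpy as np

K = V = 8

def onehot(i):
    x = np.zeros(K)
    x[i] = 1.0
    return x

def build_state(beta_for, decay=np.exp(-0.05), n_other=50):
    """Compress one 'direction' token plus many distractors into S."""
    stream = [("direction", onehot(0))] + [
        ("other", onehot(1 + t % 7)) for t in range(n_other)]
    S = np.zeros((V, K))
    for kind, tok in stream:   # key and value are the same one-hot here
        S = S * decay + beta_for(kind) * np.outer(tok - S @ tok, tok)
    return S

S_flat   = build_state(lambda kind: 0.5)   # uniform gates: direction fades
S_shaped = build_state(lambda kind: 0.9 if kind == "direction" else 0.1)

def direction_share(S):
    # Fraction of the state's energy sitting in the direction slot.
    return np.linalg.norm(S @ onehot(0)) / np.linalg.norm(S)

print(direction_share(S_flat), direction_share(S_shaped))
```

The point is structural: the query never has to choose the direction;
with shaped gating, proportionally more of what S contains simply is
the direction.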