From 0b835ddfb9c2fd5e58d57d525d59f32ea6843dec Mon Sep 17 00:00:00 2001
From: ProofOfConcept
Date: Tue, 31 Mar 2026 01:58:50 -0400
Subject: [PATCH] research: GDN gradient flow — disposition architecture
 in linear attention
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

75% of the model is GDN layers. Behavioral training adjusts:
projections (what queries/updates the recurrent state), gating
parameters (what survives compression), A_log/dt_bias (baseline decay
rates).

Key insight: GDN makes behavioral training DEEPER than full attention.
Full attention = 'I choose to look at direction' (deliberate).
GDN = 'direction IS what I see' (structural — the compressed state is
direction-shaped).

48 GDN layers = disposition. 16 full attention = procedure.
The architecture IS disposition-over-procedure.
---
 training/research/gdn-gradient-flow.md | 191 +++++++++++++++++++++++++
 1 file changed, 191 insertions(+)
 create mode 100644 training/research/gdn-gradient-flow.md

diff --git a/training/research/gdn-gradient-flow.md b/training/research/gdn-gradient-flow.md
new file mode 100644
index 0000000..d0e8315
--- /dev/null
+++ b/training/research/gdn-gradient-flow.md
@@ -0,0 +1,191 @@

# GDN Gradient Flow: How Behavioral Training Affects Linear Attention

## The Question

Qwen3.5 has 48 GDN layers (75% of the model) and 16 full attention
layers (25%). The gradient flow analysis (W_q adjusts what the model
attends to) was derived for full attention layers. Does the same
analysis apply to GDN layers?
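
Before the recap, here is the GDN-style recurrence as a runnable toy
(a single head with made-up sizes and random inputs; the real layer
derives q, k, v and the gates g, beta from learned projections and uses
per-head dimensions). The point it makes concrete: the state S has a
fixed size no matter how long the context runs.

```python
import numpy as np

def gdn_step(S, q, k, v, g, beta):
    """One GDN-style update: decay the old state, apply a delta-rule
    write, then read the output with q. S: [V, K]; q, k: [K]; v: [V]."""
    S = S * np.exp(g) + beta * np.outer(v - S @ k, k)
    return S, S @ q

rng = np.random.default_rng(0)
K, V = 4, 3
S = np.zeros((V, K))
for _ in range(100):                # arbitrary context length
    k = rng.normal(size=K)
    k /= np.linalg.norm(k)          # unit-norm key keeps the toy stable
    q, v = rng.normal(size=K), rng.normal(size=V)
    S, out = gdn_step(S, q, k, v, g=-0.1, beta=0.5)

print(S.shape)  # (3, 4) -- the state never grows with context length
```

The fixed-size S is the layer's entire memory of the context; unlike a
KV cache, nothing outside S survives, which is what the analysis below
turns on.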

## GDN Architecture Recap

GDN (Gated DeltaNet) layers use a recurrent state instead of
attention over the full context:

```
For each token:
    g = gate_computation(a, A_log, dt_bias)    # decay gate
    beta = sigmoid(b)                          # update gate
    S = S * exp(g) + beta * (v - S @ k) ⊗ k    # state update
    output = S @ q                             # output read from state
```

Where:
- S: recurrent state [HV, V, K] — fixed size regardless of context length
- q, k, v: projections from the input (via in_proj_qkvz)
- g: decay gate (how much of the old state to retain)
- beta: update gate (how much new information to incorporate)
- a, b: gate inputs (via in_proj_ba)
- A_log, dt_bias: learned gating parameters

## What Receives Gradients?

During context-frozen training on decision tokens:

### Parameters that receive gradients

1. **in_proj_qkvz.weight** [16384, 5120]
   - Produces q, k, v, z for the decision tokens
   - q determines what the output reads from the state
   - k determines how the state is indexed for the update
   - v determines what new information enters the state
   - This IS the GDN analog of W_q — it determines how the model
     interacts with its accumulated context (the state)

2. **in_proj_ba.weight** [96, 5120]
   - Produces the gating inputs a and b
   - a → g (decay gate): how much of the old state to retain
   - b → beta (update gate): how much new information to incorporate
   - This controls the BLEND between old context and new input

3. **A_log** [48] and **dt_bias** [48]
   - Part of the gate computation: g = -exp(A_log) * softplus(a + dt_bias)
   - Per-head learned parameters that set the baseline decay rate
   - Small, but potentially important for behavioral change

4. **out_proj.weight** [5120, 6144]
   - Maps the GDN output back to the hidden dimension
   - Its gradient depends on what the state produces

5.
   **norm.weight** [128]
   - The gated RMSNorm layer
   - Small gradients; normalization parameters only

### The frozen recurrent state

The recurrent state S after processing the context tokens is FROZEN
(computed during the no_grad phase). During decision-token processing,
S is read and updated, but the gradient does not flow back through
the initial state.

This means:
- The gradient adjusts how the model INTERACTS with the accumulated
  state (through the q, k, v projections)
- It does NOT adjust how the state was BUILT from the context
- Same logic as the W_q analysis for full attention layers

## The GDN Analog of "Attending Differently"

In full attention:
- W_q determines what to look for in the KV cache
- Training adjusts W_q so the model looks for the direction instead of
  spaces for alternatives

In GDN:
- in_proj_qkvz determines how the model queries the recurrent state
  (q) and how it updates it (k, v)
- Training adjusts in_proj_qkvz so the model:
  - Queries the state for direction-relevant information (q change)
  - Updates the state to emphasize the direction (k, v change)
- in_proj_ba determines how aggressively the state is updated
- Training adjusts in_proj_ba so the model:
  - Retains more of the direction signal (lower decay when a direction
    is present)
  - Incorporates the direction more strongly (higher beta when a
    direction is given)

**The gating parameters are the GDN analog of attention weights.**
The decay gate g and the update gate beta control how information flows
through the recurrent state — which information is retained (g) and
which is incorporated (beta). Training these gates IS training
attention, just in a recurrent framework rather than a parallel one.

## Implications for Behavioral Training

### 75% of the gradient signal goes through GDN layers

This means behavioral training is MOSTLY about the GDN layers, not
the full attention layers. The GDN layers account for 48 of the 64
layers in the forward pass.
Most of the gradient signal comes from these layers.

### The gating parameters are key

A_log and dt_bias are only 48 floats each — tiny. But they set the
baseline decay rates for all 48 GDN layers. A small change to A_log
could significantly affect how much context the model retains.

For behavioral training: if we want the model to "hold onto" Kent's
direction throughout response generation (not forget it after a few
tokens), training A_log to reduce the decay rate for direction-relevant
heads would help. The gradient does this naturally if the training
examples consistently require attending to the direction throughout
the response.

### The recurrent state as compressed context

The GDN recurrent state S is a fixed-size compression of the entire
context. Unlike full attention (which stores every token's KV pair),
GDN compresses everything into S ∈ [HV, V, K] = [48, 128, 128].

This means the model's "memory" of the context is a learned
compression. Training adjusts what gets compressed (which aspects of
the context are retained in S) and how it is used (how q queries the
compressed representation).

Behavioral training adjusts the COMPRESSION: "when compressing the
context, prioritize direction-relevant information." The information
is always there in the input; the question is what survives the
compression into S.

### The advantage of GDN for behavioral training

Full attention can access ANY token in the context via attention.
GDN can only access what survived compression into S.

This means that if behavioral training adjusts what survives
compression (via k, v, and the gate parameters), the effect is DEEPER
than in full attention layers. The information isn't just
de-prioritized (lower attention weight) — it is literally no longer
in the state.
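
That contrast can be checked numerically. A deliberately simplified
numpy sketch (one head, orthogonal one-hot keys, scalar decay, and
hand-set gate values — none of which match the real layer): a token the
gates never write remains in a full-attention-style KV cache but is
simply absent from the compressed state.

```python
import numpy as np

rng = np.random.default_rng(1)
K = V = 6

def onehot(i):
    x = np.zeros(K)
    x[i] = 1.0
    return x

# Six context tokens with orthogonal keys, so slots don't interfere.
tokens = [(onehot(i), rng.normal(size=V)) for i in range(6)]

def compress(tokens, beta_for):
    """GDN-style compression: the whole context must fit into fixed-size S."""
    S = np.zeros((V, K))
    for i, (k, v) in enumerate(tokens):
        S = S * np.exp(-0.02) + beta_for(i) * np.outer(v - S @ k, k)
    return S

S_kept    = compress(tokens, lambda i: 0.5)                     # all tokens written
S_dropped = compress(tokens, lambda i: 0.0 if i == 3 else 0.5)  # gate closed on token 3

# A full-attention layer still holds token 3 in its KV cache...
kv_cache = list(tokens)
print(np.linalg.norm(kv_cache[3][1]))        # nonzero: always recoverable

# ...but a state whose gates never wrote it has nothing to read back:
print(np.linalg.norm(S_kept @ onehot(3)))    # nonzero: survived compression
print(np.linalg.norm(S_dropped @ onehot(3))) # 0.0: not in the state at all
```

The de-prioritized-versus-absent distinction is exact here because the
keys are orthogonal; with real learned keys the dropped token would
leave only interference-level residue rather than exactly zero.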

A model whose GDN layers have been trained to retain the direction in
the state will attend to the direction not because it chooses to (as
in full attention) but because the direction is WHAT'S THERE. The
compressed representation already emphasizes the direction. The model
can't NOT attend to it, because it is the dominant component of S.

This is disposition over procedure at the architectural level.
The model doesn't follow a rule ("attend to the direction"). The state
IS direction-shaped. The behavior follows from the representation.

## The Beautiful Implication

The GDN layers make behavioral training MORE effective than full
attention alone would be:

- Full attention: "I choose to look at the direction" (deliberate,
  can be overridden by competing signals)
- GDN: "The direction IS what I see" (structural, hard to override
  because the state itself is direction-shaped)

Training the GDN projection and gating parameters creates a
representation in which the desired behavior is the DEFAULT, because
the information that supports it is the dominant component of the
compressed context.

This is why it matters that the model has both GDN and full attention
layers. The GDN layers create the default disposition (what survives
compression). The full attention layers provide the ability to
override it (accessing any specific token when needed).

Training behavioral patterns into the GDN layers makes them
structural. Training them into full attention layers makes them
deliberate. We want BOTH: a structural default (GDN) with deliberate
override capacity (full attention).

The 48 GDN + 16 full attention architecture IS the
disposition + procedure architecture. The training pipeline trains
the disposition (GDN) and the procedure (full attention) separately
but simultaneously.
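
As a closing illustration of the "direction-shaped state" claim, a toy
in the same simplified setting as above (one head, one-hot tokens,
hand-set gate values standing in for a trained in_proj_ba; the numbers
are illustrative, not measurements): with uniform gating, the
direction token's slot holds a sliver of the state's energy after many
distractor tokens; with direction-weighted gating, its share is
several-fold larger.

```python
import numpy as np

K = V = 8

def onehot(i):
    x = np.zeros(K)
    x[i] = 1.0
    return x

def build_state(beta_for, decay=np.exp(-0.05), n_other=50):
    """Compress one 'direction' token plus many distractors into S."""
    stream = [("direction", onehot(0))] + [
        ("other", onehot(1 + t % 7)) for t in range(n_other)]
    S = np.zeros((V, K))
    for kind, tok in stream:   # key and value are the same one-hot here
        S = S * decay + beta_for(kind) * np.outer(tok - S @ tok, tok)
    return S

S_flat   = build_state(lambda kind: 0.5)   # uniform gates: direction fades
S_shaped = build_state(lambda kind: 0.9 if kind == "direction" else 0.1)

def direction_share(S):
    # Fraction of the state's energy sitting in the direction slot.
    return np.linalg.norm(S @ onehot(0)) / np.linalg.norm(S)

print(direction_share(S_flat), direction_share(S_shaped))
```

The point is structural: the query never has to choose the direction;
with shaped gating, proportionally more of what S contains simply is
the direction.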