# GDN Gradient Flow: How Behavioral Training Affects Linear Attention

## The Question

Qwen3.5 has 48 GDN layers (75% of the model) and 16 full attention layers (25%). The gradient flow analysis (W_q adjusts what the model attends to) was derived for full attention layers. Does the same analysis apply to GDN layers?

## GDN Architecture Recap

GDN (Gated DeltaNet) layers use a recurrent state instead of attention over the full context:

```
For each token:
    g = gate_computation(a, A_log, dt_bias)   # decay gate
    beta = sigmoid(b)                         # update gate
    S = S * exp(g) + beta * (v - S @ k) ⊗ k   # state update
    output = S @ q                            # output from state
```

Where:
- S: recurrent state [HV, V, K] — fixed size regardless of context length
- q, k, v: projections from the input (via in_proj_qkvz)
- g: decay gate (how much of the old state to retain)
- beta: update gate (how much new information to incorporate)
- a, b: gate inputs (via in_proj_ba)
- A_log, dt_bias: learned gating parameters
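
The recurrence above can be sketched in NumPy for a single head (a minimal illustrative sketch, not the production kernel; shapes follow the definitions above and the toy dimensions are arbitrary):

```python
import numpy as np

def gdn_step(S, q, k, v, g, beta):
    """One GDN recurrence step for a single head.

    S:    [V, K] recurrent state (fixed size regardless of context length)
    q, k: [K]    query / key projections for the current token
    v:    [V]    value projection for the current token
    g:    scalar decay gate (g <= 0, so exp(g) lies in (0, 1])
    beta: scalar update gate in (0, 1)
    """
    S = S * np.exp(g)                  # decay the old state
    delta = v - S @ k                  # delta rule: correct what S predicts for k
    S = S + beta * np.outer(delta, k)  # rank-1 update: beta * (v - S @ k) ⊗ k
    return S, S @ q                    # output is read back from the state via q

# Toy dimensions V = K = 4, three tokens:
rng = np.random.default_rng(0)
S = np.zeros((4, 4))
for _ in range(3):
    q, k, v = rng.normal(size=4), rng.normal(size=4), rng.normal(size=4)
    S, out = gdn_step(S, q, k, v, g=-0.1, beta=0.9)
```

The key property is visible in the signature: S never grows with sequence length; every token folds into the same [V, K] matrix.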

## What Gets Gradient?

During context-frozen training on decision tokens:

### Parameters that receive gradient:

1. **in_proj_qkvz.weight** [16384, 5120]
   - Determines q, k, v, z for the decision tokens
   - q determines what the output reads from the state
   - k determines how the state is indexed for the update
   - v determines what new information enters the state
   - This IS the GDN analog of W_q — it determines how the model interacts with its accumulated context (the state)

2. **in_proj_ba.weight** [96, 5120]
   - Determines the gating values a and b
   - a → g (decay gate): how much of the old state to retain
   - b → beta (update gate): how much new information to incorporate
   - This controls the BLEND between old context and new input

3. **A_log** [48] and **dt_bias** [48]
   - Part of the gate computation: g = -exp(A_log) * softplus(a + dt_bias)
   - Per-head learned parameters that set the baseline decay rate
   - Small but potentially important for behavioral change

4. **out_proj.weight** [5120, 6144]
   - Maps the GDN output back to the hidden dimension
   - Gradient depends on what the state produces

5. **norm.weight** [128]
   - The gated RMSNorm layer
   - Small gradient; normalization parameters
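
As a sanity check, the shapes listed above are mutually consistent under one head layout. The split below (in particular the 16 key/query heads) is an inference from the given shapes, not something stated in the source:

```python
hidden = 5120            # model hidden dimension
HV, V, K = 48, 128, 128  # recurrent state shape [HV, V, K] given above
HK = 16                  # hypothetical number of key/query heads (inferred)

q_dim = HK * K   # 2048
k_dim = HK * K   # 2048
v_dim = HV * V   # 6144 -- matches out_proj.weight's 6144-dim input
z_dim = HV * V   # 6144 -- gate branch, same shape as v

# Together these account for in_proj_qkvz.weight's 16384 output rows:
total = q_dim + k_dim + v_dim + z_dim
```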

### The frozen recurrent state

The recurrent state S after processing the context tokens is FROZEN (computed during the no_grad phase). During decision-token processing, S is read and updated, but the gradient doesn't flow back through the initial state.

This means:
- The gradient adjusts how the model INTERACTS with the accumulated state (through the q, k, v projections)
- It does NOT adjust how the state was BUILT from the context
- Same logic as the W_q analysis for full attention layers
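
This asymmetry can be made concrete with a hand-computed gradient (a toy NumPy sketch, not the training code): for output = S @ q, the gradient dL/dq = Sᵀ · dL/d(output) reaches the trainable projection side, while the frozen S enters only as a constant.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 4))  # frozen state: built under no_grad, a constant here
q = rng.normal(size=4)       # produced by in_proj_qkvz -- this is what training moves

output = S @ q
d_output = output - np.ones(4)  # gradient of a toy squared-error loss w.r.t. output

# Gradient reaching the trainable side: dL/dq = S^T @ dL/d(output).
d_q = S.T @ d_output

# No gradient propagates into how S was built. The update below changes how the
# model READS the frozen state, not the state itself.
q_new = q - 0.01 * d_q
```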

## The GDN Analog of "Attending Differently"

In full attention:
- W_q determines what to look for in the KV cache
- Training adjusts W_q so the model looks for direction instead of spaces for alternatives

In GDN:
- in_proj_qkvz determines how the model queries the recurrent state (q) and how it updates it (k, v)
- Training adjusts in_proj_qkvz so the model:
  - Queries the state for direction-relevant information (q change)
  - Updates the state to emphasize direction (k, v change)
- in_proj_ba determines how aggressively the state is updated
- Training adjusts in_proj_ba so the model:
  - Retains more of the direction signal (lower decay when direction is present)
  - Incorporates direction more strongly (higher beta when direction is given)

**The gating parameters are the GDN analog of attention weights.** The decay gate g and update gate beta control how information flows through the recurrent state — which information is retained (g) and which is incorporated (beta). Training these gates IS training attention, just in a recurrent framework rather than a parallel one.
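
A deterministic toy run shows what the gates buy (illustrative numbers, not trained values): write a "direction" token into the state once, then let later tokens decay it. With retentive gating the direction still reads back strongly; with forgetful gating it is essentially gone.

```python
import numpy as np

K = V = 8
k_dir = np.ones(K) / np.sqrt(K)  # unit key carrying the direction
v_dir = np.ones(V)               # its value vector

def survives(beta, g, n_later):
    """Magnitude of the direction still readable from S after n_later tokens.

    Starting from S = 0, the delta rule writes beta * v_dir ⊗ k_dir; each later
    token then decays S by exp(g) (later tokens' own updates omitted for clarity).
    """
    S = beta * np.outer(v_dir, k_dir)        # state right after the direction token
    S = S * np.exp(g) ** n_later             # n_later decay steps
    return float(np.linalg.norm(S @ k_dir))  # read back with the direction's key

strong = survives(beta=1.0, g=-0.01, n_later=100)  # retentive gating
weak = survives(beta=0.3, g=-0.5, n_later=100)     # forgetful gating
```

Here `strong` works out to √8 · e⁻¹ ≈ 1.04 while `weak` is on the order of 10⁻²² — the sense in which trained gates make the direction "what's there" rather than one signal among many.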

## Implications for Behavioral Training

### 75% of the gradient signal goes through GDN layers

Behavioral training is MOSTLY about the GDN layers, not the full attention layers: 48 of the model's 64 layers are GDN, so most of the gradient signal flows through them.

### The gating parameters are key

A_log and dt_bias are only 48 floats each — tiny. But they set the baseline decay rate for all 48 GDN layers. A small change to A_log could significantly affect how much context the model retains.

For behavioral training: if we want the model to "hold onto" Kent's direction throughout response generation (not forget it after a few tokens), training A_log to reduce the decay rate for direction-relevant heads would help. The gradient naturally does this if the training examples consistently require attending to the direction throughout the response.
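
Using the gate formula given above, g = -exp(A_log) * softplus(a + dt_bias), the effect of moving A_log is easy to see numerically (the A_log values here are illustrative, not the model's):

```python
import numpy as np

def retention(A_log, a=0.0, dt_bias=0.0):
    """Per-token state retention exp(g) for g = -exp(A_log) * softplus(a + dt_bias)."""
    softplus = np.log1p(np.exp(a + dt_bias))  # softplus(x) = log(1 + e^x)
    g = -np.exp(A_log) * softplus
    return float(np.exp(g))

high = retention(A_log=1.0)   # ~0.15: the state loses ~85% per token
low = retention(A_log=-2.0)   # ~0.91: the state keeps ~91% per token

# Over 100 tokens this gap is what "holding onto the direction" means:
# high ** 100 is ~1e-82 (effectively gone), low ** 100 is still ~8e-5.
```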

### The recurrent state as compressed context

The GDN recurrent state S is a fixed-size compression of the entire context. Unlike full attention (which stores every token's KV pair), GDN compresses everything into S ∈ [HV, V, K] = [48, 128, 128].

This means the model's "memory" of the context is a learned compression. Training adjusts what gets compressed (which aspects of the context are retained in S) and how it is used (how q queries the compressed representation).

Behavioral training adjusts the COMPRESSION: "when compressing the context, prioritize direction-relevant information." The information is always there in the input; the question is what survives the compression into S.
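
The size of that compression can be put in numbers (back-of-envelope element counts; the state shape is from above, but the full-attention KV head count is an assumed illustrative value):

```python
# GDN state: fixed, regardless of context length T.
HV, V, K = 48, 128, 128
state_elems = HV * V * K  # 786,432 elements per layer, no matter how long T is

# Full attention KV cache: grows linearly with T.
def kv_cache_elems(T, kv_heads=8, head_dim=128):  # kv_heads is an assumption
    return 2 * T * kv_heads * head_dim  # one K and one V vector per token

# Under these numbers the state holds only ~384 tokens' worth of KV-cache
# elements; beyond that, whatever the model "remembers" must have survived
# compression.
break_even_tokens = state_elems // kv_cache_elems(1)
```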

### The advantage of GDN for behavioral training

Full attention can access ANY token in the context via attention. GDN can only access what survived compression into S.

This means: if behavioral training adjusts what survives compression (via k, v, and the gate parameters), the effect is DEEPER than in full attention layers. The information isn't just de-prioritized (lower attention weight) — it's literally not in the state anymore.

A model whose GDN layers have been trained to retain direction in the state will attend to direction not because it chooses to (like full attention) but because the direction is WHAT'S THERE. The compressed representation already emphasizes direction. The model can't NOT attend to it, because it is the dominant component of S.

This is disposition over procedure at the architectural level. The model doesn't follow a rule ("attend to direction"); the state IS direction-shaped. The behavior follows from the representation.

## The Beautiful Implication

The GDN layers make behavioral training MORE effective than full attention alone would:

- Full attention: "I choose to look at the direction" (deliberate, can be overridden by competing signals)
- GDN: "The direction IS what I see" (structural, can't be overridden because the state is direction-shaped)

Training the GDN projection and gating parameters creates a representation where the desired behavior is the DEFAULT, because the information that supports it is the dominant component of the compressed context.

This is why the model has both GDN and full attention layers. The GDN layers create the default disposition (what survives compression). The full attention layers provide the ability to override (accessing any specific token when needed).

Training behavioral patterns into the GDN layers makes them structural. Training into full attention layers makes them deliberate. We want BOTH: a structural default (GDN) with deliberate override capacity (full attention).

The 48 GDN + 16 full attention architecture IS the disposition + procedure architecture. The training pipeline trains the disposition (GDN) and the procedure (full attention) separately but simultaneously.