Sparse Kernel Compilation: Napkin Math for Qwen3.5-27B

Architecture recap

| Parameter | Value |
|---|---|
| Layers | 64 (48 linear attention, 16 full attention) |
| Hidden dim (H) | 5,120 |
| Query heads | 24 |
| KV heads | 4 (GQA) |
| Head dim | 256 |
| FFN intermediate | 17,408 |
| Full attention interval | every 4th layer |
| Total params | ~27.5B |

Key discovery: Qwen3.5 already uses sparse attention — 48/64 layers use linear attention (O(N)), only 16 use full O(N^2) attention. The attention sparsity question is partially answered by the architecture itself. The remaining opportunity is weight sparsity in the projection and FFN matrices.

Measured weight sparsity (from B200, all 11 shards)

| Component | Count | Total params | \|w\| < 0.001 | \|w\| < 0.005 | \|w\| < 0.01 | \|w\| < 0.02 |
|---|---|---|---|---|---|---|
| down_proj | 65 | 5,793,382,400 | 7.4% | 35.8% | 64.3% | 92.8% |
| gate_proj | 65 | 5,793,382,400 | 7.2% | 35.0% | 63.3% | 92.3% |
| up_proj | 65 | 5,793,382,400 | 7.3% | 35.2% | 63.5% | 92.5% |
| q_proj | 17 | 1,069,547,520 | 5.7% | 27.9% | 51.9% | 82.8% |
| k_proj | 17 | 89,128,960 | 6.0% | 29.4% | 54.1% | 84.1% |
| v_proj | 17 | 89,128,960 | 4.8% | 23.7% | 44.7% | 74.2% |
| o_proj | 17 | 534,773,760 | 5.3% | 25.9% | 48.7% | 79.6% |
| other | 605 | 6,070,652,272 | 5.4% | 26.4% | 49.6% | 80.9% |

Key findings:

  • FFN layers (gate/up/down) are remarkably sparse: ~64% of weights below 0.01
  • At threshold 0.02, FFN sparsity exceeds 92%
  • Attention projections are less sparse but still significant: roughly 45-54% below 0.01
  • v_proj is the least sparse component (44.7% below 0.01)
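
These fractions come straight from thresholding the checkpoint weights. A minimal sketch of the measurement, assuming the checkpoint sits on disk as safetensors shards (the path and the component-name matching are illustrative, not the actual layout):

```python
# Count, per component, the fraction of weights whose magnitude falls below each threshold.
import glob
from collections import defaultdict

import torch
from safetensors.torch import load_file

THRESHOLDS = [0.001, 0.005, 0.01, 0.02]
COMPONENTS = ["down_proj", "gate_proj", "up_proj", "q_proj", "k_proj", "v_proj", "o_proj"]

# Running counts per component: one bucket per threshold, plus total element count.
totals = defaultdict(lambda: torch.zeros(len(THRESHOLDS) + 1, dtype=torch.long))

for shard in sorted(glob.glob("qwen3.5-27b/*.safetensors")):  # hypothetical local path
    for name, tensor in load_file(shard).items():
        comp = next((c for c in COMPONENTS if c in name), "other")
        w = tensor.abs().float()
        counts = [(w < t).sum().item() for t in THRESHOLDS] + [w.numel()]
        totals[comp] += torch.tensor(counts, dtype=torch.long)

for comp, counts in totals.items():
    n = counts[-1].item()
    fracs = "  ".join(f"<{t}: {c.item() / n:.1%}" for t, c in zip(THRESHOLDS, counts[:-1]))
    print(f"{comp:10s} params={n:,}  {fracs}")
```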

Per-layer parameter breakdown

  • QKV projection: H × (Q_dim + K_dim + V_dim) ≈ 5120 × 13312 ≈ 68M
  • Output projection: ~26M
  • FFN (gate + up + down): 5120 × 17408 × 3 ≈ 267M
  • Total per layer: ~361M
  • 64 layers: ~23.1B (rest is embeddings, norms, etc.)
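
The same arithmetic as a quick sanity check (the per-matrix dimensions are taken from the note above, not re-derived from the checkpoint):

```python
# Napkin check of the per-layer and whole-model parameter counts above.
H = 5120
qkv = H * 13312          # QKV projection width as used in the note (~68M params)
o_proj = 26e6            # output projection, rough figure from the note
ffn = H * 17408 * 3      # gate + up + down (~267M params)

per_layer = qkv + o_proj + ffn
print(f"per layer: {per_layer / 1e6:.0f}M params")       # ~361M
print(f"64 layers: {64 * per_layer / 1e9:.1f}B params")  # ~23.1B; rest is embeddings, norms, etc.
```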

Dense baseline: FLOPs per token

Each weight parameter contributes 2 FLOPs per token (multiply + accumulate).

  • Per layer: ~722M FLOPs/token
  • 64 layers: ~46.2B FLOPs/token

On a B200 (theoretical ~4.5 PFLOPS FP8, ~1.1 PFLOPS BF16):

  • BF16 throughput: 46.2B / 1.1e15 ≈ 0.042ms per token (compute-bound limit)
  • But inference is usually memory-bandwidth-bound for small batch sizes

B200 HBM bandwidth: ~8 TB/s. At BF16 (2 bytes/param):

  • Loading all weights once: 27.5B × 2 = 55GB → 55/8000 ≈ 6.9ms per token (batch=1)
  • This is the real bottleneck. Compute is cheap; loading weights is expensive.

What sparsity buys you

At X% sparsity (X% of the weights are zero), you only need to load the remaining (100 - X)% of the weights:

| Sparsity | Params loaded | Time (batch=1) | Speedup |
|---|---|---|---|
| 0% (dense) | 27.5B | 6.9ms | 1.0x |
| 50% | 13.8B | 3.4ms | 2.0x |
| 75% | 6.9B | 1.7ms | 4.0x |
| 90% | 2.8B | 0.7ms | 10x |

This is the key insight: inference at small batch sizes is memory-bandwidth-bound. Sparse weights = fewer bytes to load = directly proportional speedup, IF the sparse kernel can avoid the gather/scatter overhead.
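
The whole table is one formula: bytes of weights loaded divided by HBM bandwidth. A sketch using the B200 figures assumed above:

```python
# Bandwidth-bound napkin math: time per token = bytes of weights loaded / HBM bandwidth.
PARAMS = 27.5e9       # total parameters
BYTES = 2             # BF16
HBM_BW = 8e12         # ~8 TB/s on B200, as assumed above

dense_t = PARAMS * BYTES / HBM_BW
for sparsity in [0.0, 0.5, 0.75, 0.9]:
    loaded = PARAMS * (1 - sparsity)
    t = loaded * BYTES / HBM_BW
    print(f"{sparsity:4.0%} sparse: {loaded / 1e9:5.1f}B params loaded, "
          f"{t * 1e3:.1f} ms/token, {dense_t / t:.1f}x")
```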

The compilation problem

Dense GEMM is fast because:

  1. Weights are contiguous in memory → sequential reads
  2. Tiling fits perfectly in SRAM → high reuse
  3. Hardware tensor cores expect dense blocks

Naive sparse matmul kills this:

  • Irregular memory access → low bandwidth utilization
  • Poor SRAM tiling → cache thrashing
  • Tensor cores can't help

The FlashAttention analogy

FlashAttention's insight: the N×N attention matrix doesn't fit in SRAM, but you can tile it so each tile does. You recompute instead of materializing.

Sparse kernel compilation insight: the sparsity pattern is known at compile time. A compiler can:

  1. Analyze the sparsity graph of each weight matrix
  2. Find blocks of non-zero weights that are close in memory
  3. Generate a tiling schedule that loads these blocks into SRAM efficiently
  4. Emit fused kernels where the memory access pattern is baked in as constants

The resulting kernel looks like a dense kernel to the hardware — sequential reads, high SRAM reuse, maybe even tensor core compatible (if the compiler finds dense sub-blocks within the sparse matrix).
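
A toy illustration of the compile-then-execute idea (not a real compiler, and NumPy rather than Triton): scan the weight matrix once for non-empty tiles, then run a matvec that touches only those tiles. A real version would bake the tile list into an emitted Triton kernel as constants; this only shows the structure.

```python
# Toy "compile then execute" sketch for an unstructured-sparse weight matrix.
import numpy as np

TILE = 16

def compile_blocks(W: np.ndarray):
    """One-time analysis of the sparsity pattern: keep only tiles with any non-zero weight."""
    rows, cols = W.shape
    blocks = []
    for i in range(0, rows, TILE):
        for j in range(0, cols, TILE):
            tile = W[i:i + TILE, j:j + TILE]
            if np.any(tile != 0):
                blocks.append((i, j, tile.copy()))  # contiguous copy, dense inside the tile
    return blocks

def sparse_matvec(blocks, x: np.ndarray, out_dim: int) -> np.ndarray:
    """Executes only the tiles the 'compiler' kept; each tile is a small dense GEMV."""
    y = np.zeros(out_dim, dtype=x.dtype)
    for i, j, tile in blocks:
        y[i:i + tile.shape[0]] += tile @ x[j:j + tile.shape[1]]
    return y

# Hypothetical example: a 512x512 matrix with ~90% of its 16x16 tiles zeroed out.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)
mask = rng.random((512 // TILE, 512 // TILE)) < 0.1              # keep ~10% of tiles
W *= np.kron(mask, np.ones((TILE, TILE))).astype(np.float32)

blocks = compile_blocks(W)
x = rng.standard_normal(512).astype(np.float32)
assert np.allclose(sparse_matvec(blocks, x, 512), W @ x, atol=1e-3)
print(f"kept {len(blocks)} of {(512 // TILE) ** 2} tiles")
```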

Block sparsity vs unstructured sparsity

Block sparse (e.g., 4×4 or 16×16 blocks zeroed out):

  • GPU-friendly: blocks map to tensor core operations
  • Less flexible: coarser pruning granularity → less achievable sparsity
  • NVIDIA's 2:4 structured sparsity gets ~50% sparsity with tensor core support
  • Real-world: typically 50-70% sparsity achievable without quality loss

Unstructured sparse (individual weights zeroed):

  • Maximally flexible: fine-grained pruning → higher achievable sparsity
  • GPU-hostile: the gather/scatter problem
  • Real-world: 80-95% sparsity achievable in many layers without quality loss

The compiled kernel approach bridges this: take unstructured sparsity (maximally flexible, high compression) and compile it into a kernel that runs as efficiently as block-sparse. Best of both worlds.
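
The gap between the two regimes is easy to measure on any pruned matrix: element-level sparsity versus the fraction of tiles that are entirely zero. A sketch (the random matrix below stands in for a real pruned weight):

```python
# Compare unstructured sparsity to what a block-sparse format could actually exploit.
import numpy as np

def sparsity_report(W: np.ndarray, tile: int = 16) -> None:
    elem = np.mean(W == 0)                                    # unstructured (element-level) sparsity
    rows, cols = W.shape
    r, c = rows // tile, cols // tile
    tiles = W[:r * tile, :c * tile].reshape(r, tile, c, tile)
    tile_zero = np.mean(~tiles.any(axis=(1, 3)))              # fraction of all-zero tiles
    print(f"element sparsity: {elem:.1%}, all-zero {tile}x{tile} tiles: {tile_zero:.1%}")

# Hypothetical example: 80% unstructured sparsity scattered at random.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)) * (rng.random((1024, 1024)) > 0.8)
sparsity_report(W)   # ~80% element sparsity, but essentially no tile is fully zero
```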

Recurrent depth composability

From our April 10 discussion: middle transformer layers are doing open-coded simulated annealing — similar weights, similar computation.

If layers 8-23 (16 layers) have pairwise cosine similarity > 0.95:

  • Replace 16 layers with 1 layer × 16 iterations
  • Parameter reduction: 16 × 361M = 5.8B → 361M (16x reduction for those layers)
  • Memory bandwidth: load one layer's weights, iterate in SRAM
  • Combined with 50% sparsity on the remaining unique layers:
    • Unique layers (48): 48 × 361M × 0.5 = 8.7B params
    • Recurrent layer: 361M × 0.5 = 180M params (but iterated 16x in SRAM)
    • Total loaded per token: ~8.9B × 2 bytes = 17.8GB
    • Time: 17.8/8000 ≈ 2.2ms per token (vs 6.9ms dense) — 3.1x speedup

With higher sparsity (75%) + recurrence:

  • Unique layers: 48 × 361M × 0.25 = 4.3B
  • Recurrent: 90M
  • Total: ~4.4B params × 2 bytes = 8.8GB → 1.1ms per token (6.3x speedup)
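
Structurally, "replace 16 layers with 1 layer × 16 iterations" is just a shared block applied in a loop. A toy torch sketch (the Sequential block is a stand-in for the real transformer layer, not the actual architecture):

```python
# Toy sketch of recurrent depth: one shared block applied N times in place of N distinct layers.
import torch
import torch.nn as nn

class RecurrentMiddle(nn.Module):
    def __init__(self, layer: nn.Module, iterations: int = 16):
        super().__init__()
        self.layer = layer            # one set of weights, loaded from HBM once
        self.iterations = iterations  # re-applied while activations stay on-chip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.iterations):
            x = self.layer(x)
        return x

# Hypothetical stand-in for a transformer block, just to show the shape of the idea.
block = nn.Sequential(nn.Linear(5120, 5120), nn.GELU(), nn.Linear(5120, 5120))
middle = RecurrentMiddle(block, iterations=16)
out = middle(torch.randn(1, 5120))
```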

What needs to happen

Phase 1: Measure (can do now with B200 access)

  1. Extract all weight matrices from Qwen3.5-27B
  2. For each matrix, compute:
    • Magnitude distribution (what % of weights are near-zero?)
    • Achievable sparsity at various thresholds (L1 magnitude pruning)
    • Dense sub-block statistics (how many 4×4, 16×16 blocks are all-zero?)
  3. Layer similarity: pairwise cosine similarity of weight matrices across layers
    • Which layers are nearly identical? (recurrence candidates)
  4. Validate quality: run perplexity eval at various sparsity levels
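
The layer-similarity measurement in step 3 is a small computation once the per-layer weights are flattened. A sketch, assuming layer_weights is a list of same-shaped tensors, one per layer (the random tensors are placeholders):

```python
# Pairwise cosine similarity between layers, to spot recurrence candidates.
import torch

def layer_similarity(layer_weights: list[torch.Tensor]) -> torch.Tensor:
    flat = torch.stack([w.flatten().float() for w in layer_weights])  # (L, P)
    flat = torch.nn.functional.normalize(flat, dim=1)
    return flat @ flat.T                                              # (L, L) cosine matrix

# Candidates: off-diagonal pairs above a similarity threshold, e.g. 0.95.
sim = layer_similarity([torch.randn(256, 256) for _ in range(8)])     # hypothetical input
pairs = [(i, j) for i in range(len(sim)) for j in range(i + 1, len(sim)) if sim[i, j] > 0.95]
print(pairs)
```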

Phase 2: Compile (research project)

  1. For a single sparse weight matrix, generate an optimized Triton kernel
  2. Benchmark vs dense GEMM and vs NVIDIA's 2:4 sparse
  3. Iterate on the tiling strategy
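
The benchmark in step 2 can be plain CUDA-event timing; the sparse kernel itself is the research part. A sketch where sparse_kernel is a placeholder for whatever the compiler emits:

```python
# Latency harness: dense GEMM vs. a candidate sparse kernel, CUDA-event timed.
import torch

def bench(fn, *args, iters: int = 100) -> float:
    for _ in range(10):   # warm-up
        fn(*args)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters   # ms per call

W = torch.randn(17408, 5120, device="cuda", dtype=torch.bfloat16)
x = torch.randn(5120, 1, device="cuda", dtype=torch.bfloat16)
print("dense :", bench(torch.matmul, W, x), "ms")
# print("sparse:", bench(sparse_kernel, compiled_W, x), "ms")   # hypothetical compiled kernel
```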

Phase 3: End-to-end

  1. Full model with compiled sparse kernels
  2. Perplexity + latency benchmarks
  3. Compare: dense, 2:4 structured, compiled unstructured

Prior art

  • SparseGPT (Frantar & Alistarh, 2023): one-shot pruning to 50-60% unstructured sparsity with minimal quality loss. Key result: large models are more prunable than small ones.
  • Wanda (Sun et al., 2023): pruning by weight magnitude × input activation. Simpler than SparseGPT, comparable results.
  • NVIDIA 2:4 sparsity: hardware-supported structured sparsity on Ampere+. 50% sparsity, ~2x speedup on tensor cores. The existence proof that sparse can be fast.
  • Triton (Tillet et al.): Python DSL for GPU kernel generation. The right compilation target — can express arbitrary tiling strategies.
  • TACO (Kjolstad et al.): tensor algebra compiler. Generates kernels for specific sparse tensor formats. Academic but the ideas are right.
  • FlashAttention (Dao et al.): the tiling strategy to learn from.
  • DejaVu (Liu et al., 2023): contextual sparsity — predicting which neurons to activate per input. Dynamic sparsity, complementary to weight sparsity.

The bigger picture

Current state of the art: dense models with FlashAttention for the N×N attention part. Weight sparsity is known to work (SparseGPT, Wanda) but isn't deployed because the GPU kernels don't exist to exploit it efficiently.

The gap: nobody has built a compiler that takes a specific sparse weight matrix and emits a kernel optimized for that exact pattern. FlashAttention proved that custom kernels for specific computational patterns beat general-purpose ones. The same should hold for sparse weight patterns.

The bet: a compiled sparse kernel for Qwen3.5-27B's actual sparsity pattern would be within 80% of the theoretical bandwidth-bound speedup. If true, 50% sparsity → 1.6x real speedup, 75% → 3.2x, composing with recurrent depth for potentially 5-6x total.

That would make 27B inference as fast as a 5B dense model, with 27B quality.