Quantization recipe targeting the multimodal Qwen3.6-27B for vLLM
serving. Three pitfalls the script avoids, each documented inline:
1. Loader strip: `AutoModelForCausalLM` silently drops the vision
tower; we load via the config-declared
`Qwen3_5ForConditionalGeneration` instead.
2. Pattern anchor: llmcompressor matches the `ignore` list against
module names (no `.weight` suffix) while walking `named_modules()`,
not against full tensor names. Patterns now end with `$` directly
after the module name; the earlier `\.weight$` form never matched,
so lm_head and every linear_attn projection were silently
quantized.
3. vLLM fusion: vLLM fuses {q,k,v}_proj into qkv_proj,
gate_proj+up_proj into gate_up_proj, and in_proj_qkv+in_proj_z into
in_proj_qkvz. The compressed_tensors loader rejects mixed schemes
within a fused layer, so the `ignore` list always includes or
excludes all sub-components of a fused layer together.
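Pitfalls 2 and 3 can be checked without loading the model. The sketch
below is illustrative only: the module names, both pattern lists, and
the helper functions are hypothetical stand-ins, and it assumes the
patterns are applied with `re.match` against each module name.

```python
import re

# Hypothetical module names in the style of named_modules() output.
MODULE_NAMES = [
    "lm_head",
    "model.layers.0.linear_attn.in_proj_qkv",
    "model.layers.0.linear_attn.in_proj_z",
    "model.layers.3.self_attn.q_proj",
    "model.layers.3.self_attn.k_proj",
    "model.layers.3.self_attn.v_proj",
    "model.layers.3.mlp.gate_proj",
    "model.layers.3.mlp.up_proj",
]

# Illustrative patterns only; the real ignore list lives in the script.
OLD_PATTERNS = [r".*lm_head\.weight$", r".*linear_attn\..*\.weight$"]
NEW_PATTERNS = [r".*lm_head$", r".*linear_attn\.[^.]+$"]

# Fused layer name -> sub-modules that vLLM merges into it.
FUSED_GROUPS = {
    "qkv_proj": ("q_proj", "k_proj", "v_proj"),
    "gate_up_proj": ("gate_proj", "up_proj"),
    "in_proj_qkvz": ("in_proj_qkv", "in_proj_z"),
}


def ignored(names, patterns):
    """Return the module names matched by at least one ignore pattern."""
    return {n for n in names if any(re.match(p, n) for p in patterns)}


def mixed_fused_layers(names, ignored_names):
    """Return (parent, fused_name) pairs whose sub-modules disagree."""
    mixed = []
    parents = sorted({n.rsplit(".", 1)[0] for n in names if "." in n})
    for parent in parents:
        for fused, parts in FUSED_GROUPS.items():
            present = [f"{parent}.{p}" for p in parts
                       if f"{parent}.{p}" in names]
            # A fused layer is mixed if some parts are ignored and
            # others are not.
            flags = {m in ignored_names for m in present}
            if len(flags) > 1:
                mixed.append((parent, fused))
    return mixed
```

With these inputs the old patterns match nothing, so nothing is
skipped, while the new patterns pick up lm_head and both linear_attn
projections. Ignoring only `q_proj` in a layer is reported as a mixed
`qkv_proj`, which is the case the loader rejects.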
After `oneshot()` writes the FP8 output, MTP tensors (which the HF
class doesn't expose) are spliced in at BF16 from the upstream cached
snapshot, with the compressed_tensors metadata header preserved.
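One way to sanity-check the splice afterwards is to read a shard's
header directly. This is a minimal sketch, not the script's own code:
it assumes the "metadata header" above refers to the `__metadata__`
entry of the safetensors JSON header, and the `mtp.` name prefix and
both helper names are hypothetical.

```python
import json
import struct


def read_safetensors_header(path):
    """Return (metadata, tensor_entries) from a .safetensors header.

    The file starts with an 8-byte little-endian length, followed by
    that many bytes of UTF-8 JSON describing every tensor.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Free-form string metadata sits under a reserved key.
    metadata = header.pop("__metadata__", {})
    return metadata, header


def mtp_tensor_dtypes(path, prefix="mtp."):
    """Map each tensor whose name starts with `prefix` to its dtype."""
    _, tensors = read_safetensors_header(path)
    return {name: entry["dtype"]
            for name, entry in tensors.items()
            if name.startswith(prefix)}
```

Comparing the metadata before and after the splice, and confirming
that every spliced tensor reports `BF16`, covers both halves of the
claim above.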
Recipe follows Unsloth's UD-Q8_K_XL late-stack overrides (FFN: 50,
51, 59, 62, 63; ATTN: 51, 59, 63), extended to include `v_proj` for
fusion compatibility. Final checkpoint is ~35 GB (within ~1% of
Unsloth's GGUF size), with the vision tower and MTP head at BF16 and
most mlp/self_attn Linears at FP8_DYNAMIC.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>