waffles/consciousness

forked from kent/consciousness

Commit graph

Author

SHA1

Message

Date

Kent Overstreet

11a7e4043e

scripts: FP8 quantize Qwen3.6-27B for vLLM (multimodal + MTP)

Quantization recipe targeting the multimodal Qwen3.6-27B for vLLM
serving. Three pitfalls the script avoids, each documented inline:

1. Loader strip: `AutoModelForCausalLM` silently drops the vision
   tower; we load via the config-declared
   `Qwen3_5ForConditionalGeneration` instead.

2. Pattern anchor: llmcompressor matches the `ignore` list against
   module names (no `.weight` suffix) when walking `named_modules()`,
   not against full tensor names. Patterns now anchor on `$` at the
   module name; the earlier `\.weight$` form silently quantized
   lm_head and every linear_attn projection.

3. vLLM fusion: vLLM fuses {q,k,v}_proj into qkv_proj, gate+up into
   gate_up_proj, and in_proj_qkv+in_proj_z into in_proj_qkvz. The
   compressed_tensors loader rejects mixed schemes within a fused
   layer, so the `ignore` list is shaped to keep all sub-components
   of a fused layer consistent.
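
The anchoring pitfall in item 2 can be reproduced with the standard library alone. This is a minimal sketch, not the script's actual code: the module names are hypothetical, and `re.match` stands in for however llmcompressor applies regex `ignore` entries to the names yielded by `named_modules()`.

```python
import re

# Hypothetical module names, shaped like named_modules() output (no ".weight").
module_names = [
    "lm_head",
    "model.layers.0.linear_attn.in_proj_qkv",
    "model.layers.0.mlp.gate_proj",
]

def ignored(names, patterns):
    """Return the names matched by at least one pattern."""
    return [n for n in names if any(re.match(p, n) for p in patterns)]

# Earlier form: anchored on a tensor-name suffix that module names never carry.
tensor_style = [r".*lm_head\.weight$", r".*linear_attn\..*\.weight$"]

# Fixed form: `$` anchors at the end of the module name itself.
module_style = [r".*lm_head$", r".*linear_attn\..*$"]

old_hits = ignored(module_names, tensor_style)
new_hits = ignored(module_names, module_style)

print("tensor-style patterns ignore:", old_hits)
print("module-style patterns ignore:", new_hits)
```

With the tensor-style patterns nothing is ignored, so every listed module would fall through to quantization; the module-style patterns catch `lm_head` and the `linear_attn` projection while leaving the `mlp` projection eligible.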

After `oneshot()` writes the FP8 output, MTP tensors (which the HF
class doesn't expose) are spliced in at BF16 from the upstream cached
snapshot, with the compressed_tensors metadata header preserved.
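
A rough sketch of the splice step, using only the standard library and the documented safetensors layout (8-byte little-endian header length, JSON header, raw tensor bytes). Several things here are assumptions rather than the script's method: the `mtp.` name prefix, the helper names, and the reading of "metadata header" as the file's `__metadata__` entry. The real script most likely works through the `safetensors` library instead.

```python
import json
import struct

def build_safetensors(tensors, metadata):
    """Serialize {name: (dtype, shape, raw_bytes)} into the safetensors layout."""
    header, blobs, offset = {"__metadata__": metadata}, [], 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        blobs.append(raw)
        offset += len(raw)
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + b"".join(blobs)

def parse_safetensors(blob):
    """Return ({name: (dtype, shape, raw_bytes)}, metadata) from a safetensors blob."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len])
    metadata = header.pop("__metadata__", {})
    data = blob[8 + header_len:]
    tensors = {}
    for name, entry in header.items():
        start, end = entry["data_offsets"]
        tensors[name] = (entry["dtype"], entry["shape"], data[start:end])
    return tensors, metadata

def splice(quantized_blob, upstream_blob, prefix="mtp."):
    """Copy tensors whose names start with `prefix` from upstream into the
    quantized file, keeping the quantized file's metadata untouched."""
    q_tensors, q_meta = parse_safetensors(quantized_blob)
    u_tensors, _ = parse_safetensors(upstream_blob)
    for name, tensor in u_tensors.items():
        if name.startswith(prefix):  # hypothetical MTP naming convention
            q_tensors[name] = tensor
    return build_safetensors(q_tensors, q_meta)
```

Only tensors under the chosen prefix are taken from the upstream file, so the already-quantized weights and the quantized file's metadata pass through unchanged.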

Recipe follows Unsloth's UD-Q8_K_XL late-stack overrides (FFN: 50,
51, 59, 62, 63; ATTN: 51, 59, 63), extended to include `v_proj` for
fusion compat. Final checkpoint is ~35 GB (matches Unsloth's GGUF
size to within ~1%) with vision tower BF16, MTP head BF16, and most
mlp/self_attn Linears at FP8_DYNAMIC.
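
The reason `v_proj` has to join the attention overrides can be checked mechanically. The sketch below is illustrative only: the fusion groups come from item 3 above, while the `gate_proj`/`up_proj` spelling of "gate+up", the module paths, and the helper name are assumptions. It flags any fused layer whose members are only partly in the `ignore` set.

```python
from collections import defaultdict

# Fusion groups as described in item 3 (fused name -> member modules).
FUSED_GROUPS = {
    "qkv_proj": ("q_proj", "k_proj", "v_proj"),
    "gate_up_proj": ("gate_proj", "up_proj"),
    "in_proj_qkvz": ("in_proj_qkv", "in_proj_z"),
}

def mixed_fused_layers(ignored):
    """Map each partially ignored fused layer to the members missing from `ignored`."""
    leaves_by_parent = defaultdict(set)
    for name in ignored:
        parent, _, leaf = name.rpartition(".")
        leaves_by_parent[parent].add(leaf)

    problems = {}
    for parent, leaves in leaves_by_parent.items():
        for fused, members in FUSED_GROUPS.items():
            present = leaves & set(members)
            if present and present != set(members):
                problems[f"{parent}.{fused}"] = sorted(set(members) - present)
    return problems

# Hypothetical module paths for one of the overridden attention layers.
attn = "model.layers.51.self_attn"
without_v = {f"{attn}.q_proj", f"{attn}.k_proj"}
with_v = without_v | {f"{attn}.v_proj"}

print(mixed_fused_layers(without_v))
print(mixed_fused_layers(with_v))
```

Leaving `v_proj` out yields a mixed `qkv_proj`, which is the case the compressed_tensors loader rejects; adding it makes the group uniform and the check returns an empty mapping.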

Co-Authored-By: Proof of Concept <poc@bcachefs.org>

2026-04-24 22:15:31 -04:00

1 commit