consciousness/scripts
Kent Overstreet 11a7e4043e scripts: FP8 quantize Qwen3.6-27B for vLLM (multimodal + MTP)
Quantization recipe targeting the multimodal Qwen3.6-27B for vLLM
serving. Three pitfalls the script avoids, each documented inline:

1. Loader strip: `AutoModelForCausalLM` silently drops the vision
   tower; we load via the config-declared
   `Qwen3_5ForConditionalGeneration` instead.

2. Pattern anchor: llmcompressor matches the `ignore` list against
   module names (no `.weight` suffix) while walking `named_modules()`,
   not against full tensor names. Patterns now end in `$` directly
   after the module name; the earlier `\.weight$` form never matched,
   so lm_head and every linear_attn projection were silently quantized.

3. vLLM fusion: vLLM fuses {q,k,v}_proj into qkv_proj, gate+up into
   gate_up_proj, and in_proj_qkv+in_proj_z into in_proj_qkvz. The
   compressed_tensors loader rejects mixed schemes within a fused
   layer, so the `ignore` list is shaped to keep all sub-components
   of a fused layer consistent.
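
The anchoring pitfall in item 2 can be reproduced with the standard `re` module alone. This is a minimal sketch, not the script's actual pattern list: the module names and both pattern sets are hypothetical, and it assumes patterns are applied with `re.match` against names as `named_modules()` yields them.

```python
import re

# Hypothetical module names, in the form named_modules() yields them
# (module paths, with no ".weight" suffix).
module_names = [
    "lm_head",
    "model.layers.0.linear_attn.in_proj_qkv",
    "model.layers.0.mlp.down_proj",
]

# Hypothetical patterns written against tensor names (the earlier form).
old_patterns = [r".*lm_head\.weight$", r".*linear_attn\..*\.weight$"]
# Hypothetical patterns anchored at the end of the module name (the fix).
new_patterns = [r".*lm_head$", r".*linear_attn\..*$"]


def ignored(names, patterns):
    """Return the names matched by at least one pattern."""
    return [n for n in names if any(re.match(p, n) for p in patterns)]


ignored_old = ignored(module_names, old_patterns)
ignored_new = ignored(module_names, new_patterns)

print("old form ignores:", ignored_old)
print("new form ignores:", ignored_new)
```

With the tensor-name form nothing is ignored, so every listed module would be quantized; the module-name form excludes `lm_head` and the `linear_attn` projection and leaves the `mlp` Linear eligible for FP8.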

After `oneshot()` writes the FP8 output, MTP tensors (which the HF
class doesn't expose) are spliced in at BF16 from the upstream cached
snapshot, with the compressed_tensors metadata header preserved.
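
For the splice step, a stdlib-only sketch of reading a safetensors header may help: the file starts with an 8-byte little-endian length, followed by a JSON header whose optional `__metadata__` entry carries the metadata that has to survive a rewrite. The `mtp.` prefix and both helper names are assumptions for illustration, not taken from the script.

```python
import json
import struct


def read_safetensors_header(path):
    """Return (metadata, tensor_index) parsed from a .safetensors header."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian u64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is a free-form string map; everything else is a tensor entry.
    metadata = header.pop("__metadata__", {})
    return metadata, header


def select_tensors(index, prefix="mtp."):
    """List tensor names under a prefix (the "mtp." prefix is hypothetical)."""
    return sorted(name for name in index if name.startswith(prefix))
```

A splice would read `metadata` from the quantized shard, add the selected tensors from the upstream snapshot, and pass the same metadata back when writing the shard out.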

Recipe follows Unsloth's UD-Q8_K_XL late-stack overrides (FFN: 50,
51, 59, 62, 63; ATTN: 51, 59, 63), extended to include `v_proj` for
fusion compat. Final checkpoint is ~35 GB (matches Unsloth's GGUF
size to within ~1%) with vision tower BF16, MTP head BF16, and most
mlp/self_attn Linears at FP8_DYNAMIC.
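
One way to derive a fusion-safe `ignore` list from the layer indices above is sketched here, assuming the overrides are expressed as `re:`-prefixed ignore entries. The layer indices and fused groups come from this message; the starting leaf selections and the `layers.N.self_attn` / `layers.N.mlp` path layout are hypothetical.

```python
# Projections that vLLM fuses into one layer; each group must share a scheme.
FUSED_GROUPS = [
    {"q_proj", "k_proj", "v_proj"},   # -> qkv_proj
    {"gate_proj", "up_proj"},         # -> gate_up_proj
    {"in_proj_qkv", "in_proj_z"},     # -> in_proj_qkvz
]

ATTN_LAYERS = [51, 59, 63]
FFN_LAYERS = [50, 51, 59, 62, 63]


def expand_fused(leaves):
    """Add every sibling of any fused group that the selection touches."""
    out = set(leaves)
    for group in FUSED_GROUPS:
        if out & group:
            out |= group
    return out


def ignore_patterns(attn_leaves, ffn_leaves):
    """Build module-name patterns, anchored with `$`, for the override layers."""
    patterns = []
    for layer in ATTN_LAYERS:
        for leaf in sorted(expand_fused(attn_leaves)):
            patterns.append(rf"re:.*layers\.{layer}\.self_attn\.{leaf}$")
    for layer in FFN_LAYERS:
        for leaf in sorted(expand_fused(ffn_leaves)):
            patterns.append(rf"re:.*layers\.{layer}\.mlp\.{leaf}$")
    return patterns


# Hypothetical starting selection: v_proj is absent and gets pulled in.
patterns = ignore_patterns({"q_proj", "k_proj"},
                           {"gate_proj", "up_proj", "down_proj"})
print(len(patterns), "ignore patterns")
```

Expanding by fused group rather than listing projections by hand means a later change to the selection cannot leave one member of `qkv_proj` or `gate_up_proj` on a different scheme from its siblings.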

Co-Authored-By: Proof of Concept <poc@bcachefs.org>
2026-04-24 22:15:31 -04:00
Dockerfile.vllm Consolidate poc-memory and poc-agent configs 2026-03-19 21:49:58 -04:00
provision-mi300x.sh Consolidate poc-memory and poc-agent configs 2026-03-19 21:49:58 -04:00
provision-mistralrs.sh poc-agent: read context_groups from config instead of hardcoded list 2026-03-24 01:53:28 -04:00
provision-vllm.sh Consolidate poc-memory and poc-agent configs 2026-03-19 21:49:58 -04:00
quantize_qwen3_6_mm.py scripts: FP8 quantize Qwen3.6-27B for vLLM (multimodal + MTP) 2026-04-24 22:15:31 -04:00