Quantization recipe targeting the multimodal Qwen3.6-27B for vLLM
serving. Three pitfalls the script avoids, each documented inline:
1. Loader strip: `AutoModelForCausalLM` silently drops the vision
tower; we load via the config-declared
`Qwen3_5ForConditionalGeneration` instead.
2. Pattern anchor: llmcompressor matches the `ignore` list against
module names (no `.weight` suffix) when walking `named_modules()`,
not against full tensor names. Patterns now anchor on `$` at the
module name; the earlier `\.weight$` form silently quantized
lm_head and every linear_attn projection.
3. vLLM fusion: vLLM fuses {q,k,v}_proj into qkv_proj, gate+up into
gate_up_proj, and in_proj_qkv+in_proj_z into in_proj_qkvz. The
compressed_tensors loader rejects mixed schemes within a fused
layer, so the `ignore` list is shaped to keep all sub-components
of a fused layer consistent.
After `oneshot()` writes the FP8 output, MTP tensors (which the HF
class doesn't expose) are spliced in at BF16 from the upstream cached
snapshot, with the compressed_tensors metadata header preserved.
Recipe follows Unsloth's UD-Q8_K_XL late-stack overrides (FFN: 50,
51, 59, 62, 63; ATTN: 51, 59, 63), extended to include `v_proj` for
fusion compat. Final checkpoint is ~35 GB (matches Unsloth's GGUF
size to within ~1%) with vision tower BF16, MTP head BF16, and most
mlp/self_attn Linears at FP8_DYNAMIC.
Co-Authored-By: Proof of Concept <poc@bcachefs.org>
- Remove MEMORY_FILES constant from identity.rs
- Add ContextGroup struct for deserializing from config
- Load context_groups from ~/.config/poc-agent/config.json5
- Check ~/.config/poc-agent/ first for identity files, then project/global
- Debug screen now shows what's actually configured
This eliminates the hardcoded duplication and makes the debug output
match what's in the config file.
poc-memory now reads from poc-agent's config.json5 as the primary
config source. Memory-specific settings live in a "memory" section;
API credentials are resolved from the shared model/backend config
instead of being duplicated.
- Add "memory" section to ~/.config/poc-agent/config.json5
- poc-memory config.rs: try shared config first, fall back to
legacy JSONL
- API fields (base_url, api_key, model) resolved via
memory.agent_model -> models -> backend lookup
- Add json5 dependency for proper JSON5 parsing
- Update provisioning scripts: hermes -> qwen3_coder tool parser
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ROCm-specific setup with:
- AITER attention backends (VLLM_ROCM_USE_AITER=1)
- Reduced cudagraph capture size (DeltaNet cache conflict)
- BF16 model + FP8 KV cache as default (FP8 weights can be
slower on MI300X due to ROCm kernel maturity)
- FP8=1 flag for benchmarking FP8 model weights
Key for training plan: if FP8 matmuls are slow on MI300X,
the quantize-and-expand strategy needs B200 instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cursor index is into self.input, but the rendered buffer contains
the prompt prepended to the first line. Need to add prompt.len() to
get the correct character position when scanning the buffer.
- Always display reasoning tokens regardless of reasoning_effort
setting — Qwen 3.5 thinks natively and the reasoning parser
separates it into its own field
- Remove chat_template_kwargs that disabled thinking when
reasoning_effort was "none"
- Add chat_template_kwargs field to ChatRequest for vllm compat
- Update provision script: qwen3_xml tool parser, qwen3 reasoning
parser, 262K context, 95% GPU memory utilization
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sets up vllm with Qwen 2.5 27B Instruct, prefix caching enabled,
Hermes tool call parser for function calling support. Configurable
via environment variables (MODEL, PORT, MAX_MODEL_LEN).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All agent output now goes to the store as nodes instead of
markdown/JSON files. Each node carries a Provenance enum identifying
which agent created it (AgentDigest, AgentConsolidate, AgentFactMine,
AgentKnowledgeObservation, etc — 14 variants total).
Store changes:
- upsert_provenance() method for agent-created nodes
- Provenance enum expanded from 5 to 14 variants
Agent changes:
- digest: writes to store nodes (daily-YYYY-MM-DD.md etc)
- consolidate: reports/actions/logs stored as _consolidation-* nodes
- knowledge: depth DB and agent output stored as _knowledge-* nodes
- enrich: experience-mine results go directly to store
- llm: --no-session-persistence prevents transcript accumulation
Deleted: 14 Python/shell scripts replaced by Rust implementations.
Four layer-2 agents that produce new knowledge from the memory graph:
mine conversations, extract patterns from clusters, find cross-domain
connections, stress-test existing nodes. Output to agent-results/.
knowledge_loop.py runs them on a schedule with quality tracking.
- New spectral module: Laplacian eigendecomposition of the memory graph.
Commands: spectral, spectral-save, spectral-neighbors, spectral-positions,
spectral-suggest. Spectral neighbors expand search results beyond keyword
matching to structural proximity.
- Search: use StoreView trait to avoid 6MB state.bin rewrite on every query.
Append-only retrieval logging. Spectral expansion shows structurally
nearby nodes after text results.
- Fix panic in journal-tail: string truncation at byte 67 could land inside
a multi-byte character (em dash). Now walks back to char boundary.
- Replay queue: show classification and spectral outlier score.
- Knowledge agents: extractor, challenger, connector prompts and runner
scripts for automated graph enrichment.
- memory-search hook: stale state file cleanup (24h expiry).
Add store_helpers.py with shared helpers that call poc-memory commands
(list-keys, render, journal-tail) instead of globbing ~/.claude/memory/*.md
and parsing section headers.
All 9 Python scripts updated: get_semantic_keys(), get_topic_file_index(),
get_recent_journal(), parse_journal_entries(), read_journal_range(),
collect_topic_stems(), and file preview rendering now go through the store.
This completes the clean switch — no script reads archived markdown files.
Faster serialization/deserialization, smaller on disk (4.2MB vs 5.9MB).
Automatic migration from state.json on first load — reads the JSON,
writes state.bin, deletes the old file.
Added list-keys, list-edges, dump-json commands so Python scripts no
longer need to parse the cache directly. Updated bulk-categorize.py
and consolidation-loop.py to use the new CLI commands.