consciousness/research/qwen35-thinking-fix.md

Qwen 3.5 Thinking Mode Fix

Problem

poc-agent uses Qwen 3.5 27B, but thinking traces (<think>...</think>) aren't appearing in its output.

Root Causes

1. Generation prompt missing <think>\n

Qwen 3.5's chat template adds <think>\n after <|im_start|>assistant\n when thinking is enabled. poc-agent doesn't do this.

Current (mod.rs:287):

tokens.extend(tokenizer::encode("assistant\n"));

Fix:

tokens.extend(tokenizer::encode("assistant\n<think>\n"));
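A minimal sketch of the fixed prompt tail, assembled as a string for illustration; the real code in mod.rs encodes token IDs rather than concatenating text, and `generation_prompt` is a hypothetical helper, not a function in poc-agent. The key change is the trailing `<think>\n` after the assistant header.

```rust
// Hypothetical string-level sketch of the generation prompt; the actual
// fix appends the same text via tokenizer::encode in mod.rs.
fn generation_prompt(user_msg: &str) -> String {
    format!(
        // Opening the think block here makes the model continue inside it,
        // matching what Qwen 3.5's chat template does with thinking enabled.
        "<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n<think>\n",
        user_msg
    )
}
```

The model then generates the thinking trace first and closes it with `</think>` before the visible answer.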

2. Missing presence_penalty

Unsloth's recommended settings for Qwen thinking mode include presence_penalty: 1.5, which prevents the repetitive or circular thinking the model is otherwise prone to. poc-agent's SamplingParams has no presence_penalty field at all.

Current (api/mod.rs:36-40):

pub(crate) struct SamplingParams {
    pub temperature: f32,
    pub top_p: f32,
    pub top_k: u32,
}

Fix - add to struct:

pub presence_penalty: f32,

And add to API request (api/mod.rs:117-128):

"presence_penalty": sampling.presence_penalty,
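A sketch of the resulting request body, using stdlib string formatting instead of whatever serialization api/mod.rs actually uses; `request_body` is a hypothetical helper and the struct mirrors SamplingParams with the new field added.

```rust
// Mirrors the SamplingParams struct from api/mod.rs, extended with the
// proposed presence_penalty field.
struct SamplingParams {
    temperature: f32,
    top_p: f32,
    top_k: u32,
    presence_penalty: f32,
}

// Hypothetical sketch of the JSON body sent to the server, showing where
// presence_penalty lands alongside the existing sampling fields.
fn request_body(sampling: &SamplingParams) -> String {
    format!(
        "{{\"temperature\":{},\"top_p\":{},\"top_k\":{},\"presence_penalty\":{}}}",
        sampling.temperature, sampling.top_p, sampling.top_k, sampling.presence_penalty
    )
}
```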

3. Using /completions endpoint

poc-agent uses /completions with raw tokens, not /chat/completions. This bypasses vLLM's chat template handling entirely. Any server-side --chat-template-kwargs '{"enable_thinking": true}' config has no effect.

This isn't necessarily wrong, but it means poc-agent must handle thinking manually: open the think block in the generation prompt itself and strip the <think>...</think> trace out of the raw completion text.
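Since /completions returns raw generated text, the agent has to separate the thinking trace from the answer itself. A minimal sketch of such a parser (not code from poc-agent), assuming a single <think>...</think> block at the start of the output:

```rust
// Split a raw completion into (thinking trace, visible answer).
// Returns (None, output) unchanged if no leading think block is found.
fn split_thinking(output: &str) -> (Option<&str>, &str) {
    let trimmed = output.trim_start();
    if let Some(rest) = trimmed.strip_prefix("<think>") {
        if let Some(end) = rest.find("</think>") {
            let trace = rest[..end].trim();
            let answer = rest[end + "</think>".len()..].trim_start();
            return (Some(trace), answer);
        }
    }
    (None, output)
}
```

A real implementation would also handle a truncated trace (generation stopped before `</think>`), which this sketch ignores.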

Qwen 3.5 vs Qwen 3

Important: Qwen 3.5 removed soft switch support. The /think and /no_think commands that worked in Qwen 3 do NOT work in Qwen 3.5.

Thinking must be controlled via:

  • enable_thinking parameter in chat template
  • Or manually adding <think>\n to the generation prompt

From Unsloth documentation:

Thinking Mode - Precise Coding:

  • Temperature: 0.6 (poc-agent already uses this)
  • Top-p: 0.95
  • Top-k: 20
  • Presence penalty: 1.5
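The recommended values above map directly onto SamplingParams. A sketch, assuming the presence_penalty field from the fix in section 2 and a hypothetical `thinking_defaults` constructor:

```rust
#[derive(Debug, PartialEq)]
struct SamplingParams {
    temperature: f32,
    top_p: f32,
    top_k: u32,
    presence_penalty: f32,
}

impl SamplingParams {
    // Unsloth's recommended Qwen 3.5 thinking-mode settings.
    fn thinking_defaults() -> Self {
        SamplingParams {
            temperature: 0.6,       // poc-agent already uses this
            top_p: 0.95,
            top_k: 20,
            presence_penalty: 1.5,  // the missing piece
        }
    }
}
```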

Implementation Options

Option A: Always enable thinking

Just add <think>\n to the generation prompt. Simple, always-on thinking.

Option B: Configurable thinking

Add enable_thinking: bool to agent state/config. When true, add <think>\n. When false, add <think>\n\n</think>\n\n (empty think block tells model to skip thinking).
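Option B boils down to choosing the suffix appended after the assistant header. A sketch with a hypothetical helper:

```rust
// Suffix appended after "<|im_start|>assistant\n" depending on the
// enable_thinking flag. The pre-closed empty think block in the false
// branch signals the model to skip reasoning entirely.
fn thinking_suffix(enable_thinking: bool) -> &'static str {
    if enable_thinking {
        "<think>\n"
    } else {
        "<think>\n\n</think>\n\n"
    }
}
```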

Option C: Think tool approach

Instead of native <think> tags, add a "think" tool (like Anthropic's approach). The model calls it explicitly when it needs to reason. More control, but different from Qwen's native approach.
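For Option C, a sketch of what such a tool definition might look like in OpenAI-style function-calling format, modeled loosely on Anthropic's published "think" tool; the exact schema and description here are illustrative, not from poc-agent:

```rust
// Hypothetical "think" tool definition. The tool takes one free-text
// parameter and returns nothing useful; the value is that the model
// reasons inside the tool call instead of native <think> tags.
fn think_tool_definition() -> &'static str {
    r#"{
  "type": "function",
  "function": {
    "name": "think",
    "description": "Use this tool to reason about something. It does not fetch new information; it only records the thought.",
    "parameters": {
      "type": "object",
      "properties": {
        "thought": { "type": "string", "description": "A thought to think about." }
      },
      "required": ["thought"]
    }
  }
}"#
}
```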

Sources