Qwen 3.6 CoT issue?

Reddit r/LocalLLaMA / 4/19/2026

💬 OpinionDeveloper Stack & InfrastructureSignals & Early TrendsModels & Research

Key Points

  • A user reports an issue with Qwen 3.6 CoT handling where the model sometimes ends a <think> block with the multi-token sequence < /thinking > instead of the expected single < /think > token, causing the harness to miss end-of-CoT detection.
  • The behavior appeared while running Qwen 3.6 A3B in llama-server and was observed only in a handful of cases, including at relatively low n_past values.
  • The user suggests the problem may be related to quantization (using iq4_nl via Unsloth quant with unquantized cache/state), but asks whether others have seen the same pattern.
  • A suggested workaround would require sampler/KV-cache-level handling, which would entail modifying llama-server or re-implementing an OpenAI-style completions API.
  • On the harness side, the symptom is an API failure indicating the model returned no output (or similar), triggered by end-of-CoT detection failing.

So the Qwen vocab has distinct tokens for <think> and </think>. I know this because an app I wrote pushes those tokens to the cache after <|im_start|>assistant to stop CoT selectively. Great.

Yesterday I was fucking around with some coding harnesses and qwen 3.6 A3B running in llama-server, and it worked rather well except for a handful of instances where instead of ending its CoT with the single token </think> it pushed the multi token sequence </thinking> at the end of its CoT block instead. Needless to say this meant that the end of the CoT block didn't get detected and the harness got confused.

Obviously this is easy enough to fix at the sampler/ KV cache level, but it'd mean hacking llama-server or implementing the openai completions API myself, which I'm not mad keen on doing. I guess I'm posting this for a couple reasons:

  • do we figure this was probably quantisation-related? I was using the iq4_nl unsloth quant at the time, with unquantised cache and recurrent state (ie no -ctk/ctv args to llama-server). FWIW this happened at arbitrary n_past positions, as low as 16k/128k or so.

  • have any of you folks seen the same thing? On the harness side it manifests as an API failure ("the model didn't return any output to our prompt") or similar.

submitted by /u/Confident_Ideal_5385
[link] [comments]