So the Qwen vocab has distinct tokens for <think> and </think>. I know this because an app I wrote pushes those tokens to the cache after <|im_start|>assistant to stop CoT selectively. Great.
Yesterday I was fucking around with some coding harnesses and qwen 3.6 A3B running in llama-server, and it worked rather well except for a handful of instances where instead of ending its CoT with the single token </think> it pushed the multi token sequence </thinking> at the end of its CoT block instead. Needless to say this meant that the end of the CoT block didn't get detected and the harness got confused.
Obviously this is easy enough to fix at the sampler/ KV cache level, but it'd mean hacking llama-server or implementing the openai completions API myself, which I'm not mad keen on doing. I guess I'm posting this for a couple reasons:
do we figure this was probably quantisation-related? I was using the iq4_nl unsloth quant at the time, with unquantised cache and recurrent state (ie no -ctk/ctv args to llama-server). FWIW this happened at arbitrary n_past positions, as low as 16k/128k or so.
have any of you folks seen the same thing? On the harness side it manifests as an API failure ("the model didn't return any output to our prompt") or similar.
[link] [comments]




