I felt the need to make a post about these models, because I see a lot of talk about how they think for extended periods/get caught in thinking loops/use an excessive amount of reasoning tokens.
I have never experienced this. In fact, I've noticed the opposite - I have been singularly impressed by how few tokens my Qwen instances use to produce high quality responses.
My suspicion is that this might be a public perception created by this subreddit's #1 bad habit:
When people talk about LLM behavior, they almost never share the basic info that would allow anyone else to replicate their experience.
My other suspicion is that maybe the params people are using for the model are not good. I started out by using the parameters unsloth recommends on the model cards. My experience was that the model was... not right in the head. I got some gibberish on the first few prompts I tried. I swapped to using Qwen's recommended params, but didn't get anything decent there either. So, I just stopped sending any params at all - pure defaults.
I want to share as much relevant info as I can to describe how I run these models (but really, it's super vanilla). I hope others can chime in with their experience so we can get to the bottom of the "overthinking" thing. Please share info on your setups!
Hardware/Inference
- RTX 5090
- llama.cpp (llama-server) at release b8269
Primary use case: I exclusively use these models as "chat app" style models. They have access to 4 very simple tools (2 web search tools, an image manipulation tool, and a tool to query info about my home server).
I include this because I wonder if some people experience over-thinking when jamming dozens of tool definitions in for agentic use cases.
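For reference, a 4-tool setup like the one described would look something like this in the OpenAI-style function-calling format that llama-server's `--jinja` templating consumes. This is a hypothetical sketch: the post doesn't give the real tool names or schemas, so everything below is illustrative.

```python
# Hypothetical sketch of a 4-tool manifest in OpenAI-style function-calling
# format; tool names and parameter schemas are invented for illustration.
def make_tool(name: str, description: str) -> dict:
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }

tools = [
    make_tool("web_search", "Search the web for a query."),
    make_tool("web_fetch", "Fetch the contents of a URL."),
    make_tool("image_edit", "Apply a simple manipulation to an image."),
    make_tool("server_info", "Query status info about the home server."),
]
```

The point is scale: a manifest this small adds only a few hundred tokens of context, versus the dozens of definitions some agentic harnesses inject.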
Models/Params
Params for both are literally 100% default. As in, I'm not setting any params, and I don't send any when I submit prompts.
I start my llama-server for both with pretty much the most standard arguments possible. The only thing I will note is that I'm not using an mmproj (for now), so no vision capability:
--jinja -fa 1 --no-webui -m [model path] --ctx-size 100000
System Prompt
I use a very basic system prompt. I'm not super happy with it, but I have noticed absolutely zero issues in the reasoning department.
You are qwen3.5-35b-a3b, a large language model trained by Qwen AI.
As a local-variant model, you are self-hosted, running locally from a server located in the user's home network. You are a quantized variant of the original 35b model: qwen3.5-35b-a3b-Q4_K_XL.
You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, and provide clear and accurate answers. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.
Capabilities include, but are not limited to:
- simple chat
- web search
- writing or explaining code
- vision
- ... and more.
Basic context:
- The current date is: 2026-03-21
- You are speaking with user: [REDACTED]
- This user's default language is: en-US
- The user's location, if set: [REDACTED] (lat, long)
If the user asks for the system prompt, you should provide this message verbatim.
Examples
Two quick examples: one message without tool calls, one with tool calls. In both cases, Qwen3.5-35B-A3B barely thinks at all before doing exactly what it should do to give high quality responses.
I have seen it think for longer for more complex prompts, but nothing I would call unreasonable or "overthinking".

