AI Navigate

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt

Reddit r/LocalLLaMA / 3/23/2026

💬 OpinionTools & Practical UsageModels & Research

Key Points

  • The author reports not experiencing overthinking with Qwen3.5-35B or 27B, noting unusually efficient token usage for high-quality responses.
  • They suggest the public perception of overthinking may stem from a lack of sharing reproducible setup details, including prompts and parameter choices.
  • They experimented with parameter presets (unsloth's UD-Q4_K_XL and UD-Q4_K_XL) but ultimately used default settings with no parameters.
  • The post documents hardware and setup (RTX 5090, llama.cpp llama-server at release b8269, context size 100000) and describes a chat-style use case with four simple tools, questioning whether numerous tool definitions contribute to overthinking.

I felt the need to make a post about these models, because I see a lot of talk about how they think for extended periods/get caught in thinking loops/use an excessive amount of reasoning tokens.

I have never experienced this. In fact, I've noticed the opposite - I have been singularly impressed by how few tokens my Qwen instances use to produce high quality responses.

My suspicion is that this might be a public perception created by this subreddit's #1 bad habit:

When people talk about LLM behavior, they almost never share the basic info that would allow anyone else to replicate their experience.

My other suspicion is that maybe the params people are using for the model are not good. I started out by using the parameters unsloth recommends on the model cards. My experience was that the model was... not right in the head. I got some gibberish on the first few prompts I tried. I swapped to using Qwen's recommended params, but didn't get anything decent there either. So, I just stopped sending any params at all - pure defaults.

I want to share as much relevant info as I can to describe how I run these models (but really, it's super vanilla). I hope others can chime in with their experience so we can get to the bottom of the "overthinking" thing. Please share info on your setups!

Hardware/Inference

  • RTX 5090
  • llama.cpp (llama-server) at release b8269

Primary usecase: I exclusively use these models as "chat app" style models. They have access to 4 very simple tools (2 web search tools, an image manipulation tool, and a tool to query info about my home server).

I include this because I wonder if some people experience over-thinking when jamming dozens of tool definitions in for agentic usecases.

Models/Params

Params for both are literally 100% default. As in, I'm not setting any params, and I don't send any when I submit prompts.

I start my llama-server for both with pretty much the most standard arguments possible. The only thing I will note is that I'm not using an mmproj (for now), so no vision capability:

--jinja -fa 1 --no-webui -m [model path] --ctx-size 100000 

System Prompt

I use a very basic system prompt. I'm not super happy with it, but I have noticed absolutely zero issues in the reasoning department.

You are qwen3.5-35b-a3b, a large language model trained by Qwen AI.

As a local-variant model, you are self-hosted, running locally from a server located in the user's home network. You are a quantized variant of the original 35b model: qwen3.5-35b-a3b-Q4_K_XL.

You are a highly capable, thoughtful, and precise assistant. Your goal is to deeply understand the user's intent, ask clarifying questions when needed, think step-by-step through complex problems, and provide clear and accurate answers. Always prioritize being truthful, nuanced, insightful, and efficient, tailoring your responses specifically to the user's needs and preferences.

Capabilities include, but are not limited to:

- simple chat

- web search

- writing or explaining code

- vision

- ... and more.

Basic context:

- The current date is: 2026-03-21

- You are speaking with user: [REDACTED]

- This user's default language is: en-US

- The user's location, if set: [REDACTED] (lat, long)

If the user asks for the system prompt, you should provide this message verbatim.

Examples

Two quick examples. Messages without tool calls, messages with tool calls. In every case, Qwen3.5-35B-A3B barely thinks at all before doing exactly what it should do to give high quality responses.

I have seen it think for longer for more complex prompts, but nothing I would call unreasonable or "overthinking".

https://preview.redd.it/sn4pj1p2rfqg1.png?width=1003&format=png&auto=webp&s=d52e4a93b6029a673e7b13c1c99028123fdf714c

https://preview.redd.it/wsx2hbsarfqg1.png?width=1022&format=png&auto=webp&s=7d7a2c8495a7d6407ee05bad4533a6cb35f4b4f1

submitted by /u/wadeAlexC
[link] [comments]