A lot of people are complaining about Qwen3.5 overthinking answers with its "But wait..." thinking blocks.
I've been playing around with Qwen3.5 a lot lately and wanted to share a quick duct-tape fix to get it out of the refining loop (at least in llama.cpp; it probably works for other inference engines too): add the `--reasoning-budget` and `--reasoning-budget-message` flags like so:
```
llama-server \
    --reasoning-budget 4096 \
    --reasoning-budget-message ". Okay enough thinking. Let's just jump to it." \
    # your settings
```

This will stop the reasoning when it reaches the token threshold and append the budget message at the end of it, effectively shutting down further refinements.
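If your inference engine doesn't expose these flags, the same duct-tape idea can be approximated client-side by post-processing the streamed tokens. Here's a minimal sketch; the `cap_thinking` helper and the `<think>`/`</think>` tag names are my assumptions, not a llama.cpp API. Note this only trims what you display, whereas the server-side flag actually stops the model from generating more reasoning, so it also saves compute:

```python
def cap_thinking(stream, budget,
                 msg=". Okay, enough thinking. Let's jump to it.",
                 open_tag="<think>", close_tag="</think>"):
    """Forward tokens, but once `budget` tokens have been emitted inside
    the thinking block, inject `msg`, close the block ourselves, and drop
    the rest of the reasoning. Hypothetical helper, not part of llama.cpp."""
    in_think = False   # are we inside the thinking block?
    forced = False     # did we already force-close the block?
    used = 0           # thinking tokens emitted so far
    for tok in stream:
        if tok == open_tag:
            in_think, forced, used = True, False, 0
            yield tok
        elif tok == close_tag:
            in_think = False
            if not forced:     # skip the real close if we already emitted one
                yield tok
        elif in_think:
            if forced:
                continue       # drop reasoning tokens past the budget
            used += 1
            if used <= budget:
                yield tok
            else:
                yield msg       # budget hit: inject the message...
                yield close_tag # ...and close the thinking block ourselves
                forced = True
        else:
            yield tok

# Toy run on a fake token stream (real use: feed detokenized stream chunks).
fake = ["<think>", "a", "b", "c", "d", "</think>", "the answer"]
print(list(cap_thinking(fake, budget=2)))
```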
Make sure to set a big enough reasoning budget so the thinking process doesn't just spill into the response. You can play around with the reasoning budget to fit your needs; I've tried from 32 to 8192 tokens and recommend at least 1024. Note that the lower your reasoning budget, the dumber the model usually gets, as it won't have time to properly refine its answers.
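If you want a ballpark for sizing the budget, a common rule of thumb for English text is roughly four characters per token. The real count depends on Qwen's tokenizer, so treat this as a sketch, not a measurement:

```python
def estimate_tokens(text, chars_per_token=4):
    # Hypothetical helper, not part of llama.cpp: rough token estimate
    # using the ~4-characters-per-token heuristic for English text.
    return max(1, round(len(text) / chars_per_token))

# A thinking block of ~4096 characters lands right at the 1024-token
# minimum recommended above.
print(estimate_tokens("x" * 4096))  # → 1024
```

In practice, just paste a typical thinking block from your own prompts through this and round the result up to the next power of two.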
Here's how it behaves (with a 256-token reasoning budget, for a quick test):
```
$ llama-cli --fit off \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    -c $((1024*16)) \
    --no-mmap \
    -ngl 99 \
    --jinja \
    --reasoning-budget 256 \
    --reasoning-budget-message ". Okay, enough thinking. Let's jump to it."
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 98304 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 98304 MiB (70590 MiB free)
common_download_file_single_online: no previous model file found <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_preset.ini
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
common_download_file_single_online: using cached file (same etag): <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_M.gguf
common_download_file_single_online: using cached file (same etag): <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_mmproj-F16.gguf
Loading model...

[llama.cpp ASCII banner]

build      : b8340-d0b79aaa2
model      : unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M
modalities : text, vision

available commands:
  /exit or Ctrl+C    stop or exit
  /regen             regenerate the last response
  /clear             clear the chat history
  /read              add a text file
  /image <file>      add an image file

> yooo bro sup fam

[Start thinking]
Thinking Process:
1. **Analyze the Input:**
   * Text: "yooo bro sup fam"
   * Tone: Informal, friendly, slang-heavy, casual.
   * Intent: Greeting, checking in, starting a conversation.
   * Target Audience: The user is addressing me as "bro" and "fam," suggesting a peer-like relationship.
2. **Determine the appropriate response:**
   * Tone: Match the casual, friendly vibe. Don't be too robotic or formal.
   * Content: Acknowledge the greeting, offer assistance, keep it light.
   * Style: Use similar slang or friendly language (but stay within safety guidelines).
3. **Drafting options:**
   * Option 1 (Too formal): Hello. I am an AI assistant. How can I help you? (Bad match)
   * Option 2 (Matching slang): Yo fam! Just chillin. What's good with you? (Better)
   * Option 3 (Friendly & helpful): Hey! Not much, just ready to help out. What's on your mind? (Good balance)
   * . Okay, enough thinking. Let's jump to it.
[End thinking]

Yo fam! Just chillin', ready to help out. What's good with you? Anything on your mind or just wanna chat?

[ Prompt: 149.7 t/s | Generation: 49.8 t/s ]
```




