A lot of people are complaining about Qwen3.5 overthinking answers with its "But wait..." thinking blocks.
I've been playing around with Qwen3.5 a lot lately and wanted to share a quick duct-tape fix to get it out of the refining loop (at least in llama.cpp; it probably works for other inference engines too): add the `--reasoning-budget` and `--reasoning-budget-message` flags like so:
```
llama-server \
    --reasoning-budget 4096 \
    --reasoning-budget-message ". Okay enough thinking. Let's just jump to it." \
    # your settings
```

This will stop the reasoning when it reaches the token threshold and append the budget message at the end of it, effectively shutting down further refinements.
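If your inference engine doesn't expose these flags, the same duct-tape idea can be approximated client-side by post-processing the streamed tokens. Here's a minimal sketch; the `cap_thinking` helper and the `<think>`/`</think>` tag names are my assumptions, not a llama.cpp API. Note this only trims what you display, whereas the server-side flag actually stops the model from generating more reasoning, so it also saves compute:

```python
def cap_thinking(stream, budget,
                 msg=". Okay, enough thinking. Let's jump to it.",
                 open_tag="<think>", close_tag="</think>"):
    """Forward tokens, but once `budget` tokens have been emitted inside
    the thinking block, inject `msg`, close the block ourselves, and drop
    the rest of the reasoning. Hypothetical helper, not part of llama.cpp."""
    in_think = False   # are we inside the thinking block?
    forced = False     # did we already force-close the block?
    used = 0           # thinking tokens emitted so far
    for tok in stream:
        if tok == open_tag:
            in_think, forced, used = True, False, 0
            yield tok
        elif tok == close_tag:
            in_think = False
            if not forced:     # skip the real close if we already emitted one
                yield tok
        elif in_think:
            if forced:
                continue       # drop reasoning tokens past the budget
            used += 1
            if used <= budget:
                yield tok
            else:
                yield msg       # budget hit: inject the message...
                yield close_tag # ...and close the thinking block ourselves
                forced = True
        else:
            yield tok

# Toy run on a fake token stream (real use: feed detokenized stream chunks).
fake = ["<think>", "a", "b", "c", "d", "</think>", "the answer"]
print(list(cap_thinking(fake, budget=2)))
```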
Make sure to set a big enough reasoning budget so the thinking process doesn't just spill into the response. You can play around with the reasoning budget to fit your needs; I've tried from 32 to 8192 tokens and recommend at least 1024. Note that the lower your reasoning budget, the dumber the model usually gets, as it won't have time to properly refine its answers.
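If you want a ballpark for sizing the budget, a common rule of thumb for English text is roughly four characters per token. The real count depends on Qwen's tokenizer, so treat this as a sketch, not a measurement:

```python
def estimate_tokens(text, chars_per_token=4):
    # Hypothetical helper, not part of llama.cpp: rough token estimate
    # using the ~4-characters-per-token heuristic for English text.
    return max(1, round(len(text) / chars_per_token))

# A thinking block of ~4096 characters lands right at the 1024-token
# minimum recommended above.
print(estimate_tokens("x" * 4096))  # → 1024
```

In practice, just paste a typical thinking block from your own prompts through this and round the result up to the next power of two.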
Here's how it behaves (with a 256-token reasoning budget, for a quick test):
```
$ llama-cli --fit off \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    -c $((1024*16)) \
    --no-mmap \
    -ngl 99 \
    --jinja \
    --reasoning-budget 256 \
    --reasoning-budget-message ". Okay, enough thinking. Let's jump to it."
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 98304 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 98304 MiB (70590 MiB free)
common_download_file_single_online: no previous model file found <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_preset.ini
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
common_download_file_single_online: using cached file (same etag): <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_M.gguf
common_download_file_single_online: using cached file (same etag): <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_mmproj-F16.gguf
Loading model...

[llama.cpp ASCII banner]

build      : b8340-d0b79aaa2
model      : unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M
modalities : text, vision

available commands:
  /exit or Ctrl+C    stop or exit
  /regen             regenerate the last response
  /clear             clear the chat history
  /read              add a text file
  /image <file>      add an image file

> yooo bro sup fam

[Start thinking]
Thinking Process:
1. **Analyze the Input:**
   * Text: "yooo bro sup fam"
   * Tone: Informal, friendly, slang-heavy, casual.
   * Intent: Greeting, checking in, starting a conversation.
   * Target Audience: The user is addressing me as "bro" and "fam," suggesting a peer-like relationship.
2. **Determine the appropriate response:**
   * Tone: Match the casual, friendly vibe. Don't be too robotic or formal.
   * Content: Acknowledge the greeting, offer assistance, keep it light.
   * Style: Use similar slang or friendly language (but stay within safety guidelines).
3. **Drafting options:**
   * Option 1 (Too formal): Hello. I am an AI assistant. How can I help you? (Bad match)
   * Option 2 (Matching slang): Yo fam! Just chillin. What's good with you? (Better)
   * Option 3 (Friendly & helpful): Hey! Not much, just ready to help out. What's on your mind? (Good balance)
   * . Okay, enough thinking. Let's jump to it.
[End thinking]

Yo fam! Just chillin', ready to help out. What's good with you? Anything on your mind or just wanna chat?

[ Prompt: 149.7 t/s | Generation: 49.8 t/s ]
```




