A duct-tape fix for Qwen3.5's overthinking and anxiety

Reddit r/LocalLLaMA / 2026/3/16

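Key points

The post discusses Qwen3.5's tendency to overthink while responding and to get stuck in refinement loops.
As a duct-tape fix, it proposes adding the --reasoning-budget and --reasoning-budget-message flags in llama.cpp: reasoning stops once a token threshold is exceeded and a closing message is appended, which effectively halts further refinement. The author notes it may also work with other inference engines.
It recommends a generous reasoning budget (at least 1024 tokens) and warns that too small a budget can degrade answer quality. The author experimented with values from 32 to 8192.
It gives concrete command-line examples showing how to apply the fix and what the resulting behavior looks like.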

A lot of people are frustrated that Qwen3.5 overthinks its answers, filling its thinking blocks with "But wait..." detours.

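I've been playing with Qwen3.5 a lot lately and want to share a quick duct-tape fix that gets it out of the refinement loop (at least on llama.cpp; it probably works for other inference engines too): add the --reasoning-budget and --reasoning-budget-message flags like so: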

llama-server \
  --reasoning-budget 4096 \
  --reasoning-budget-message ". Okay enough thinking. Let's just jump to it." \
  # your settings

This stops the reasoning once it reaches a given token threshold and appends the budget message at the end, which effectively cuts off any further refinement.
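To make the cut-off concrete, here is a toy shell sketch of the idea. It is not how llama.cpp implements the flags: whitespace-separated words stand in for tokens, and `cut_reasoning` is a hypothetical helper name invented for this illustration.

```shell
# Toy illustration only: words stand in for model tokens.
# cut_reasoning is a hypothetical helper, not part of llama.cpp.
cut_reasoning() {
  budget="$1"     # maximum number of reasoning "tokens" to keep
  message="$2"    # text appended when the budget runs out
  shift 2         # the remaining arguments are the reasoning "tokens"
  out=""
  count=0
  for tok in "$@"; do
    if [ "$count" -ge "$budget" ]; then
      # Budget exhausted while reasoning is still going: stop and append the message.
      printf '%s%s\n' "$out" "$message"
      return 0
    fi
    out="${out:+$out }$tok"
    count=$((count + 1))
  done
  # Reasoning ended within the budget: nothing is appended.
  printf '%s\n' "$out"
}

cut_reasoning 3 ". Okay enough thinking." But wait, maybe I should reconsider
# prints: But wait, maybe. Okay enough thinking.
```

The message is glued straight onto the truncated reasoning, which is presumably why the post's budget messages begin with ". ": the period closes whatever half-finished sentence the model was in.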

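Set a reasoning budget large enough that the thought process does not spill over into the reply. You can tune the budget to your needs. I tried values from 32 to 8192 tokens and recommend at least 1024. Note that a lower reasoning budget usually makes the model dumber, because the reasoning does not get enough time to refine the answer properly.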
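If you launch the server from a script, a small guard can keep the budget from dropping below that suggested floor by accident. This is a minimal sketch: `clamp_budget` and `MIN_REASONING_BUDGET` are hypothetical names, and the command is only echoed here rather than executed.

```shell
# Hypothetical guard: enforce a floor on the reasoning budget.
# 1024 is the minimum suggested in the post; override via the environment if needed.
MIN_REASONING_BUDGET="${MIN_REASONING_BUDGET:-1024}"

clamp_budget() {
  requested="$1"
  if [ "$requested" -lt "$MIN_REASONING_BUDGET" ]; then
    echo "$MIN_REASONING_BUDGET"   # too small: fall back to the floor
  else
    echo "$requested"              # large enough: keep as requested
  fi
}

BUDGET="$(clamp_budget 256)"
echo "llama-server --reasoning-budget $BUDGET"
# prints: llama-server --reasoning-budget 1024
```

For deliberate quick tests with a tiny budget, run the script with `MIN_REASONING_BUDGET` set to a lower value.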

Here's how it behaves (with a reasoning budget of 256 for a quick test):

$ llama-cli --fit off \
    --temp 1.0 \
    --top-p 0.95 \
    --top-k 20 \
    --min-p 0.00 \
    -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
    -c $((1024*16)) \
    --no-mmap \
    -ngl 99 \
    --jinja \
    --reasoning-budget 256 \
    --reasoning-budget-message ". Okay, enough thinking. Let's jump to it."
ggml_cuda_init: found 1 ROCm devices (Total VRAM: 98304 MiB):
  Device 0: AMD Radeon Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, VRAM: 98304 MiB (70590 MiB free)
common_download_file_single_online: no previous model file found <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_preset.ini
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
common_download_file_single_online: using cached file (same etag): <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-Q4_K_M.gguf
common_download_file_single_online: using cached file (same etag): <redacted>/unsloth_Qwen3.5-35B-A3B-GGUF_mmproj-F16.gguf

Loading model...

[ASCII-art banner]

build      : b8340-d0b79aaa2
model      : unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M
modalities : text, vision

available commands:
  /exit or Ctrl+C     stop or exit
  /regen              regenerate the last response
  /clear              clear the chat history
  /read               add a text file
  /image <file>       add an image file

> yooo bro sup fam

[Start thinking]
Thinking Process:

1.  **Analyze the Input:**
    *   Text: "yooo bro sup fam"
    *   Tone: Informal, friendly, slang-heavy, casual.
    *   Intent: Greeting, checking in, starting a conversation.
    *   Target Audience: The user addresses me as "bro" and "fam," implying a peer-like relationship.

2.  **Determine the appropriate response:**
    *   Tone: Match the casual, friendly vibe. Don't be too robotic or formal.
    *   Content: Acknowledge the greeting, offer help, keep it light.
    *   Style: Use similar slang or friendly language (within safety guidelines).

3.  **Drafting options:**
    *   Option 1 (Too formal): Hello. I am an AI assistant. How can I help you? (Bad match)
    *   Option 2 (Matching slang): Yo fam! Just chillin. What's good with you? (Better)
    *   Option 3 (Friendly & helpful): Hey! Not much, just ready to help out. What's on your mind? (Good balance)
    *   . Okay, enough thinking. Let's jump to it.
[End thinking]

Yo fam! Just chillin', ready to help out. What's good with you? Anything on your mind or just wanna chat?

[ Prompt: 149.7 t/s | Generation: 49.8 t/s ]

Posted by /u/floconildo