Qwen3.6 GGUF Benchmarks

Reddit r/LocalLLaMA / 4/18/2026


Key Points

  • Unsloth says it ran KLD performance benchmarks on the Qwen3.6-35B-A3B GGUFs to help users pick the quantization (quant) with the best KLD per unit of disk space.
  • They push back on misconceptions about their GGUF updates (that frequent re-uploads stem from their own mistakes, or that CUDA problems are used as excuses), explaining that most updates are driven by external factors and that they disclose information transparently.
  • As a concrete example, Gemma 4 was re-uploaded 4 times: 3 of those stemmed from llama.cpp bug fixes (improvements across multiple PRs), and the 4th from a Google chat template improvement that every provider had to pick up.
  • For MiniMax 2.7, NaNs were observed in multiple quants; Unsloth reports it has already applied a fix, while another provider (Bartowski) is still working on one.

Hey guys, we ran Qwen3.6-35B-A3B GGUF KLD performance benchmarks to help you choose the best quant.

Unsloth quants have the best KLD vs. disk space in 21 of 22 cases on the Pareto frontier.

GGUFs: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
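For intuition, here is a minimal sketch (not Unsloth's actual harness; the quant names, sizes, and KLD values are made up for illustration) of how a KLD-vs-disk-space comparison like this can be computed: per-token KL divergence between the full-precision model's logits and a quant's logits, followed by a Pareto filter over (size, KLD) pairs.

```python
import numpy as np

def mean_kld(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean per-token KL(ref || quant) computed from raw logits.
    Both arrays have shape (n_tokens, vocab_size)."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(ref_logits.astype(np.float64))
    log_q = log_softmax(quant_logits.astype(np.float64))
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())

def pareto_frontier(quants):
    """Return names of quants not dominated on (disk_gb, kld).
    Dominated = another quant is no larger and no worse,
    and strictly better in at least one of the two."""
    names = []
    for name, size, kld in quants:
        dominated = any(
            s <= size and k <= kld and (s < size or k < kld)
            for n, s, k in quants if n != name
        )
        if not dominated:
            names.append(name)
    return names

# Hypothetical sizes (GB) and KLD values, for illustration only.
quants = [("Q2_K", 12.0, 0.25), ("Q4_K_M", 20.0, 0.04),
          ("Q4_0", 21.0, 0.06), ("Q8_0", 35.0, 0.005)]
print(pareto_frontier(quants))  # Q4_0 is dominated by Q4_K_M
```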

We also want to clear up a few misunderstandings around our GGUF updates. Some people have said we re-upload often because of our own mistakes, or that issues like the CUDA 13.2 gibberish bug are just excuses.

We understand the concern, but the reality is that we tend to publicize issues quickly and tell people to update. In roughly 95% of cases, the root causes were out of our hands - we just try to be transparent and keep the community informed.

A few examples:

Gemma 4 was re-uploaded 4 times

Three were due to roughly 10 to 20 llama.cpp bug fixes, some of which we helped investigate and contribute fixes for. The fourth was an official Gemma chat template improvement from Google. Every provider had to update, not just us. See the llama.cpp PRs, which show ~30 fixes/improvements for Gemma 4.

MiniMax 2.7 NaNs

We found NaNs in 38% of Bartowski’s quants (10 of 26) and 22% of ours (5 of 23).

We identified a fix and have already patched ours; see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax_m27_gguf_investigation_fixes_benchmarks/ Bartowski has not patched yet but is actively working on it.
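As a rough illustration of what a NaN sweep can look like, here is a sketch assuming the llama-cpp-python bindings (where recent versions expose per-token logits via the scores array when logits_all=True); the model path and prompt are placeholders, and this is not Unsloth's actual test harness:

```python
import numpy as np
from llama_cpp import Llama  # pip install llama-cpp-python

# Placeholder path: point this at the quant you want to audit.
llm = Llama(model_path="MiniMax-2.7-Q4_K_M.gguf",
            logits_all=True, verbose=False)

tokens = llm.tokenize(b"The quick brown fox jumps over the lazy dog.")
llm.eval(tokens)

# With logits_all=True, llm.scores holds one logit row per token position.
logits = np.asarray(llm.scores)[: llm.n_tokens]
nan_rows = np.isnan(logits).any(axis=-1)
print(f"{nan_rows.sum()} of {len(nan_rows)} token positions contain NaN logits")
```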

Qwen3.5 SSM issues

We shared 7TB of research artifacts showing which layers should not be quantized. The issue was not that providers’ quants were broken, but that they were not optimal - mainly around `ssm_out` and `ssm_*` tensors. We have since improved ours and now lead on KLD vs. disk space for Qwen3.5 as well.
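To check how a given quant handled these tensors yourself, the gguf Python package can read tensor metadata straight from the file; a minimal sketch (the file name is a placeholder):

```python
from gguf import GGUFReader  # pip install gguf

# Placeholder path: any Qwen3.5 GGUF file.
reader = GGUFReader("Qwen3.5-35B-A3B-Q4_K_M.gguf")

# Print the quantization type of each SSM tensor; per the findings
# above, these are the tensors sensitive to low-bit quantization.
for tensor in reader.tensors:
    if "ssm_" in tensor.name:
        print(f"{tensor.name}: {tensor.tensor_type.name}")
```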

Most, if not all, quant providers then took our findings and updated their quants. We talked about our analysis and research at https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwen3535ba3b_unsloth_dynamic_ggufs_benchmarks/ and https://www.reddit.com/r/LocalLLaMA/comments/1rlkptk/final_qwen35_unsloth_gguf_update/

CUDA 13.2 is actually broken

This causes some low-bit quants of all models to produce gibberish. Some people have dismissed it as a non-issue, but NVIDIA has confirmed it's a problem and a fix is coming in CUDA 13.3. See Unsloth issue 4849 and llama.cpp issues 21255 and 21371.

As a temporary workaround, use CUDA 13.1. See https://github.com/ggml-org/llama.cpp/issues/21255#issuecomment-4248403175, a quote from https://github.com/johnnynunez:

The bug was found and fixed in cuda 13.3
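If you're unsure which toolkit your build is using, a quick version check can flag the affected release. This is just a sketch; it assumes nvcc is on your PATH and that the toolkit on PATH matches the one llama.cpp was built against:

```python
import re
import subprocess

# Parse "release X.Y" from `nvcc --version` output.
out = subprocess.run(["nvcc", "--version"],
                     capture_output=True, text=True).stdout
match = re.search(r"release (\d+)\.(\d+)", out)
if match:
    major, minor = map(int, match.groups())
    if (major, minor) == (13, 2):
        print("CUDA 13.2 detected: low-bit quants may emit gibberish; "
              "downgrade to 13.1 or wait for 13.3.")
    else:
        print(f"CUDA {major}.{minor} detected.")
else:
    print("Could not determine CUDA version from nvcc output.")
```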

Thanks again for all the support - we really appreciate it. Hope you all have a great Friday and weekend.

More benchmarks and investigation details here: https://unsloth.ai/docs/models/qwen3.6#unsloth-gguf-benchmarks

submitted by /u/danielhanchen