Qwen3.5-4B GGUF quants comparison (KLD vs speed) - Lunar Lake

Reddit r/LocalLLaMA / 4/6/2026


Key Points

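  • Several GGUF quantizations of Qwen3.5-4B were tested on an Intel Core Ultra “Lunar Lake” laptop (258V, iGPU 140V, 18GB), comparing speed (tk/s) against the quality metric KLD.
  • With KLD ≤ 0.01 as the cutoff, the winners are most Q5_K_S/Q5_K_M quants plus all Q6_K and Q8_0 variants; the Q4 quants (e.g. Q4_0, Q4_K_S) are faster but have a KLD of roughly 0.02 to 0.06.
  • More aggressive quantization (e.g. Q2_K and the various XXS/UD types at the low-precision extreme) tends to produce a much larger KLD, which confirms the quality trade-off.
  • The poster hopes the results from these small quants generalize to bigger models, and the post illustrates the value of choosing a quant from measurements taken on the target machine.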

I wanted to know which type of quant works best on this laptop (Intel 258V, iGPU 140V, 18GB), so I tested all these small quants, hoping the results generalize to bigger models:

Winners in bold (KLD≤0.01)

| Uploader | Quant | tk/s | KLD | GB | KLD/GB* |
|---|---|---|---|---|---|
| mradermacher* | Q4_0 | 28.97 | 0.052659918 | 2.37 | 0.04593 |
| mradermacher_i1 | Q4_0 | 28.89 | 0.059171561 | 2.37 | 0.05162 |
| mradermacher_i1 | IQ3_XXS | 28.59 | 0.177140713 | 1.77 | 0.20736 |
| Unsloth | UD-IQ2_XXS | 28.47 | 0.573673327 | 1.42 | 0.83747 |
| Unsloth | Q4_0 | 28.3 | 0.053431218 | 2.41 | 0.04583 |
| Bartowski | Q4_0 | 28.28 | 0.049796789 | 2.45 | 0.04200 |
| mradermacher | Q4_K_S | 27.74 | 0.050305722 | 2.39 | 0.04350 |
| Unsloth | Q4_K_S | 27.29 | 0.028402815 | 2.41 | 0.02429 |
| Unsloth | UD-IQ3_XXS | 27.03 | 0.146879419 | 1.82 | 0.16718 |
| mradermacher | Q2_K | 26.98 | 0.858648176 | 1.78 | 1.00000 |
| mradermacher_i1 | Q4_K_M | 25.95 | 0.026540567 | 2.52 | 0.02169 |
| mradermacher_i1 | IQ3_XS | 25.89 | 0.147214121 | 1.93 | 0.15800 |
| Unsloth | Q3_K_M | 25.68 | 0.071933741 | 2.14 | 0.06955 |
| mradermacher | Q4_K_M | 25.65 | 0.045641299 | 2.52 | 0.03741 |
| Unsloth | Q4_1 | 25.55 | 0.027891336 | 2.59 | 0.02219 |
| mradermacher_i1 | Q4_1 | 25.37 | 0.026074872 | 2.58 | 0.02081 |
| mradermacher_i1 | Q3_K_M | 25.3 | 0.097725191 | 2.11 | 0.09588 |
| Unsloth | Q4_K_M | 25.24 | 0.025038545 | 2.55 | 0.02022 |
| mradermacher | Q3_K_M | 25.11 | 0.134816481 | 2.11 | 0.13233 |
| Bartowski | Q4_K_M | 25.04 | 0.021567758 | 2.67 | 0.01661 |
| mradermacher_i1 | Q4_K_S | 24.79 | 0.029635327 | 2.39 | 0.02557 |
| mradermacher* | Q5_0 | 24.68 | 0.016011348 | 2.78 | 0.01180 |
| Unsloth | UD-Q2_K_XL | 24.47 | 0.257632552 | 1.81 | 0.29497 |
| Unsloth | UD-Q3_K_XL | 24.28 | 0.060193337 | 2.27 | 0.05484 |
| mradermacher | Q5_K_S | 24.03 | 0.014901354 | 2.78 | 0.01097 |
| mradermacher_i1 | IQ3_M | 24.03 | 0.12177067 | 2.01 | 0.12547 |
| mradermacher | Q3_K_L | 23.84 | 0.13041761 | 2.26 | 0.11950 |
| mradermacher_i1 | Q3_K_L | 23.66 | 0.090757172 | 2.26 | 0.08312 |
| Unsloth | UD-Q4_K_XL | 23.49 | 0.021954506 | 2.71 | 0.01665 |
| mradermacher | Q5_K_M | 23.24 | 0.013006221 | 2.86 | 0.00929 |
| Unsloth | **Q5_K_S** | 23.17 | 0.009194176 | 2.82 | 0.00662 |
| mradermacher_i1 | **Q5_K_S** | 22.78 | 0.009151312 | 2.78 | 0.00668 |
| Unsloth | Q3_K_S | 22.76 | 0.131018266 | 1.96 | 0.13845 |
| Bartowski | **Q5_K_S** | 22.71 | 0.007777943 | 2.91 | 0.00540 |
| mradermacher_i1 | Q3_K_S | 22.71 | 0.154451808 | 1.93 | 0.16578 |
| Unsloth | **Q5_K_M** | 22.46 | 0.008185137 | 2.93 | 0.00565 |
| mradermacher_i1 | **Q5_K_M** | 22.2 | 0.008807971 | 2.86 | 0.00624 |
| mradermacher_i1 | IQ4_NL | 22.11 | 0.035745155 | 2.43 | 0.03036 |
| Unsloth | IQ4_NL | 22.06 | 0.033689086 | 2.4 | 0.02896 |
| mradermacher* | Q5_1 | 22.04 | 0.011970632 | 2.99 | 0.00816 |
| Unsloth | **UD-Q5_K_XL** | 22.01 | 0.008566809 | 3.03 | 0.00572 |
| mradermacher | Q3_K_S | 21.96 | 0.209124569 | 1.93 | 0.22451 |
| Bartowski | **Q5_K_M** | 21.91 | 0.006410029 | 3.09 | 0.00416 |
| mradermacher_i1 | IQ4_XS | 21.61 | 0.043640734 | 2.34 | 0.03853 |
| Unsloth | IQ4_XS | 21.59 | 0.033083008 | 2.31 | 0.02955 |
| mradermacher | IQ4_XS | 21.58 | 0.037995139 | 2.36 | 0.03324 |
| Bartowski | IQ4_XS | 21.26 | 0.036717438 | 2.35 | 0.03225 |
| mradermacher | **Q6_K** | 20.59 | 0.005153856 | 3.23 | 0.00317 |
| mradermacher_i1 | **Q6_K** | 20.3 | 0.005765065 | 3.23 | 0.00356 |
| Unsloth | **Q6_K** | 20.24 | 0.003640111 | 3.28 | 0.00216 |
| Unsloth | UD-IQ2_M | 19.16 | 0.290956558 | 1.64 | 0.36769 |
| Bartowski | **Q6_K** | 19.15 | 0.003466296 | 3.4 | 0.00197 |
| Bartowski | **Q6_K_L** | 18.79 | 0.002772501 | 3.54 | 0.00148 |
| Unsloth | **UD-Q6_K_XL** | 18.5 | 0.002394357 | 3.86 | 0.00114 |
| mradermacher | **Q8_0** | 18.15 | 0.000762229 | 4.17 | 0.00024 |
| mradermacher* | **MXFP4_MOE** | 18.13 | 0.000762229 | 4.17 | 0.00024 |
| Unsloth | **Q8_0** | 18.09 | 0.000778796 | 4.17 | 0.00025 |
| Bartowski | **Q8_0** | 18.08 | 0.000809347 | 4.19 | 0.00026 |
| Unsloth | **UD-Q8_K_XL** | 12.28 | 0.000378562 | 5.54 | 0.00000 |
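
The post doesn't say how the KLD/GB column is computed. The listed values are consistent with KLD divided by file size, then min-max scaled across the whole table, so that the worst row (Q2_K) lands at 1 and the best (UD-Q8_K_XL) at 0. The Python sketch below checks that reading on a handful of rows and applies the KLD ≤ 0.01 cutoff; the formula is an inference from the numbers, and the function names are illustrative rather than anything from the post.

```python
# (uploader, quant, tk/s, KLD, GB) copied from a few rows of the table.
# Q2_K and UD-Q8_K_XL are included because they set the max and min.
ROWS = [
    ("mradermacher*", "Q4_0", 28.97, 0.052659918, 2.37),
    ("Unsloth", "UD-IQ2_XXS", 28.47, 0.573673327, 1.42),
    ("mradermacher", "Q2_K", 26.98, 0.858648176, 1.78),
    ("Unsloth", "Q5_K_S", 23.17, 0.009194176, 2.82),
    ("Bartowski", "Q5_K_M", 21.91, 0.006410029, 3.09),
    ("Unsloth", "UD-Q8_K_XL", 12.28, 0.000378562, 5.54),
]


def normalized_kld_per_gb(rows):
    """Assumed formula: min-max scale KLD/GB over the given rows."""
    raw = [kld / gb for (_, _, _, kld, gb) in rows]
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) for r in raw]


def fastest_winner(rows, max_kld=0.01):
    """Return the highest-tk/s row whose KLD is at or below the cutoff."""
    winners = [r for r in rows if r[3] <= max_kld]
    return max(winners, key=lambda r: r[2]) if winners else None


for row, score in zip(ROWS, normalized_kld_per_gb(ROWS)):
    print(f"{row[0]:<14} {row[1]:<11} {score:.5f}")
print("fastest winner:", fastest_winner(ROWS)[:3])
```

On the full table, the fastest row under the cutoff is also Unsloth Q5_K_S at 23.17 tk/s, since the table is sorted by speed and that is the first row with KLD ≤ 0.01.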

Notes:
- I used ThrottleStop + HWiNFO64 to fix CPU PL1 at 25W, with a 5s cooling delay between benches.
- The KLD came from llama-cpp-python + wikitext-test.txt, with base logits from mradermacher's static BF16.
- Speed is from llama-bench.
- Used -fa 0 -ngl 99 --no-mmap, which makes a speed difference. Setting ctk/ctv was always worse.
- Also used -b 512 -ub 512, which always gave the best PP/TG. Found by scanning: llama-bench.exe -m model.gguf -p 512 -n 128 -b 2048,1024,512,256,128,64,32 -ub 2048,1024,512,256,128,64,32 -fa 0 --mmap 0 -ngl 99
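
For reference, KLD here is presumably the Kullback-Leibler divergence between the BF16 model's next-token distribution and the quantized model's, averaged over token positions. The post doesn't show the script or how the logits were pulled out of llama-cpp-python, so the following is only a minimal pure-Python sketch of the calculation on toy logits, with illustrative function names.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mean_kld(base_logits, quant_logits):
    """Mean KL(P_base || P_quant) over token positions, in nats."""
    total = 0.0
    for base, quant in zip(base_logits, quant_logits):
        log_p = log_softmax(base)
        log_q = log_softmax(quant)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return total / len(base_logits)


# Toy inputs: 2 token positions, vocabulary of 3 (not real model output).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.9]]
print(f"mean KLD: {mean_kld(base, quant):.6f}")
```

A value of 0 means the quantized model reproduces the reference distribution exactly; larger values mean the quant's predictions drift further from BF16.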

* Yellow GGUFs were manually quantized from mradermacher's static quants (he didn't provide the full set). All other GGUFs were downloaded manually. (I also tried llama-quantize's MXFP4_MOE mode but realized afterwards that this model isn't MoE, so it looks like another Q8_0. Would it even have run on Intel?)

submitted by /u/Tryshea