I wanted to know which quant type works best on this laptop (Intel 258V - iGPU 140V 18GB), so I tested all of these small quants, hoping the results generalize to bigger models:
Winners in bold (KLD≤0.01)
| Uploader | Quant | tk/s | KLD | GB | KLD/GB* |
| --- | --- | --- | --- | --- | --- |
| mradermacher* | Q4_0 | 28.97 | 0.052659918 | 2.37 | 0.04593 |
| mradermacher_i1 | Q4_0 | 28.89 | 0.059171561 | 2.37 | 0.05162 |
| mradermacher_i1 | IQ3_XXS | 28.59 | 0.177140713 | 1.77 | 0.20736 |
| Unsloth | UD-IQ2_XXS | 28.47 | 0.573673327 | 1.42 | 0.83747 |
| Unsloth | Q4_0 | 28.3 | 0.053431218 | 2.41 | 0.04583 |
| Bartowski | Q4_0 | 28.28 | 0.049796789 | 2.45 | 0.04200 |
| mradermacher | Q4_K_S | 27.74 | 0.050305722 | 2.39 | 0.04350 |
| Unsloth | Q4_K_S | 27.29 | 0.028402815 | 2.41 | 0.02429 |
| Unsloth | UD-IQ3_XXS | 27.03 | 0.146879419 | 1.82 | 0.16718 |
| mradermacher | Q2_K | 26.98 | 0.858648176 | 1.78 | 1.00000 |
| mradermacher_i1 | Q4_K_M | 25.95 | 0.026540567 | 2.52 | 0.02169 |
| mradermacher_i1 | IQ3_XS | 25.89 | 0.147214121 | 1.93 | 0.15800 |
| Unsloth | Q3_K_M | 25.68 | 0.071933741 | 2.14 | 0.06955 |
| mradermacher | Q4_K_M | 25.65 | 0.045641299 | 2.52 | 0.03741 |
| Unsloth | Q4_1 | 25.55 | 0.027891336 | 2.59 | 0.02219 |
| mradermacher_i1 | Q4_1 | 25.37 | 0.026074872 | 2.58 | 0.02081 |
| mradermacher_i1 | Q3_K_M | 25.3 | 0.097725191 | 2.11 | 0.09588 |
| Unsloth | Q4_K_M | 25.24 | 0.025038545 | 2.55 | 0.02022 |
| mradermacher | Q3_K_M | 25.11 | 0.134816481 | 2.11 | 0.13233 |
| Bartowski | Q4_K_M | 25.04 | 0.021567758 | 2.67 | 0.01661 |
| mradermacher_i1 | Q4_K_S | 24.79 | 0.029635327 | 2.39 | 0.02557 |
| mradermacher* | Q5_0 | 24.68 | 0.016011348 | 2.78 | 0.01180 |
| Unsloth | UD-Q2_K_XL | 24.47 | 0.257632552 | 1.81 | 0.29497 |
| Unsloth | UD-Q3_K_XL | 24.28 | 0.060193337 | 2.27 | 0.05484 |
| mradermacher | Q5_K_S | 24.03 | 0.014901354 | 2.78 | 0.01097 |
| mradermacher_i1 | IQ3_M | 24.03 | 0.12177067 | 2.01 | 0.12547 |
| mradermacher | Q3_K_L | 23.84 | 0.13041761 | 2.26 | 0.11950 |
| mradermacher_i1 | Q3_K_L | 23.66 | 0.090757172 | 2.26 | 0.08312 |
| Unsloth | UD-Q4_K_XL | 23.49 | 0.021954506 | 2.71 | 0.01665 |
| mradermacher | Q5_K_M | 23.24 | 0.013006221 | 2.86 | 0.00929 |
| Unsloth | Q5_K_S | 23.17 | 0.009194176 | 2.82 | 0.00662 |
| mradermacher_i1 | Q5_K_S | 22.78 | 0.009151312 | 2.78 | 0.00668 |
| Unsloth | Q3_K_S | 22.76 | 0.131018266 | 1.96 | 0.13845 |
| Bartowski | Q5_K_S | 22.71 | 0.007777943 | 2.91 | 0.00540 |
| mradermacher_i1 | Q3_K_S | 22.71 | 0.154451808 | 1.93 | 0.16578 |
| Unsloth | Q5_K_M | 22.46 | 0.008185137 | 2.93 | 0.00565 |
| mradermacher_i1 | Q5_K_M | 22.2 | 0.008807971 | 2.86 | 0.00624 |
| mradermacher_i1 | IQ4_NL | 22.11 | 0.035745155 | 2.43 | 0.03036 |
| Unsloth | IQ4_NL | 22.06 | 0.033689086 | 2.4 | 0.02896 |
| mradermacher* | Q5_1 | 22.04 | 0.011970632 | 2.99 | 0.00816 |
| Unsloth | UD-Q5_K_XL | 22.01 | 0.008566809 | 3.03 | 0.00572 |
| mradermacher | Q3_K_S | 21.96 | 0.209124569 | 1.93 | 0.22451 |
| Bartowski | Q5_K_M | 21.91 | 0.006410029 | 3.09 | 0.00416 |
| mradermacher_i1 | IQ4_XS | 21.61 | 0.043640734 | 2.34 | 0.03853 |
| Unsloth | IQ4_XS | 21.59 | 0.033083008 | 2.31 | 0.02955 |
| mradermacher | IQ4_XS | 21.58 | 0.037995139 | 2.36 | 0.03324 |
| Bartowski | IQ4_XS | 21.26 | 0.036717438 | 2.35 | 0.03225 |
| mradermacher | Q6_K | 20.59 | 0.005153856 | 3.23 | 0.00317 |
| mradermacher_i1 | Q6_K | 20.3 | 0.005765065 | 3.23 | 0.00356 |
| Unsloth | Q6_K | 20.24 | 0.003640111 | 3.28 | 0.00216 |
| Unsloth | UD-IQ2_M | 19.16 | 0.290956558 | 1.64 | 0.36769 |
| Bartowski | Q6_K | 19.15 | 0.003466296 | 3.4 | 0.00197 |
| Bartowski | Q6_K_L | 18.79 | 0.002772501 | 3.54 | 0.00148 |
| Unsloth | UD-Q6_K_XL | 18.5 | 0.002394357 | 3.86 | 0.00114 |
| mradermacher | Q8_0 | 18.15 | 0.000762229 | 4.17 | 0.00024 |
| mradermacher* | MXFP4_MOE | 18.13 | 0.000762229 | 4.17 | 0.00024 |
| Unsloth | Q8_0 | 18.09 | 0.000778796 | 4.17 | 0.00025 |
| Bartowski | Q8_0 | 18.08 | 0.000809347 | 4.19 | 0.00026 |
| Unsloth | UD-Q8_K_XL | 12.28 | 0.000378562 | 5.54 | 0.00000 |
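
The KLD/GB column looks like the raw KLD-per-GB ratio, min-max normalized across the whole table, so the worst row (Q2_K) scores 1.0 and the best (UD-Q8_K_XL) scores 0.0. That is an inference from the numbers, not something stated in the post. Here is a minimal Python sketch under that assumption, using four rows copied from the table; the function name `normalized_kld_per_gb` is made up for illustration:

```python
# Rows copied from the table: (uploader, quant, KLD, size in GB).
# The first and last rows are the table's worst and best KLD/GB,
# so min/max over this subset match min/max over the full table.
ROWS = [
    ("mradermacher", "Q2_K", 0.858648176, 1.78),
    ("Unsloth", "UD-IQ2_XXS", 0.573673327, 1.42),
    ("mradermacher*", "Q4_0", 0.052659918, 2.37),
    ("Unsloth", "UD-Q8_K_XL", 0.000378562, 5.54),
]

def normalized_kld_per_gb(rows):
    """Min-max normalize KLD/GB over the rows (assumed method, not confirmed)."""
    raw = [kld / gb for _, _, kld, gb in rows]
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo) for r in raw]

for (uploader, quant, _, _), score in zip(ROWS, normalized_kld_per_gb(ROWS)):
    print(f"{uploader:15s} {quant:12s} {score:.5f}")
```

If the assumption holds, the printed scores should land on the table's values for these rows, give or take rounding of the GB column.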
Notes:
- I used ThrottleStop + HWiNFO64 to fix CPU PL1 at 25W, with a 5s cooling delay between benches.
- The KLD came from llama-cpp-python + wikitext-test.txt, with base logits from mradermacher's static BF16.
- Speed is from llama-bench.
- Used -fa 0 -ngl 99 --no-mmap, which makes a speed difference. Setting ctk/ctv was always worse.
- Also used -b 512 -ub 512, which always gave the best PP/TG. Found by scanning: llama-bench.exe -m model.gguf -p 512 -n 128 -b 2048,1024,512,256,128,64,32 -ub 2048,1024,512,256,128,64,32 -fa 0 --mmap 0 -ngl 99
* Yellow GGUFs (marked with * after the uploader name) were manually quantized from mradermacher's static quants (he didn't provide the full set). All other GGUFs were downloaded manually. (I also tried llama-quantize's MXFP4_MOE mode but realized afterwards that this model isn't MoE, so the result looks like another Q8_0. Would it even have run on Intel?)
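
For anyone wanting to reproduce the KLD numbers, here is a rough Python sketch of the metric itself: the mean per-token KL divergence between the base model's next-token distribution and the quant's. It is not my actual script. It assumes you already have both sets of logits as NumPy arrays of shape (n_tokens, vocab_size) for the same token sequence, and it leaves out how to extract them. The names `log_softmax` and `mean_kld` and the toy data are made up for illustration:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kld(base_logits, quant_logits):
    """Mean per-token KL(P_base || P_quant), in nats.

    Both inputs have shape (n_tokens, vocab_size) and come from
    evaluating the same token sequence with each model.
    """
    log_p = log_softmax(np.asarray(base_logits, dtype=np.float64))
    log_q = log_softmax(np.asarray(quant_logits, dtype=np.float64))
    per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(per_token.mean())

# Toy stand-in data: random "base" logits, plus small noise as a fake quant.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 32))
quant = base + rng.normal(scale=0.05, size=base.shape)

print(mean_kld(base, base))   # identical logits give zero divergence
print(mean_kld(base, quant))  # small positive value for the noisy copy
```

Casting to float64 before the softmax keeps rounding noise well below the smallest KLDs in the table (around 0.0004).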