Qwen3.6-27B IQ4_XS Bloat: Reverting a llama.cpp commit saves 0.4GB (14.7GB vs 15.1GB) on 16GB VRAM cards + KV Cache Tests
With the release of Qwen3.6-27B, I noticed that the current quants have bloated compared to the excellent IQ4_XS quantization (14.7GB) that mradermacher produced for the 3.5 version (Qwen3.5-27B-i1-GGUF). The Qwen3.6 equivalent (Qwen3.6-27B-i1-GGUF) now weighs 15.1GB.
IQ4_XS is a true "unicorn": across benchmarks, it offers an incredible ratio of size to model quality. In practice, it is the only viable option for running a 27B model on 16GB VRAM with a decent context; anything smaller is unsuitable for coding tasks. Unfortunately, the jump from 14.7GB to 15.1GB breaks the experience on 16GB cards.
The Cause & The Fix
The culprit is a specific llama.cpp commit (1dab5f5a44): GitHub link. It hardcodes the attn_qkv layer quantization to a minimum of Q5_K.
To fix this, I modified the source code and replicated the original IQ4_XS layer quantization 1:1. I used the imatrix from mradermacher (Qwen3.6-27B-i1-GGUF) and performed comparative benchmarks. I observed no significant drop in model quality. In my opinion, the mentioned commit is a pure regression for the IQ4_XS format.
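For the curious, here is a minimal sketch of what the revert looks like, modeled on the per-tensor type overrides in llama.cpp's `llama_tensor_get_type()` (src/llama-quant.cpp). Whether the commit touches exactly this spot, under exactly this condition, is my assumption; treat it as illustrative, not the literal diff:

```cpp
// Illustrative fragment of the per-tensor type selection in
// src/llama-quant.cpp (llama_tensor_get_type); condition names assumed.
else if (name.find("attn_qkv.weight") != std::string::npos) {
    if (ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) {
        // Behavior of commit 1dab5f5a44 as described above:
        // new_type = GGML_TYPE_Q5_K;   // forces attn_qkv to >= Q5_K (15.1GB)
        // Reverted: keep attn_qkv at the requested IQ4_XS (14.7GB),
        // matching the 3.5-era quants 1:1.
        new_type = GGML_TYPE_IQ4_XS;
    }
}
```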
My custom 14.7GB model with reverted layers is available here: 👉 cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
Perplexity Benchmarks: 65k Context (-c 65536)
Testing parameters: pg19.txt (downloaded from Project Gutenberg here), --chunks 32, -fa 1, -b 512, -ub 128; -ngl and -ub deviate in some runs (see the command lines below).
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 1 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | q8_0 | 7.3765 ± 0.0276 |
| 2 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.3804 ± 0.0276 |
| 3 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | turbo2 | 7.4260 ± 0.0277 |
| 4 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | turbo3 | 7.4069 ± 0.0277 |
| 5 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q4_0 | q4_0 | 7.3964 ± 0.0277 |
| 6 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | turbo3 | turbo3 | 7.4317 ± 0.0279 |
Command lines for 65k context:
1. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
2. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
3. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv turbo2 -fa 1
4. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv turbo3 -fa 1 -b 512 -ub 128
5. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 128
6. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 128
KV Cache Observations: These tests indicate that for Qwen3.6-27B, the conclusions in turboquant_plus do not apply. There is no significant benefit to increasing K-cache precision at the expense of the V-cache. In fact, for this model, the V-cache appears equally critical.
Perplexity Benchmarks: 110k Context (-c 110000)
Based on the above, I decided to use symmetric turbo3 quantization for the KV cache. Combined with my custom 14.7GB model, this optimization allowed me to achieve 110k context fully within 16GB VRAM. (This took quite a while to test, so I hope you appreciate the data!)
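Before the numbers, a back-of-envelope check on why this fits. Every architecture value below is a placeholder I picked for illustration (read the real ones from the GGUF metadata llama.cpp prints at load time), and the bits-per-element for turbo3 is likewise an assumption:

```cpp
#include <cstdio>

int main() {
    // Placeholder values -- NOT the real Qwen3.6-27B numbers; substitute
    // the n_layer / n_head_kv / head_dim that llama.cpp prints at load.
    const double n_layer   = 48;     // assumed
    const double n_kv_head = 4;      // assumed (GQA)
    const double head_dim  = 128;    // assumed
    const double n_ctx     = 110000; // -c 110000
    const double bpe       = 3.0;    // assumed bits per element for turbo3

    // K and V entries have identical shapes, so each side costs the same
    // VRAM -- one reason symmetric K/V quantization is a natural choice.
    const double bytes = 2.0 /*K+V*/ * n_layer * n_kv_head * head_dim * n_ctx * bpe / 8.0;
    printf("KV cache @ 110k ctx: %.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    // With these placeholders: ~1.89 GiB, roughly what has to squeeze in
    // next to the 14.7GB model (plus compute buffers) on a 16GB card.
}
```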
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 7 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.5205 ± 0.0285 |
| 8 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom, final config) | turbo3 | turbo3 | 7.5758 ± 0.0287 |
| 9 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | turbo3 | turbo3 | 7.5727 ± 0.0287 |
Command lines for 110k context:
7. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 64
8. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
9. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
The Q3 Debate
There are theories floating around that the Q3 quants are good enough. Judge for yourselves:
| ID | Quant | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 10 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | q8_0 | q8_0 | 7.6538 ± 0.0292 |
| 11 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | turbo3 | turbo3 | 7.7085 ± 0.0295 |
Command lines for Q3 tests:
10. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
11. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
