Qwen3.6-27B IQ4_XS FULL VRAM with 110k context

Reddit r/LocalLLaMA / 4/28/2026


Key Points

  • The IQ4_XS quantization of Qwen3.6-27B has bloated from the highly efficient 14.7GB of the comparable Qwen3.5 release to 15.1GB, reducing its practicality on 16GB VRAM setups.
  • The main cause is identified as a specific llama.cpp commit (1dab5f5a44) that hardcodes attn_qkv layer quantization to a minimum of Q5_K.
  • The author modified the source code to restore the original IQ4_XS layer quantization 1:1 and ran comparative benchmarks, confirming that the quality loss is not significant.
  • The reverted 14.7GB custom model (GGUF) has been published, along with perplexity benchmark results at 65k context.

Qwen3.6-27B IQ4_XS Bloat: Reverting a llama.cpp commit restores the 16GB VRAM fit (14.7GB vs 15.1GB) + KV Cache Tests

With the release of Qwen3.6-27B, I noticed that, compared to the excellent IQ4_XS quantization (14.7GB) by mradermacher for the 3.5 version (Qwen3.5-27B-i1-GGUF), the current quants have bloated: the Qwen3.6 equivalent (Qwen3.6-27B-i1-GGUF) now weighs 15.1GB.

The IQ4_XS is a true "unicorn": in all benchmarks, it offers an incredible ratio of size to model quality. In practice, it is the only viable option for running a 27B model on 16GB VRAM with a decent context; anything lower is unsuitable for coding tasks. Unfortunately, the increase from 14.7GB to 15.1GB breaks the experience for 16GB cards: the extra 0.4GB comes straight out of the roughly 1.3GB of headroom a 16GB card would otherwise have left for the KV cache and compute buffers.

The Cause & The Fix The culprit is a specific llama.cpp commit (1dab5f5a44): GitHub link. Its effect is hardcoding attn_qkv layer quantizations to a minimum of Q5_K.

To fix this, I modified the source code and replicated the original IQ4_XS layer quantization 1:1. Using the imatrix from mradermacher (Qwen3.6-27B-i1-GGUF), I performed comparative benchmarks and observed no significant drop in model quality. In my opinion, the mentioned commit is a pure regression for the IQ4_XS format.
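For anyone wanting to reproduce the fix, the rough workflow looks like the sketch below. The author doesn't specify whether the change was a plain git revert or a hand-edit of the quantizer's tensor-type selection, and the input/imatrix filenames are placeholders, so treat this as an outline rather than the exact recipe:

```
# Sketch: undo the attn_qkv Q5_K bump and rebuild the quantizer.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git revert 1dab5f5a44        # assumes the commit still reverts cleanly
cmake -B build
cmake --build build --config Release -j

# Re-quantize from the f16 GGUF using mradermacher's imatrix
# (both filenames below are placeholders).
./build/bin/llama-quantize --imatrix Qwen3.6-27B.imatrix \
    Qwen3.6-27B.f16.gguf Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf IQ4_XS
```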

My custom 14.7GB model with reverted layers is available here: 👉 cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
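If you just want the file, something along these lines should fetch it (repo ID taken from the link above; the exact filenames inside the repo are an assumption, hence the glob):

```
# Pull the custom quant from Hugging Face (glob pattern assumed).
huggingface-cli download cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF \
    --include "*.gguf" --local-dir ./models
```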

Perplexity Benchmarks: 65k Context (-c 65536)

Testing parameters: pg19.txt (downloaded from Project Gutenberg), --chunks 32, -ngl 99, -fa 1, -b 512, -ub 128 (unless noted)

| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|----|------------|----------------------|------|------|-----------|
| 1 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | q8_0 | 7.3765 ± 0.0276 |
| 2 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.3804 ± 0.0276 |
| 3 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | turbo2 | 7.4260 ± 0.0277 |
| 4 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | turbo3 | 7.4069 ± 0.0277 |
| 5 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q4_0 | q4_0 | 7.3964 ± 0.0277 |
| 6 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | turbo3 | turbo3 | 7.4317 ± 0.0279 |

Command lines for 65k context:

  1. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
  2. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
  3. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv turbo2 -fa 1 -b 512 -ub 128
  4. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv turbo3 -fa 1 -b 512 -ub 128
  5. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 128
  6. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 128

KV Cache Observations: These tests indicate that for Qwen3.6-27B, the conclusions from turboquant_plus do not apply: there is no significant benefit to increasing K-cache precision at the expense of V-cache precision. For this model, the V-cache appears equally critical.

Perplexity Benchmarks: 110k Context (-c 110000)

Based on the above, I decided to use symmetric Turbo3 quantization. Combined with my custom 14.7GB model, this optimization allowed me to achieve 110k context fully within 16GB VRAM. (This took quite a while to test, so I hope you appreciate the data!)

| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|----|------------|----------------------|------|------|-----------|
| 7 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.5205 ± 0.0285 |
| 8 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom, selected final configuration) | turbo3 | turbo3 | 7.5758 ± 0.0287 |
| 9 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | turbo3 | turbo3 | 7.5727 ± 0.0287 |

Command lines for 110k context:
7. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 64
8. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
9. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
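For actual use rather than perplexity testing, the selected final configuration (run 8) would translate to roughly the following llama-server invocation; host, port, and sampler settings are left at their defaults and are up to you:

```
# Sketch: serve the 14.7GB custom quant with 110k context and a symmetric
# turbo3 KV cache, mirroring the flags from run 8 above.
./llama-server -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf \
    -c 110000 -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3
```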

The Q3 Debate

There are theories floating around that the Q3 model is fine. Judge for yourselves:

| ID | Quant | Model File / Version | -ctk | -ctv | Final PPL |
|----|-------|----------------------|------|------|-----------|
| 10 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | q8_0 | q8_0 | 7.6538 ± 0.0292 |
| 11 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | turbo3 | turbo3 | 7.7085 ± 0.0295 |

Command lines for Q3 tests:
10. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
11. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256

submitted by /u/Pablo_the_brave