Qwen3.6-27B IQ4_XS Bloat: Reverting a llama.cpp commit saves 0.4GB (14.7GB vs 15.1GB) on 16GB VRAM cards + KV Cache Tests
With the release of Qwen3.6-27B, I noticed that the current quants have bloated compared to the excellent IQ4_XS quantization (14.7GB) that mradermacher produced for the 3.5 version (Qwen3.5-27B-i1-GGUF). The Qwen3.6 equivalent (Qwen3.6-27B-i1-GGUF) now weighs 15.1GB.
IQ4_XS is a true "unicorn": across benchmarks, it offers an incredible ratio of size to model quality. In practice, it is the only viable option for running a 27B model on 16GB VRAM with a decent context; anything smaller is unsuitable for coding tasks. Unfortunately, the jump from 14.7GB to 15.1GB breaks the experience on 16GB cards.
The Cause & The Fix
The culprit is a specific llama.cpp commit (1dab5f5a44): GitHub link. It hardcodes the attn_qkv layer quantization to a minimum of Q5_K.
To fix this, I modified the source code and replicated the original IQ4_XS layer quantization 1:1. I used the imatrix from mradermacher (Qwen3.6-27B-i1-GGUF) and performed comparative benchmarks. I observed no significant drop in model quality. In my opinion, the mentioned commit is a pure regression for the IQ4_XS format.
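For the curious, here is a minimal sketch of what the revert looks like, modeled on the per-tensor type overrides in llama.cpp's `llama_tensor_get_type()` (src/llama-quant.cpp). Whether the commit touches exactly this spot, under exactly this condition, is my assumption; treat it as illustrative, not the literal diff:

```cpp
// Illustrative fragment of the per-tensor type selection in
// src/llama-quant.cpp (llama_tensor_get_type); condition names assumed.
else if (name.find("attn_qkv.weight") != std::string::npos) {
    if (ftype == LLAMA_FTYPE_MOSTLY_IQ4_XS) {
        // Behavior of commit 1dab5f5a44 as described above:
        // new_type = GGML_TYPE_Q5_K;   // forces attn_qkv to >= Q5_K (15.1GB)
        // Reverted: keep attn_qkv at the requested IQ4_XS (14.7GB),
        // matching the 3.5-era quants 1:1.
        new_type = GGML_TYPE_IQ4_XS;
    }
}
```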
My custom 14.7GB model with reverted layers is available here: 👉 cHunter789/Qwen3.6-27B-i1-IQ4_XS-GGUF
Perplexity Benchmarks: 65k Context (-c 65536)
Testing parameters: pg19.txt (downloaded from Project Gutenberg here), --chunks 32, -fa 1, -b 512, -ub 128; -ngl and -ub deviate in some runs (see the command lines below).
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 1 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | q8_0 | 7.3765 ± 0.0276 |
| 2 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.3804 ± 0.0276 |
| 3 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | turbo2 | 7.4260 ± 0.0277 |
| 4 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | q8_0 | turbo3 | 7.4069 ± 0.0277 |
| 5 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q4_0 | q4_0 | 7.3964 ± 0.0277 |
| 6 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | turbo3 | turbo3 | 7.4317 ± 0.0279 |
Command lines for 65k context:
1. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
2. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
3. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl -1 -ctk q8_0 -ctv turbo2 -fa 1
4. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q8_0 -ctv turbo3 -fa 1 -b 512 -ub 128
5. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk q4_0 -ctv q4_0 -fa 1 -b 512 -ub 128
6. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 65536 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 128
KV Cache Observations: These tests indicate that for Qwen3.6-27B, the conclusions in turboquant_plus do not apply. There is no significant benefit to increasing K-cache precision at the expense of the V-cache. In fact, for this model, the V-cache appears equally critical.
Perplexity Benchmarks: 110k Context (-c 110000)
Based on the above, I decided to use symmetric turbo3 quantization for the KV cache. Combined with my custom 14.7GB model, this optimization allowed me to achieve 110k context fully within 16GB VRAM. (This took quite a while to test, so I hope you appreciate the data!)
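Before the numbers, a back-of-envelope check on why this fits. Every architecture value below is a placeholder I picked for illustration (read the real ones from the GGUF metadata llama.cpp prints at load time), and the bits-per-element for turbo3 is likewise an assumption:

```cpp
#include <cstdio>

int main() {
    // Placeholder values -- NOT the real Qwen3.6-27B numbers; substitute
    // the n_layer / n_head_kv / head_dim that llama.cpp prints at load.
    const double n_layer   = 48;     // assumed
    const double n_kv_head = 4;      // assumed (GQA)
    const double head_dim  = 128;    // assumed
    const double n_ctx     = 110000; // -c 110000
    const double bpe       = 3.0;    // assumed bits per element for turbo3

    // K and V entries have identical shapes, so each side costs the same
    // VRAM -- one reason symmetric K/V quantization is a natural choice.
    const double bytes = 2.0 /*K+V*/ * n_layer * n_kv_head * head_dim * n_ctx * bpe / 8.0;
    printf("KV cache @ 110k ctx: %.2f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    // With these placeholders: ~1.89 GiB, roughly what has to squeeze in
    // next to the 14.7GB model (plus compute buffers) on a 16GB card.
}
```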
| ID | Model Size | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 7 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom) | q8_0 | q8_0 | 7.5205 ± 0.0285 |
| 8 | 14.7GB | ...-IQ4_XS-attn_qkv-IQ4_XS.gguf (Custom, final config) | turbo3 | turbo3 | 7.5758 ± 0.0287 |
| 9 | 15.1GB | Qwen3.6-27B.i1-IQ4_XS.gguf (Standard) | turbo3 | turbo3 | 7.5727 ± 0.0287 |
Command lines for 110k context:
7. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 64
8. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS-attn_qkv-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
9. ./llama-perplexity -m Qwen3.6-27B.i1-IQ4_XS.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
The Q3 Debate
There are theories floating around that the Q3 quants are good enough. Judge for yourselves:
| ID | Quant | Model File / Version | -ctk | -ctv | Final PPL |
|---|---|---|---|---|---|
| 10 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | q8_0 | q8_0 | 7.6538 ± 0.0292 |
| 11 | Q3_K_L | Qwen3.6-27B.i1-Q3_K_L.gguf | turbo3 | turbo3 | 7.7085 ± 0.0295 |
Command lines for Q3 tests:
10. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl -1 -ctk q8_0 -ctv q8_0 -fa 1 -b 512 -ub 128
11. ./llama-perplexity -m Qwen3.6-27B.i1-Q3_K_L.gguf -f pg19.txt -c 110000 --chunks 32 -ngl 99 -ctk turbo3 -ctv turbo3 -fa 1 -b 512 -ub 256
