I've been waiting for sokann to drop his Qwen 3.6 GGUF for 16 GB GPUs, as his Qwen 3.5 was my GGUF of choice. I tried cHunter789's Qwen3.6-27B-i1-IQ4_XS-GGUF that was posted yesterday, but could only achieve a context window of 30,000 while staying in VRAM.
With the same launch settings, I can achieve a 50,000 context window with this GGUF, which is quite the increase. You Linux/headless guys should be able to get some more out of it too.
The Hugging Face model card shows that this quant is the most VRAM-efficient option at just 4.256 BPW (~13.3 GB), with average perplexity nearly identical to the others (6.99 vs ~6.95–7.02). The fidelity metrics do show it has measurably higher probability distortion (RMS Δp ~6.7% vs ~4.3%, top-p match ~90.3% vs ~94%), but these gaps are modest and typical of aggressive 4-bit compression.
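If you want to sanity-check context math like this yourself, here's a rough estimator. To be clear, every architecture number in it (layer count, KV heads, head dim) and the overhead figure are placeholder assumptions — I haven't seen the 27B's config published — so it won't reproduce my 50,000 exactly; pull the real values from the GGUF metadata and adjust for your KV cache type.

```python
# Back-of-envelope: how much context fits after the weights are loaded.
# ALL architecture numbers here are placeholder ASSUMPTIONS, not the
# real Qwen3.6-27B config -- read the actual values from the GGUF
# metadata before trusting the output.

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # K and V each store n_kv_heads * head_dim values per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

vram_gib     = 16.0   # whole card (headless); subtract more for a desktop
weights_gib  = 13.3   # from the model card for this quant
overhead_gib = 1.0    # assumption: compute buffers, CUDA context, etc.

per_token = kv_bytes_per_token(n_layers=48, n_kv_heads=8, head_dim=128,
                               bytes_per_elem=1)  # 1 byte = q8_0 KV cache

free_bytes = (vram_gib - weights_gib - overhead_gib) * 1024**3
print(f"KV cache: {per_token / 1024**2:.3f} MiB/token")
print(f"max context: ~{int(free_bytes / per_token)} tokens")
```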
I've posted my launch arguments here if you want to take a look.
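For anyone who doesn't want to click through, the general shape is something like this minimal launcher sketch. The model filename, port, and KV cache types here are illustrative assumptions, not my exact settings, and flag spellings vary between llama.cpp builds (flash attention in particular), so check `llama-server --help` on yours.

```python
# Minimal llama-server launch sketch -- illustrative, NOT my exact flags.
# Verify flag spellings against your build's `llama-server --help`
# (the flash-attention flag in particular has changed across versions).
import subprocess

cmd = [
    "llama-server",
    "-m", "Qwen3.6-27B-IQ4_XS.gguf",  # assumption: local filename
    "-c", "50000",                     # context window
    "-ngl", "99",                      # offload every layer to the GPU
    "-ctk", "q8_0",                    # quantized K cache
    "-ctv", "q8_0",                    # quantized V cache
    "--port", "8080",
]
subprocess.run(cmd, check=True)
```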
Does anyone know if I'd be better off sticking with Qwen3.6-35B-A3B Q6_K over this lower quant of a dense model? The MoE has the advantage of a larger usable context window, because spilling into system RAM doesn't destroy its performance: only ~3B parameters are active per token, so far less weight data has to stream over the slow RAM path than with a dense model. But if the dense quant is likely better, I can use it for small tasks and switch back to the 35B when I need the larger context.
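For intuition, here's the back-of-envelope version of that tradeoff. Decode speed is roughly memory-bandwidth-bound, so a token costs about one pass over the active weights; the bandwidth figures below are assumptions for a typical desktop, and this ignores compute, caching, and where the shared layers live.

```python
# Why RAM spillage hurts the A3B MoE far less than a dense model:
# each decoded token has to stream roughly the bytes of the ACTIVE
# weights once. Bandwidth numbers are ASSUMPTIONS for a typical box.

def tok_per_s(active_params_b, bpw, bandwidth_gb_s):
    bytes_per_token = active_params_b * 1e9 * bpw / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 27B at 4.256 BPW, fully resident in VRAM (~500 GB/s card):
print(f"dense in VRAM: {tok_per_s(27, 4.256, 500):.0f} t/s")
# The same dense model streamed from system RAM (~60 GB/s):
print(f"dense in RAM : {tok_per_s(27, 4.256, 60):.1f} t/s")
# MoE, ~3B active params at Q6_K (~6.56 BPW), experts spilled to RAM:
print(f"A3B from RAM : {tok_per_s(3, 6.56, 60):.0f} t/s")
```

Under those assumptions the spilled MoE still lands around 24 t/s while a spilled dense model crawls at ~4 t/s, which is the whole case for letting the 35B-A3B grow its context past VRAM.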
Also, they made a Qwen3.6-27B-GGUF-5.076bpw for 24 GB cards if anyone wants to give that a look.