Qwen 3.6 27B IQ4_XS - 22 tp/s on RTX 5060 Ti 16GB, 24k ctx

Reddit r/LocalLLaMA / 4/24/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A user reports running Qwen 3.6 27B IQ4_XS in llama-server, achieving about 22 tokens per second on a 16GB RTX 5060 Ti.
  • The model can reportedly reach up to 24k context length; beyond 8k context, KV-cache quantization at q4_0 is required, as higher-precision KV quants no longer fit.
  • Tuning the -ub and -b runtime parameters affects the ceiling: values of 256 allowed up to 16k context, while 128 (plus ~300 MiB freed by disabling GNOME) reached the 24k limit.
  • Despite loading roughly 63/65 layers at the selected quantization, the user considers the quality/performance tradeoff acceptable for Q4 quantization.
  • The quantized GGUF file used was produced via Unsloth from a Hugging Face source link.

Maybe this will be helpful for someone:
llama-server -m '/Qwen3.6-27B/Qwen3.6-27B-IQ4_XS.gguf' -ngl 999 -ctk q4_0 -ctv q4_0 -b 128 -ub 128 -c 24000
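For anyone curious why q4_0 KV cache is what makes 24k ctx fit, here is a rough back-of-the-envelope sketch in Python. The layer count, KV head count, and head dimension below are illustrative assumptions, not confirmed Qwen 3.6 27B values:

```python
# Rough KV-cache size estimate for a llama.cpp-style cache.
# ASSUMED dimensions for illustration (not confirmed Qwen 3.6 27B values):
N_LAYERS = 64      # assumption
N_KV_HEADS = 8     # assumption (GQA)
HEAD_DIM = 128     # assumption

# Bytes per element: f16 = 2 bytes; q4_0 packs 32 elements into 18 bytes (~4.5 bits/elem).
BYTES_PER_ELEM = {"f16": 2.0, "q4_0": 18 / 32}

def kv_cache_bytes(ctx: int, cache_type: str) -> int:
    """Total K+V cache size in bytes for a given context length."""
    elems = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * ctx  # 2 = K and V tensors
    return int(elems * BYTES_PER_ELEM[cache_type])

for ctx in (8192, 16384, 24000):
    f16 = kv_cache_bytes(ctx, "f16") / 2**30
    q40 = kv_cache_bytes(ctx, "q4_0") / 2**30
    print(f"ctx={ctx:>6}: f16 {f16:.2f} GiB vs q4_0 {q40:.2f} GiB")
```

With these assumed dims, quantizing K and V to q4_0 cuts the cache to roughly a quarter of f16, which is what makes the jump from ~8k to 24k ctx plausible on a 16GB card.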

Can't run this model with higher KV quants at >8192 ctx size.
Setting -ub and -b to 256 allowed me a max of 16384 ctx.

The max ctx size I get is 24k. Disabling GNOME let me use an additional 300 MiB.

It's kinda nice, but I know that's of limited use in many cases.

This GPU loads 63/65 layers at this quant without quantizing the context. But it's still Q4, so I think that's good enough.
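A quick sketch of the 63/65-layer math. The ~14.2 GB file size and equal-sized layers are assumptions for illustration (typical of IQ4_XS quants of a ~27B model), not measured numbers:

```python
# Rough estimate of how many model layers fit in a given VRAM budget.
# ASSUMPTIONS for illustration: ~14.2 GB IQ4_XS file, 65 offloadable layers,
# and all layers treated as equal size (real layers vary somewhat).
MODEL_BYTES = int(14.2e9)   # assumption: approximate IQ4_XS file size
N_LAYERS = 65

def layers_that_fit(vram_budget_bytes: int) -> int:
    """Layers that fit in the VRAM left over after KV cache, compute buffers, and display."""
    per_layer = MODEL_BYTES / N_LAYERS  # crude: assumes uniform layer size
    return min(N_LAYERS, int(vram_budget_bytes // per_layer))

# e.g. ~13.8 GB of the 16 GB card left for weights after caches/buffers:
print(layers_that_fit(int(13.8e9)))  # → 63
```

Under these assumptions, roughly 13.8 GB of weight budget lands on 63 of 65 layers, which lines up with the reported split; the remaining layers run on CPU.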

I used unsloth quant: https://huggingface.co/unsloth/Qwen3.6-27B-GGUF?show_file_info=Qwen3.6-27B-IQ4_XS.gguf

submitted by /u/BazzyIm