Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19

Reddit r/LocalLLaMA / 4/26/2026


Key Points

  • The post reports achieving 105–108 tokens per second (100+ tps) using the Qwen3.6-27B-INT4 AutoRound model with a native 256k context window.
  • The setup runs on a single RTX 5090 GPU using vLLM 0.19 and focuses on configuration choices that maintain full 256k-length performance.
  • It highlights that MTP is supported and that KLD quality is described as good, especially compared with NVFP4, while also benefiting from the smaller quantized model size.
  • The author notes they did not apply TQ because the model already reaches the maximum native context length without it.
  • A detailed vLLM launch configuration is provided, including FlashInfer attention backend, fp8_e4m3 KV cache dtype, auto_round quantization, and MTP speculative decoding parameters.
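The launch config below sets `"num_speculative_tokens":3` for MTP. As a rough intuition for why a small draft length like 3 helps, here is a generic speculative-decoding estimate of tokens emitted per target-model step, assuming each draft token is accepted independently with probability `p`. This is an illustrative model, not a measurement of vLLM's MTP implementation, and the acceptance rates are made-up examples.

```python
# Expected tokens emitted per target-model forward pass with k speculative
# tokens, assuming i.i.d. per-token acceptance probability p.
# Generic speculative-decoding arithmetic; not vLLM-specific.

def expected_tokens_per_step(p: float, k: int) -> float:
    # One token is always produced (the target model's own sample),
    # plus the expected run of consecutively accepted draft tokens:
    # 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p)
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.6, 0.8, 0.9):
    print(f"p={p}: {expected_tokens_per_step(p, 3):.2f} tokens/step")
```

With a decent draft acceptance rate, 3 speculative tokens can nearly triple the tokens produced per target-model step, which is consistent with MTP being a big part of the reported speedup.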

Thanks to the community, Qwen3.6-27B speed keeps getting better. The following improves on my recipe from yesterday and delivers 100+ tps (TG).

Model: https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound

- MTP supported

- KLD is decent (much better than NVFP4 per the linked post) with the benefit of being the smallest model

- The smaller model size allows for full native 256k context window
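To see why the fp8 KV cache (set via `--kv-cache-dtype "fp8_e4m3"` below) matters at 256k context, here is back-of-the-envelope KV-cache sizing. The GQA geometry used (layer count, KV heads, head dim) is an assumed illustrative shape, not the real Qwen3.6-27B config; only the 262144-token context length comes from the launch config.

```python
# Rough KV-cache sizing for one sequence at long context.
# ASSUMED geometry for illustration: 48 layers, 4 KV heads, head_dim 128.

def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Bytes to cache both K and V for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

CTX = 262_144  # --max-model-len from the launch config

fp8 = kv_cache_bytes(CTX, 48, 4, 128, 1)   # fp8_e4m3: 1 byte/element
fp16 = kv_cache_bytes(CTX, 48, 4, 128, 2)  # fp16:     2 bytes/element
print(f"fp8:  {fp8 / 2**30:.1f} GiB")   # fp8:  12.0 GiB
print(f"fp16: {fp16 / 2**30:.1f} GiB")  # fp16: 24.0 GiB
```

Under these assumed shapes, fp8 halves the KV footprint, which is the headroom that lets an INT4 27B model plus a full 256k cache coexist on a single 32 GB card.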

Tokens per second (TG): 105-108 tps

Special credit to this post, which helped me discover the Lorbus quant: https://www.reddit.com/r/Olares/comments/1svg2ad/qwen3627b_at_85100_ts_on_a_24gb_rtx_5090_laptop/

Note that I didn't mess with TQ in my setup, since I can already hit the model's native max context length without it.

vLLM launch config:

```shell
args=(
  vllm serve "/root/autodl-tmp/llm-models"
  --max-model-len "262144"
  --gpu-memory-utilization "0.93"
  --attention-backend flashinfer
  --performance-mode interactivity
  --language-model-only
  --kv-cache-dtype "fp8_e4m3"
  --max-num-seqs "2"
  --skip-mm-profiling
  --quantization auto_round
  --reasoning-parser qwen3
  --enable-auto-tool-choice
  --enable-prefix-caching
  --enable-chunked-prefill
  --tool-call-parser qwen3_coder
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  --host "0.0.0.0"
  --port "6006"
)
```
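Once the server is up, it exposes vLLM's OpenAI-compatible API on the configured port. Below is a minimal stdlib-only client sketch against that endpoint; the model path and port come from the launch config above, while the prompt and the simple throughput helper are my own illustrative additions. Note the elapsed-time tps here is a rough end-to-end number that includes prefill, so it will read lower than pure TG.

```python
import json
import time
import urllib.request

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Chat-completions payload for vLLM's OpenAI-compatible server."""
    return {
        "model": "/root/autodl-tmp/llm-models",  # served model path from the launch config
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def tg_tps(completion_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: completion tokens per wall-clock second."""
    return completion_tokens / elapsed_s

def query(prompt: str, host: str = "127.0.0.1", port: int = 6006):
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    return body, tg_tps(body["usage"]["completion_tokens"], elapsed)

# Example usage (requires the server above to be running):
# body, tps = query("Summarize the benefits of prefix caching.")
# print(body["choices"][0]["message"]["content"], tps)
```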

submitted by /u/Kindly-Cantaloupe978