Qwen3.6-27B-INT4 clocking 100 tps with 256k context length on 1x RTX 5090 via vllm 0.19

Reddit r/LocalLLaMA / 4/26/2026


Key Points

  • The post reports achieving 105–108 tokens per second (100+ tps) using the Qwen3.6-27B-INT4 AutoRound model with a native 256k context window.
  • The setup runs on a single RTX 5090 GPU using vLLM 0.19 and focuses on configuration choices that maintain full 256k-length performance.
  • It highlights that MTP is supported and that KLD quality is described as good, especially compared with NVFP4, while also benefiting from the smaller quantized model size.
  • The author notes they did not apply TQ because the model already reaches the maximum native context length without it.
  • A detailed vLLM launch configuration is provided, including FlashInfer attention backend, fp8_e4m3 KV cache dtype, auto_round quantization, and MTP speculative decoding parameters.
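The launch config below sets `"num_speculative_tokens":3` for MTP. As a rough intuition for why a small draft length like 3 helps, here is a generic speculative-decoding estimate of tokens emitted per target-model step, assuming each draft token is accepted independently with probability `p`. This is an illustrative model, not a measurement of vLLM's MTP implementation, and the acceptance rates are made-up examples.

```python
# Expected tokens emitted per target-model forward pass with k speculative
# tokens, assuming i.i.d. per-token acceptance probability p.
# Generic speculative-decoding arithmetic; not vLLM-specific.

def expected_tokens_per_step(p: float, k: int) -> float:
    # One token is always produced (the target model's own sample),
    # plus the expected run of consecutively accepted draft tokens:
    # 1 + p + p^2 + ... + p^k = (1 - p^(k+1)) / (1 - p)
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.6, 0.8, 0.9):
    print(f"p={p}: {expected_tokens_per_step(p, 3):.2f} tokens/step")
```

With a decent draft acceptance rate, 3 speculative tokens can nearly triple the tokens produced per target-model step, which is consistent with MTP being a big part of the reported speedup.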

Thanks to the community, Qwen3.6-27B speed keeps getting better. The following improves on my recipe from yesterday and delivers 100+ tps (TG).

Model: https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound

- MTP supported

- KLD is decent (much better than NVFP4 per the linked post) with the benefit of being the smallest model

- The smaller model size allows for full native 256k context window
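To see why the fp8 KV cache (set via `--kv-cache-dtype "fp8_e4m3"` below) matters at 256k context, here is back-of-the-envelope KV-cache sizing. The GQA geometry used (layer count, KV heads, head dim) is an assumed illustrative shape, not the real Qwen3.6-27B config; only the 262144-token context length comes from the launch config.

```python
# Rough KV-cache sizing for one sequence at long context.
# ASSUMED geometry for illustration: 48 layers, 4 KV heads, head_dim 128.

def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Bytes to cache both K and V for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

CTX = 262_144  # --max-model-len from the launch config

fp8 = kv_cache_bytes(CTX, 48, 4, 128, 1)   # fp8_e4m3: 1 byte/element
fp16 = kv_cache_bytes(CTX, 48, 4, 128, 2)  # fp16:     2 bytes/element
print(f"fp8:  {fp8 / 2**30:.1f} GiB")   # fp8:  12.0 GiB
print(f"fp16: {fp16 / 2**30:.1f} GiB")  # fp16: 24.0 GiB
```

Under these assumed shapes, fp8 halves the KV footprint, which is the headroom that lets an INT4 27B model plus a full 256k cache coexist on a single 32 GB card.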

Tokens per second (TG): 105-108 tps

Special credit to this post, which helped me discover the Lorbus quant: https://www.reddit.com/r/Olares/comments/1svg2ad/qwen3627b_at_85100_ts_on_a_24gb_rtx_5090_laptop/

Note that I didn't mess with TQ in my setup, since I can already hit the model's native max context length without it.

vLLM launch config:

```shell
args=(
  vllm serve "/root/autodl-tmp/llm-models"
  --max-model-len "262144"
  --gpu-memory-utilization "0.93"
  --attention-backend flashinfer
  --performance-mode interactivity
  --language-model-only
  --kv-cache-dtype "fp8_e4m3"
  --max-num-seqs "2"
  --skip-mm-profiling
  --quantization auto_round
  --reasoning-parser qwen3
  --enable-auto-tool-choice
  --enable-prefix-caching
  --enable-chunked-prefill
  --tool-call-parser qwen3_coder
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  --host "0.0.0.0"
  --port "6006"
)
```
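Once the server is up, it exposes vLLM's OpenAI-compatible API on the configured port. Below is a minimal stdlib-only client sketch against that endpoint; the model path and port come from the launch config above, while the prompt and the simple throughput helper are my own illustrative additions. Note the elapsed-time tps here is a rough end-to-end number that includes prefill, so it will read lower than pure TG.

```python
import json
import time
import urllib.request

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Chat-completions payload for vLLM's OpenAI-compatible server."""
    return {
        "model": "/root/autodl-tmp/llm-models",  # served model path from the launch config
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def tg_tps(completion_tokens: int, elapsed_s: float) -> float:
    """Generation throughput: completion tokens per wall-clock second."""
    return completion_tokens / elapsed_s

def query(prompt: str, host: str = "127.0.0.1", port: int = 6006):
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    elapsed = time.monotonic() - start
    return body, tg_tps(body["usage"]["completion_tokens"], elapsed)

# Example usage (requires the server above to be running):
# body, tps = query("Summarize the benefits of prefix caching.")
# print(body["choices"][0]["message"]["content"], tps)
```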

submitted by /u/Kindly-Cantaloupe978