Thanks to the community, Qwen3.6-27B keeps getting faster. The following improves on my recipe from yesterday and delivers a whopping 100+ tps (TG).
Model: https://huggingface.co/Lorbus/Qwen3.6-27B-int4-AutoRound
- MTP supported
- KLD is decent (much better than NVFP4 per the linked post) with the benefit of being the smallest model
- The smaller model size allows for full native 256k context window
Tokens per second (TG): 105-108 tps
Special credit to this post, which helped me discover the Lorbus quant: https://www.reddit.com/r/Olares/comments/1svg2ad/qwen3627b_at_85100_ts_on_a_24gb_rtx_5090_laptop/
Note that I didn't mess with TQ in my setup, since I can already hit the model's native max context length without it.
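As a rough sanity check on why the full 256k context can fit, here's a back-of-the-envelope KV-cache estimate. The layer count, KV-head count, and head dim below are placeholders, NOT the actual Qwen3.6-27B architecture; fp8_e4m3 is 1 byte per value, and real allocations add scale factors and other overhead on top:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value):
    """Rough KV-cache size: one K and one V value per layer, per KV head,
    per head dim, per token. Ignores fp8 scale factors and paging overhead."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Placeholder architecture numbers (illustrative only, not the real config):
size = kv_cache_bytes(tokens=262144, layers=48, kv_heads=8,
                      head_dim=128, bytes_per_value=1)  # fp8 = 1 byte/value
print(f"{size / 2**30:.1f} GiB")  # → 24.0 GiB with these placeholder numbers
```

The point is just that an fp8 KV cache halves this footprint versus fp16, which is what buys headroom for the full native context on a single GPU.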
vLLM launch config:
args=(
  vllm serve "/root/autodl-tmp/llm-models"
  --max-model-len "262144"
  --gpu-memory-utilization "0.93"
  --attention-backend flashinfer
  --performance-mode interactivity
  --language-model-only
  --kv-cache-dtype "fp8_e4m3"
  --max-num-seqs "2"
  --skip-mm-profiling
  --quantization auto_round
  --reasoning-parser qwen3
  --enable-auto-tool-choice
  --enable-prefix-caching
  --enable-chunked-prefill
  --tool-call-parser qwen3_coder
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
  --host "0.0.0.0"
  --port "6006"
)
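To see why the MTP setting (`num_speculative_tokens: 3`) helps TG so much, here's a simplified model of speculative decoding throughput. It assumes each draft token is accepted independently with a fixed probability, which is a crude approximation; the 80% acceptance rate below is a hypothetical number for illustration, not a measured one:

```python
def expected_tokens_per_step(accept_rate, num_spec_tokens):
    """Simplified speculative-decoding model: the target model always emits
    at least one token per step; each of the num_spec_tokens draft tokens is
    accepted independently with probability accept_rate, and the step stops
    at the first rejection. Expected tokens = sum of accept_rate**i."""
    return sum(accept_rate ** i for i in range(num_spec_tokens + 1))

# With num_speculative_tokens=3 and a hypothetical 80% acceptance rate,
# each engine step yields close to 3 tokens instead of 1:
print(round(expected_tokens_per_step(0.8, 3), 3))  # → 2.952
```

Under this toy model, a high acceptance rate turns 3 speculative tokens into a near-3x reduction in decode steps, which lines up with TG jumping into the 100+ tps range.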