Qwen3.5-27B on RTX 5090 served via vLLM @ 77 tps

Reddit r/LocalLLaMA / 4/21/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A Reddit user reports successfully running Qwen3.5-27B on an RTX 5090 locally with vLLM, achieving very strong throughput (~77 tps) and supporting a 218k context window.
  • They note that full 256k context window support was not achievable on vLLM 0.19 with their setup, while vLLM 0.17 worked but delivered lower tps due to fewer optimizations.
  • The setup relies on specific guidance from a Hugging Face model card plus a critical vLLM patch to fix KV cache size calculations (ref. vLLM PR #36325).
  • The provided vLLM serving configuration includes key performance/compatibility flags such as flashinfer attention backend, FP8 KV cache dtype, auto tool choice, prefix caching, and quantization via modelopt, and supports up to 2 concurrent sequences with expected per-session slowdown.
  • The user also cautions that one tested model variant did not work well, recommending a particular Qwen3.5-27B Text NVFP4 MTP checkpoint that has the tradeoff of lacking image processing.

After maxing out my Cursor $20 sub and zai $10 sub for this month, I have resorted to a local LLM setup. Got a good outcome on an RTX 5090 running Qwen3.5-27B and achieved very good tps, with the context window at 218k. It can even run 2 concurrent sessions with this config, although per-session speed drops as expected. For some reason I can't get it to work at the full 256k context window on vLLM 0.19; it works on vLLM 0.17 per the guide below, but tps suffers since 0.17 apparently lacks many of the optimizations that 0.19 has.

Recipe:

vLLM 0.19 (see the recipe at https://huggingface.co/mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4); note that this model didn't work very well in my testing, so I don't recommend using it, but the guide in the model card is quite useful.

Patch to fix KV cache size calculations for vLLM: https://github.com/vllm-project/vllm/pull/36325 (**this is super critical)
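Since that patch is about how vLLM sizes the KV cache, a back-of-envelope estimate of cache memory is a useful sanity check before picking `--max-model-len`. This is a generic GQA sizing sketch; the layer/head numbers below are placeholders for illustration, not the actual Qwen3.5-27B architecture (read the real values from the model's config.json).

```python
# Back-of-envelope KV cache sizing: 2 tensors (K and V) per layer,
# each n_kv_heads * head_dim elements per token. With fp8_e4m3 each
# element is 1 byte. The layer/head counts below are PLACEHOLDERS,
# not the real Qwen3.5-27B config.

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 1) -> int:
    """Bytes of KV cache consumed per token of context."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical GQA shape: 48 layers, 8 KV heads, head_dim 128, fp8 cache.
per_token = kv_bytes_per_token(48, 8, 128, dtype_bytes=1)  # 98304 bytes
cache_gib = per_token * 218_592 / 2**30  # cache at the 218k max-model-len
print(per_token, round(cache_gib, 2))  # 98304 20.01
```

Halving `dtype_bytes` is exactly why the `fp8_e4m3` KV cache dtype in the config below buys so much extra context on a 32 GB card.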

model: osoleve/Qwen3.5-27B-Text-NVFP4-MTP from Hugging Face (**this works quite well, with the shortcoming of no image processing)

cli: opencode

vllm config:

```
vllm serve "Qwen3.5-27B-Text-NVFP4-MTP" \
  --max-model-len "218592" \
  --gpu-memory-utilization "0.93" \
  --attention-backend flashinfer \
  --performance-mode interactivity \
  --language-model-only \
  --kv-cache-dtype "fp8_e4m3" \
  --max-num-seqs "2" \
  --skip-mm-profiling \
  --quantization modelopt \
  --reasoning-parser qwen3 \
  --chat-template "/root/autodl-tmp/llm-start/qwen3.5-enhanced.jinja" \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --tool-call-parser qwen3_coder \
  --host "0.0.0.0" \
  --port "6006"
```

(** from my test, qwen3_coder works better than qwen3_xml as the tool-call parser)
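To check the ~77 tps claim on your own box, you can time a single non-streaming request against vLLM's OpenAI-compatible endpoint. A minimal stdlib-only sketch, assuming the server above is up on localhost:6006 (host, port, and model name come from the config; the prompt and `max_tokens` are arbitrary):

```python
# Smoke test: one chat completion against the local vLLM server,
# reporting generated tokens per second of wall-clock time.
import json
import time
import urllib.request

BASE_URL = "http://localhost:6006/v1"          # matches --host/--port above
MODEL = "Qwen3.5-27B-Text-NVFP4-MTP"           # matches the served model name

def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput of one request: generated tokens / wall time."""
    return completion_tokens / elapsed_s if elapsed_s > 0 else 0.0

def measure_once(prompt: str = "Write a haiku about GPUs.") -> float:
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.monotonic()
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    elapsed = time.monotonic() - start
    return tokens_per_second(data["usage"]["completion_tokens"], elapsed)

if __name__ == "__main__":
    print(f"{measure_once():.1f} tps")
```

Note this includes prefill time in the denominator, so on long prompts it will read lower than the decode-only tps vLLM logs; for short prompts it's close enough to compare configs.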

submitted by /u/Kindly-Cantaloupe978