After maxing out my Cursor $20 sub and Z.ai $10 sub for this month, I resorted to a local LLM setup. Got a good outcome on an RTX 5090 running Qwen3.5 27B, with very good tps and a 218k context window. It can even run 2 concurrent sessions with this config, although per-session speed drops as expected. For some reason I can't get it to work at the full 256k context window on vLLM 0.19; it works on vLLM 0.17 per the guide below, but tps suffers since 0.17 apparently lacks many of the optimizations that 0.19 has.
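For a rough sense of why 218k fits but the full 256k might not, here's a back-of-envelope KV-cache sizing sketch. The layer/head numbers below are assumptions for a ~27B Qwen-style GQA model, not values from the actual checkpoint (check its config.json); fp8_e4m3 KV cache means 1 byte per cached element.

```python
# Back-of-envelope KV-cache sizing for one sequence.
# ASSUMED hyperparameters for a ~27B Qwen-style GQA model:
#   48 layers, 8 KV heads, head_dim 128 -- verify against config.json.
# fp8_e4m3 KV cache stores 1 byte per element.

def kv_cache_gib(seq_len: int, num_layers: int = 48, num_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 1) -> float:
    """GiB needed for K and V across all layers for a single sequence."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token / 2**30

print(f"218592 ctx: {kv_cache_gib(218592):.1f} GiB")
print(f"262144 ctx: {kv_cache_gib(262144):.1f} GiB")
```

Under these assumed shapes, the jump from 218k to 256k costs a few extra GiB of KV cache per sequence, which on a 32 GB card already holding NVFP4 weights can be the difference between fitting and OOM.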
Recipe:
vllm 0.19 (see recipe https://huggingface.co/mconcat/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-NVFP4); note that from my testing this model doesn't work very well, so I don't recommend using it, but the guide in the model card is quite useful.
Patch to fix KV size calcs for vllm https://github.com/vllm-project/vllm/pull/36325 (**this is super critical)
model: osoleve/Qwen3.5-27B-Text-NVFP4-MTP from Hugging Face (** this works quite well, with the shortcoming of no image processing)
cli: opencode
vllm config:
vllm serve "Qwen3.5-27B-Text-NVFP4-MTP" \
  --max-model-len "218592" \
  --gpu-memory-utilization "0.93" \
  --attention-backend flashinfer \
  --performance-mode interactivity \
  --language-model-only \
  --kv-cache-dtype "fp8_e4m3" \
  --max-num-seqs "2" \
  --skip-mm-profiling \
  --quantization modelopt \
  --reasoning-parser qwen3 \
  --chat-template "/root/autodl-tmp/llm-start/qwen3.5-enhanced.jinja" \
  --enable-auto-tool-choice \
  --enable-prefix-caching \
  --tool-call-parser qwen3_coder \
  --host "0.0.0.0" \
  --port "6006"

(** from my test, qwen3_coder works better than qwen3_xml as the tool-call parser)
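To point opencode at this server, a custom-provider config along these lines should work, since vLLM exposes an OpenAI-compatible API at /v1. This is a sketch based on opencode's OpenAI-compatible provider support; the provider key, file location (e.g. ~/.config/opencode/opencode.json), and exact schema may differ between versions, so check the opencode docs.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": {
        "baseURL": "http://127.0.0.1:6006/v1"
      },
      "models": {
        "Qwen3.5-27B-Text-NVFP4-MTP": {
          "name": "Qwen3.5 27B (local)"
        }
      }
    }
  }
}
```

The model key must match the name vLLM registered from the serve command above.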