Hi all,
I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with a quantized KV cache will compound errors faster than a non-quantized one, and that noticeably hurts agentic coding.
I figured a 48GB GPU offers just enough VRAM to skip most of the quantization nastiness and use the genuinely good options instead, like Blackwell-accelerated FP8. Luckily, Qwen released their own FP8 variant of the 27B model.
I'm serious when I say I think we might have an answer to all those "what do I buy for $10k?" posts: a pro5k, 64GB of RAM, and a decent CPU/mobo. It runs the FP8 quant of 27B with Blackwell hardware acceleration and a non-quantized KV cache like a champ. It's quiet, cool enough, small, fast... really great.
The end recipe:
- vLLM 0.20.1
- CUDA 12.9
- Qwen's official FP8 quant of Qwen3.6 27B, which keeps all the Qwen3.6 features like multi-modality, MTP, etc.
- BF16 KV cache with 200k tokens @ 1.09x concurrency (a back-of-envelope sizing sketch follows this list)
- Real benchmark numbers to follow - they're running now.
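If you want to sanity-check the VRAM budget yourself, the BF16 KV cache footprint is easy to estimate. Here's a minimal sketch of the arithmetic; the layer/head/dim values in the example call are hypothetical placeholders, not the real Qwen3.6 27B config, so swap in the numbers from the model's config.json.

```python
# Back-of-envelope KV cache sizing. The formula is generic; the example
# dimensions below are placeholders, NOT the actual Qwen3.6 27B config --
# read num_hidden_layers / num_key_value_heads / head_dim from config.json.
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 tokens: int, dtype_bytes: int = 2) -> float:
    """2 tensors (K and V) * layers * KV heads * head_dim * bytes/elem * tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens / 1024**3

# Hypothetical GQA shape: 48 layers, 4 KV heads, head_dim 128, BF16 (2 bytes/elem).
print(f"{kv_cache_gib(48, 4, 128, tokens=196_608):.1f} GiB")  # -> 18.0 GiB for this made-up shape
```

The takeaway is just that a BF16 KV cache scales linearly with context, so whatever VRAM is left after the FP8 weights is what sets your --max-model-len.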
These settings:
```
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_LOG_STATS_INTERVAL=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --performance-mode interactivity \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --gpu-memory-utilization 0.975 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "max_cudagraph_capture_size": 16, "mode": "VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --max-model-len 196608 \
  --kv-cache-dtype bfloat16 \
  --enable-prefix-caching
```
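Once it's up, a quick way to confirm the server loaded the right model is to hit the OpenAI-compatible endpoint. A minimal sketch with the openai Python client, assuming it's installed (pip install openai) and the endpoint matches the --host/--port flags above:

```python
# Quick sanity check of the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Should list Qwen/Qwen3.6-27B-FP8 if the server came up cleanly.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```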
Performance:
I'm running real benchmarks right now and will update this post later, but in general, writing code with MTP=2 yields 60-90 TPS, which I find perfectly acceptable for daily use. Furthermore, because we're running FP8 weights and the KV cache is non-quantized, we get the benefits of long Claude sessions without early compaction, endless loops, etc. It's truly minimally quantized. (A rough way to eyeball TPS yourself is sketched below.)
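For anyone who wants to reproduce the throughput number before my proper benchmarks land, here's a rough streaming-based sketch against the same endpoint as above. It counts streamed chunks as a proxy for tokens and includes prefill time in the window, so treat it as a ballpark figure, not a real benchmark:

```python
# Rough decode-throughput check against the server above (ballpark only:
# streamed chunks are a proxy for tokens, and prefill time is included).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Implement a small LRU cache in Python."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} chunks/sec over {elapsed:.1f}s")
```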