Hi all,
I've seen a bunch of posts about squeezing 27B onto a 24GB card and all the quantization tricks involved in doing so. It's all amazing work, but at the end of the day a quantized model with a quantized KV cache will compound errors faster than a non-quantized one, and that noticeably hurts agentic coding.
I figured a 48GB GPU offers just enough VRAM to skip most of the quantization nastiness and use the genuinely good options instead, like Blackwell-accelerated FP8. Luckily, Qwen released their own FP8 variant of the 27B model.
I'm serious when I say I think we might have an answer to all those "what do I buy for $10k?" posts: a pro5k, 64GB of RAM, and a decent CPU/mobo. It runs the FP8 quant of 27B with Blackwell hardware acceleration and a non-quantized KV cache like a champ. It's quiet, cool enough, small, fast... really great.
The end recipe:
- vLLM 0.20.1
- CUDA 12.9
- Qwen's official FP8 quant of Qwen3.6 27B, which keeps all the Qwen3.6 features like multi-modality, MTP, etc.
- BF16 KV cache with 200k tokens @ 1.09x concurrency (a back-of-envelope sizing sketch follows this list)
- Real benchmark numbers to follow - they're running now.
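If you want to sanity-check the VRAM budget yourself, the BF16 KV cache footprint is easy to estimate. Here's a minimal sketch of the arithmetic; the layer/head/dim values in the example call are hypothetical placeholders, not the real Qwen3.6 27B config, so swap in the numbers from the model's config.json.

```python
# Back-of-envelope KV cache sizing. The formula is generic; the example
# dimensions below are placeholders, NOT the actual Qwen3.6 27B config --
# read num_hidden_layers / num_key_value_heads / head_dim from config.json.
def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 tokens: int, dtype_bytes: int = 2) -> float:
    """2 tensors (K and V) * layers * KV heads * head_dim * bytes/elem * tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens / 1024**3

# Hypothetical GQA shape: 48 layers, 4 KV heads, head_dim 128, BF16 (2 bytes/elem).
print(f"{kv_cache_gib(48, 4, 128, tokens=196_608):.1f} GiB")  # -> 18.0 GiB for this made-up shape
```

The takeaway is just that a BF16 KV cache scales linearly with context, so whatever VRAM is left after the FP8 weights is what sets your --max-model-len.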
These settings:
```
export VLLM_USE_FLASHINFER_MOE_FP8=1
export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_SLEEP_WHEN_IDLE=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export VLLM_LOG_STATS_INTERVAL=2
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export TORCH_FLOAT32_MATMUL_PRECISION=high
export PYTORCH_ALLOC_CONF=expandable_segments:True

vllm serve Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 --port 8080 \
  --performance-mode interactivity \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --gpu-memory-utilization 0.975 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "max_cudagraph_capture_size": 16, "mode": "VLLM_COMPILE"}' \
  --async-scheduling \
  --attention-backend flashinfer \
  --max-model-len 196608 \
  --kv-cache-dtype bfloat16 \
  --enable-prefix-caching
```
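Once it's up, a quick way to confirm the server loaded the right model is to hit the OpenAI-compatible endpoint. A minimal sketch with the openai Python client, assuming it's installed (pip install openai) and the endpoint matches the --host/--port flags above:

```python
# Quick sanity check of the vLLM OpenAI-compatible server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# Should list Qwen/Qwen3.6-27B-FP8 if the server came up cleanly.
print([m.id for m in client.models.list().data])

resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```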
Performance:
I'm running real benchmarks right now and will update this post later, but in general, writing code with MTP=2 yields 60-90 TPS, which I find perfectly acceptable for daily use. Furthermore, because we're running FP8 weights and the KV cache is non-quantized, we get the benefits of long Claude sessions without early compaction, endless loops, etc. It's truly minimally quantized. (A rough way to eyeball TPS yourself is sketched below.)
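For anyone who wants to reproduce the throughput number before my proper benchmarks land, here's a rough streaming-based sketch against the same endpoint as above. It counts streamed chunks as a proxy for tokens and includes prefill time in the window, so treat it as a ballpark figure, not a real benchmark:

```python
# Rough decode-throughput check against the server above (ballpark only:
# streamed chunks are a proxy for tokens, and prefill time is included).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B-FP8",
    messages=[{"role": "user", "content": "Implement a small LRU cache in Python."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start
print(f"~{chunks / elapsed:.1f} chunks/sec over {elapsed:.1f}s")
```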