Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working

Reddit r/LocalLLaMA / 4/29/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A tester reports running Qwen3.6 27B locally on dual RTX 5060 Ti 16GB GPUs using vLLM nightly, achieving roughly 60 tokens per second depending on context length and speculative decoding settings.
  • They configure vLLM with tensor-parallel size 2, fp8 KV cache, modelopt quantization, and a 204,800 max model length; 204k context begins working but remains tight on 32GB total VRAM.
  • Performance tests show ~50–52 tok/s at 8K context with MTP n=1, improving to ~62–66 tok/s with MTP n=3; 32K context lands in the same ~59–66 tok/s range.
  • A full 204k-window server boots and serves; a needle/retrieval smoke test after a 168k-token prefill passes in ~256s, and the server correctly rejects requests whose prompt plus output would exceed the 204,800-token window.
  • Key caveats include limited headroom at 204k, the need for gpu_memory_utilization=0.95 (0.94 fails KV allocation), several minutes of startup due to compile/autotune, and low concurrency because max_num_seqs=1.

I’ve been testing Qwen3.6 27B on a pretty non-standard local setup and figured the numbers might be useful for anyone looking at the newer 16GB Blackwell cards.

Hardware:

  • 2x RTX 5060 Ti 16GB
  • 32GB total VRAM
  • Proxmox LXC
  • 16 vCPU
  • ~60GB RAM
  • CUDA 13 / Torch 2.11 nightly
  • vLLM nightly: 0.19.2rc1.dev
  • Model: sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP

vLLM launch shape:

vllm serve sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP \
  --served-model-name qwen36-nvfp4-mtp \
  --tensor-parallel-size 2 \
  --max-model-len 204800 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --quantization modelopt \
  --speculative-config '{"method":"mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --language-model-only \
  --generation-config vllm \
  --disable-custom-all-reduce \
  --attention-backend TRITON_ATTN
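
Once it's up, a quick smoke test against the OpenAI-compatible endpoint is easy. A minimal sketch, assuming the default port 8000 (the model name matches --served-model-name above; "EMPTY" is the usual placeholder when no API key is set):

# Minimal sanity check against vLLM's OpenAI-compatible API.
# Assumes the server launched above, on the default http://localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",  # matches --served-model-name
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)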

Performance so far (a rough client-side timing sketch follows the list):

  • 8K context, MTP n=1: ~50–52 tok/s
  • 8K context, MTP n=3: ~62–66 tok/s
  • 32K context: ~59–66 tok/s
  • 204800 context starts and works, but is tight
  • Idle VRAM at 204k: ~14.45GiB per GPU
  • After a 168k-token prefill: ~15.65GiB per GPU
  • 168k-token needle/retrieval smoke test passed in ~256s (a minimal sketch of this kind of test is further down)
  • Near-limit test correctly rejected prompt+output over the 204800 window
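
The tok/s numbers above are easy to ballpark client-side. A rough sketch using streaming with usage reporting (endpoint/model as above; wall-clock time includes prefill, so this understates pure decode speed on long prompts):

# Rough client-side throughput estimate.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user", "content": "Explain KV caching in ~300 words."}],
    max_tokens=1024,
    stream=True,
    stream_options={"include_usage": True},  # final chunk carries token counts
)
completion_tokens = 0
for chunk in stream:
    if chunk.usage is not None:  # only set on the last chunk
        completion_tokens = chunk.usage.completion_tokens
elapsed = time.perf_counter() - start
print(f"{completion_tokens} tokens / {elapsed:.1f}s = {completion_tokens / elapsed:.1f} tok/s")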

Thinking mode works too, but you need to give it enough output budget. With a low max_tokens, Qwen can spend the whole cap on reasoning and return no final content. A budget of 1024+ tokens is fine for small prompts; 4096–8192 is safer for real reasoning tasks.
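
For example, a request shaped like this. With --reasoning-parser qwen3, vLLM splits the chain-of-thought into a separate reasoning_content field on the message; that field is vLLM's convention, not part of the OpenAI schema, hence the getattr:

# Thinking mode: reasoning and final answer both count against max_tokens,
# so give the model room; something like 256 can yield reasoning but no answer.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user", "content": "Is 9871 prime? Think it through."}],
    max_tokens=4096,  # 1024+ for small prompts, 4096-8192 for real reasoning
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))  # vLLM extra field
print("answer:", msg.content)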

Caveats:

  • 204k context is right on the edge with 2x16GB.
  • gpu_memory_utilization=0.94 failed KV allocation; 0.95 worked.
  • Startup takes several minutes due to compile/autotune.
  • Logs show FlashInfer autotuner OOM fallbacks during startup, but the server still becomes healthy.
  • I had better luck with TRITON_ATTN for the text path.
  • This is not a high-concurrency config: max_num_seqs=1.
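
For anyone wanting to reproduce the needle/retrieval smoke test from the performance list, a minimal sketch; the filler text, needle string, and repeat count are placeholders, and real token counts depend on the tokenizer:

# Minimal long-context "needle" retrieval check, in the spirit of the 168k test.
# Scale the repeat count toward a target prompt length; counts are approximate.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

needle = "The secret passphrase is BLUE-PELICAN-42."
filler = "The quick brown fox jumps over the lazy dog. " * 16000  # very roughly 160k tokens
mid = len(filler) // 2
prompt = (
    filler[:mid] + "\n" + needle + "\n" + filler[mid:]
    + "\n\nWhat is the secret passphrase mentioned above?"
)

resp = client.chat.completions.create(
    model="qwen36-nvfp4-mtp",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64,
)
print(resp.choices[0].message.content)  # expect it to quote BLUE-PELICAN-42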

Overall: dual 5060 Ti 16GB seems surprisingly usable for Qwen3.6 27B if you use the right checkpoint/runtime combo. It’s not roomy, but it works.

submitted by /u/do_u_think_im_spooky