2 x 5060 ti: Any better configs for Qwen 3.6 27B / 35B?

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The user is testing different quantization and speculative decoding configurations for Qwen 3.6 27B (dense) and Qwen 3.6 35B A3B (MoE) on a dual RTX 5060 Ti 16GB setup, and asks whether others see similar results or additional tuning opportunities.
  • Their initial speculative decoding attempts performed very poorly, which they attribute to suspected PCIe bandwidth limitations.
  • For Qwen 3.6 27B, the posted llama-benchy results show vLLM with NVFP4-MTP achieving the highest prompt-processing throughput and the lowest TTFT among the listed runs (about 1963 t/s PP, 2182 ms TTFT), though its 38.4 t/s TG trails the no-spec AutoRound configurations (~47 t/s).
  • In the same 27B table, vLLM configurations using Lorbus or Intel AutoRound yield much lower PP (~1044–1088 t/s) and higher TTFT, while ik-llama.cpp runs improve PP for some quant/KV-cache setups but trade off TTFT in others.
  • The user provides benchmark methodology details (generation latency mode, no-cache, and parameters like pp/tg/depth/runs) and plans to rerun with larger pp/tg to validate the findings.

I have been trying various setups, quants, etc. for Qwen 3.6 27B and 35B A3B on my 2 x 5060 Ti 16 GB setup. I am wondering if others with similar setups are seeing similar numbers, or if there is more to tweak?

So far, all attempts at speculative decoding have failed with very poor performance, presumably due to PCIe bandwidth limits.
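For intuition on why a draft can make things worse: under a simple cost model (my own sketch, nothing from the post), speculative decoding only wins when enough draft tokens are accepted to amortize the per-step verification and transfer overhead, and PCIe-bound transfers inflate exactly that overhead term:

```python
def spec_decode_speedup(accept_rate: float, n: int, overhead: float = 0.0) -> float:
    """Toy speedup model for speculative decoding with draft length n.

    Assumes each verification step costs one baseline target-model token
    time, plus `overhead` (drafting + cross-device transfers) in the same
    units. With per-token acceptance probability `accept_rate`, expected
    tokens emitted per step (accepted drafts plus the one token the
    target always produces) is sum_{k=0..n} a^k = (1 - a^(n+1)) / (1 - a).
    All names and constants here are illustrative, not from the post.
    """
    expected_tokens = (1 - accept_rate ** (n + 1)) / (1 - accept_rate)
    return expected_tokens / (1.0 + overhead)

# A long draft with mediocre acceptance and heavy transfer overhead
# ends up slower than plain decoding (speedup < 1) ...
print(spec_decode_speedup(accept_rate=0.6, n=15, overhead=2.0))
# ... whereas high acceptance plus cheap drafting would help (speedup > 1):
print(spec_decode_speedup(accept_rate=0.9, n=15, overhead=0.2))
```

The overhead values are made up; the point is only that a fixed per-step cost (as from shuttling draft tokens over PCIe) can push the whole scheme below break-even.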

Measured via llama-benchy 0.3.5 with `--pp 4096 --tg 128 --depth 0 --runs 3 --latency-mode generation --no-cache` (about to rerun with bigger pp/tg).

Qwen3.6-27B (Dense) - Benchmark Results

| Engine | Model | Config | PP (t/s) | TG (t/s) | TTFT (ms) |
|---|---|---|---:|---:|---:|
| vLLM | NVFP4-MTP | TP2-PP1, no spec | 1963 | 38.4 | 2182 |
| vLLM | Lorbus AutoRound | TP2-PP1, no spec | 1087 | 46.9 | 3792 |
| vLLM | Lorbus AutoRound | TP2-PP1, ngram n=3 | 1067 | 40.2 | 3914 |
| vLLM | Lorbus AutoRound | TP2-PP1, MTP n=3 | 1044 | 27.5 | 4008 |
| vLLM | Intel AutoRound | TP2-PP1, no spec | 1088 | 46.8 | 3833 |
| vLLM | Lorbus AutoRound | TP1-PP2, no spec | 1046 | 30.2 | 3995 |
| ik-llama.cpp | DavidAU IQ4_XS | layer, q8_0 KV | 1450 | 28.4 | 2945 |
| ik-llama.cpp | DavidAU IQ4_XS | tensor, f16 KV | 751 | 38.6 | 5635 |
| ik-llama.cpp | DavidAU Q5_K_M | layer, q8_0 KV | 1300 | 23.2 | 3296 |
| ik-llama.cpp | DavidAU Q5_K_M | tensor, f16 KV | 718 | 33.9 | 5894 |
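One sanity check on these numbers: with `--depth 0` the prompt is 4096 tokens, so TTFT should roughly equal 4096 divided by the PP rate. A quick sketch using a few rows from the table (values copied verbatim):

```python
# TTFT (ms) ≈ prompt_tokens / PP throughput (t/s) * 1000.
# PP/TTFT pairs copied from the 27B benchmark table; prompt is pp=4096.
rows = {
    "vLLM NVFP4-MTP":            (1963, 2182),
    "vLLM Lorbus, no spec":      (1087, 3792),
    "ik-llama.cpp IQ4_XS layer": (1450, 2945),
}
pp_tokens = 4096
for name, (pp_tps, ttft_ms) in rows.items():
    predicted = pp_tokens / pp_tps * 1000
    # Measured TTFT tracks the prediction within a few percent,
    # i.e. TTFT here is dominated by prompt processing itself.
    print(f"{name}: predicted {predicted:.0f} ms vs measured {ttft_ms} ms")
```

This consistency suggests the TTFT differences between configs are almost entirely the PP-rate differences, not extra scheduling or startup overhead.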

Qwen3.6-35B-A3B (MoE, 3B activated) - Benchmark Results

| Engine | Model | Config | PP (t/s) | TG (t/s) | TTFT (ms) |
|---|---|---|---:|---:|---:|
| vLLM | NVFP4 | TP2-PP1, no spec | 6259 | 116.5 | 753 |
| vLLM | NVFP4 | TP2-PP1, DFlash n=15 | 5848 | 38.9 | 779 |
| ik-llama.cpp | Unsloth Q4_K_XL | layer, q8_0 KV | 3545 | 108.9 | 1214 |
| ik-llama.cpp | Unsloth IQ4_XS | tensor, f16 KV | 2132 | 99.8 | 2036 |
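To make the speculative-decoding regression concrete, here is the arithmetic on the two NVFP4 rows (values copied from the table; the interpretation is mine):

```python
# Numbers from the 35B-A3B table: DFlash n=15 barely touches prefill
# but costs roughly 3x in generation throughput.
base_pp, base_tg = 6259, 116.5   # vLLM NVFP4, no spec
spec_pp, spec_tg = 5848, 38.9    # vLLM NVFP4, DFlash n=15

pp_penalty = 1 - spec_pp / base_pp   # fractional prefill slowdown, ~7%
tg_slowdown = base_tg / spec_tg      # generation slowdown factor, ~3x
print(f"PP penalty: {pp_penalty:.1%}, TG slowdown: {tg_slowdown:.1f}x")
```

A ~3x generation slowdown with near-unchanged prefill is consistent with a large fixed per-step cost in the draft/verify loop (e.g. cross-GPU transfers over PCIe) rather than with the model itself being slower.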
submitted by /u/ziphnor