I've been trying various setups, quants, etc. for Qwen3.6-27B and Qwen3.6-35B-A3B on my 2x RTX 5060 Ti 16 GB setup. I'm wondering whether others with similar setups are seeing similar numbers, or whether there is more to tweak.
So far, all attempts at speculative decoding have failed with very poor performance, supposedly due to PCIe bandwidth limits.
Measured with llama-benchy 0.3.5 using `--pp 4096 --tg 128 --depth 0 --runs 3 --latency-mode generation --no-cache` (I'm about to rerun with larger pp/tg values).
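For reference, a full run looked roughly like the sketch below. The endpoint and model flags are just stand-ins for however you point the tool at your server; only the benchmark flags quoted above are the exact ones from my runs.

```bash
# Rough sketch of a benchmark run. --base-url / --model are stand-ins
# (not verified against llama-benchy's actual CLI); only the flags on
# the last two lines are the exact ones from my runs.
llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model qwen3.6-27b \
  --pp 4096 --tg 128 --depth 0 \
  --runs 3 --latency-mode generation --no-cache
```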
**Qwen3.6-27B (Dense) - Benchmark Results**
| Engine | Model | Config | PP (t/s) | TG (t/s) | TTFT (ms) |
|---|---|---|---|---|---|
| vLLM | NVFP4-MTP | TP2-PP1, no spec | 1963 | 38.4 | 2182 |
| vLLM | Lorbus AutoRound | TP2-PP1, no spec | 1087 | 46.9 | 3792 |
| vLLM | Lorbus AutoRound | TP2-PP1, ngram n=3 | 1067 | 40.2 | 3914 |
| vLLM | Lorbus AutoRound | TP2-PP1, MTP n=3 | 1044 | 27.5 | 4008 |
| vLLM | Intel AutoRound | TP2-PP1, no spec | 1088 | 46.8 | 3833 |
| vLLM | Lorbus AutoRound | TP1-PP2, no spec | 1046 | 30.2 | 3995 |
| ik-llama.cpp | DavidAU IQ4_XS | layer, q8_0 KV | 1450 | 28.4 | 2945 |
| ik-llama.cpp | DavidAU IQ4_XS | tensor, f16 KV | 751 | 38.6 | 5635 |
| ik-llama.cpp | DavidAU Q5_K_M | layer, q8_0 KV | 1300 | 23.2 | 3296 |
| ik-llama.cpp | DavidAU Q5_K_M | tensor, f16 KV | 718 | 33.9 | 5894 |
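For the vLLM rows above, the launch was along these lines. The repo id is a placeholder, and the speculative JSON is a sketch of how the ngram n=3 run was wired up (exact syntax may differ across vLLM versions):

```bash
# Sketch of the TP2 vLLM launch for the dense 27B runs. The repo id is
# a placeholder; drop --speculative-config for the "no spec" rows. The
# ngram settings below are a sketch of the "ngram n=3" configuration.
vllm serve <autoround-qwen3.6-27b-repo> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 3}'
```

For the TP1-PP2 row, the two parallelism flags are simply swapped.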
**Qwen3.6-35B-A3B (MoE, 3B active parameters) - Benchmark Results**
| Engine | Model | Config | PP (t/s) | TG (t/s) | TTFT (ms) |
|---|---|---|---|---|---|
| vLLM | NVFP4 | TP2-PP1, no spec | 6259 | 116.5 | 753 |
| vLLM | NVFP4 | TP2-PP1, DFlash n=15 | 5848 | 38.9 | 779 |
| ik-llama.cpp | Unsloth Q4_K_XL | layer, q8_0 KV | 3545 | 108.9 | 1214 |
| ik-llama.cpp | Unsloth IQ4_XS | tensor, f16 KV | 2132 | 99.8 | 2036 |
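For the ik-llama.cpp rows, the "layer, q8_0 KV" configs were launched roughly as below. The GGUF filename and context size are placeholders, and my guess is that the "tensor" rows map to row split (`--split-mode row`) with the default f16 KV cache:

```bash
# Sketch of the ik-llama.cpp server launch for the "layer, q8_0 KV" rows.
# Placeholders: the GGUF path and the -c context size. -ngl 99 offloads
# all layers across the two GPUs, --split-mode layer splits by layer, and
# -ctk/-ctv q8_0 quantize the KV cache as in the table.
./llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  --split-mode layer \
  -ctk q8_0 -ctv q8_0 \
  -fa -c 8192
```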