I've been trying various setups, quants, etc. for Qwen3.6-27B and Qwen3.6-35B-A3B on my 2x RTX 5060 Ti 16 GB setup. I'm wondering whether others with similar setups are seeing similar numbers, or whether there is more to tweak.
So far, all attempts at speculative decoding have failed with very poor performance, supposedly due to PCIe bandwidth limits.
Measured with llama-benchy 0.3.5 using `--pp 4096 --tg 128 --depth 0 --runs 3 --latency-mode generation --no-cache` (I'm about to rerun with larger pp/tg values).
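For reference, a full run looked roughly like the sketch below. The endpoint and model flags are just stand-ins for however you point the tool at your server; only the benchmark flags quoted above are the exact ones from my runs.

```bash
# Rough sketch of a benchmark run. --base-url / --model are stand-ins
# (not verified against llama-benchy's actual CLI); only the flags on
# the last two lines are the exact ones from my runs.
llama-benchy \
  --base-url http://localhost:8000/v1 \
  --model qwen3.6-27b \
  --pp 4096 --tg 128 --depth 0 \
  --runs 3 --latency-mode generation --no-cache
```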
**Qwen3.6-27B (Dense) - Benchmark Results**
| Engine | Model | Config | PP (t/s) | TG (t/s) | TTFT (ms) |
|---|---|---|---|---|---|
| vLLM | NVFP4-MTP | TP2-PP1, no spec | 1963 | 38.4 | 2182 |
| vLLM | Lorbus AutoRound | TP2-PP1, no spec | 1087 | 46.9 | 3792 |
| vLLM | Lorbus AutoRound | TP2-PP1, ngram n=3 | 1067 | 40.2 | 3914 |
| vLLM | Lorbus AutoRound | TP2-PP1, MTP n=3 | 1044 | 27.5 | 4008 |
| vLLM | Intel AutoRound | TP2-PP1, no spec | 1088 | 46.8 | 3833 |
| vLLM | Lorbus AutoRound | TP1-PP2, no spec | 1046 | 30.2 | 3995 |
| ik-llama.cpp | DavidAU IQ4_XS | layer, q8_0 KV | 1450 | 28.4 | 2945 |
| ik-llama.cpp | DavidAU IQ4_XS | tensor, f16 KV | 751 | 38.6 | 5635 |
| ik-llama.cpp | DavidAU Q5_K_M | layer, q8_0 KV | 1300 | 23.2 | 3296 |
| ik-llama.cpp | DavidAU Q5_K_M | tensor, f16 KV | 718 | 33.9 | 5894 |
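For the vLLM rows above, the launch was along these lines. The repo id is a placeholder, and the speculative JSON is a sketch of how the ngram n=3 run was wired up (exact syntax may differ across vLLM versions):

```bash
# Sketch of the TP2 vLLM launch for the dense 27B runs. The repo id is
# a placeholder; drop --speculative-config for the "no spec" rows. The
# ngram settings below are a sketch of the "ngram n=3" configuration.
vllm serve <autoround-qwen3.6-27b-repo> \
  --tensor-parallel-size 2 \
  --pipeline-parallel-size 1 \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 3, "prompt_lookup_max": 3}'
```

For the TP1-PP2 row, the two parallelism flags are simply swapped.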
**Qwen3.6-35B-A3B (MoE, 3B active parameters) - Benchmark Results**
| Engine | Model | Config | PP (t/s) | TG (t/s) | TTFT (ms) |
|---|---|---|---|---|---|
| vLLM | NVFP4 | TP2-PP1, no spec | 6259 | 116.5 | 753 |
| vLLM | NVFP4 | TP2-PP1, DFlash n=15 | 5848 | 38.9 | 779 |
| ik-llama.cpp | Unsloth Q4_K_XL | layer, q8_0 KV | 3545 | 108.9 | 1214 |
| ik-llama.cpp | Unsloth IQ4_XS | tensor, f16 KV | 2132 | 99.8 | 2036 |
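For the ik-llama.cpp rows, the "layer, q8_0 KV" configs were launched roughly as below. The GGUF filename and context size are placeholders, and my guess is that the "tensor" rows map to row split (`--split-mode row`) with the default f16 KV cache:

```bash
# Sketch of the ik-llama.cpp server launch for the "layer, q8_0 KV" rows.
# Placeholders: the GGUF path and the -c context size. -ngl 99 offloads
# all layers across the two GPUs, --split-mode layer splits by layer, and
# -ctk/-ctv q8_0 quantize the KV cache as in the table.
./llama-server \
  -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  --split-mode layer \
  -ctk q8_0 -ctv q8_0 \
  -fa -c 8192
```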