I have this machine right now:
- MSI B550-A PRO
- Ryzen 5 5600X, 4x16 GB DDR4-3200
- RTX 3090 - PCIe4 x16 (~25 GB/s)
- RTX 3090 - PCIe3 x4 (<3 GB/s)
I added the second GPU just recently, and after a day of optimizing I settled on this setup:
| Model | Quant | KV quant | --ctx-size | pp (t/s) | tg (t/s) | Engine |
|---|---|---|---|---|---|---|
| Qwen3.5-122B-A10B | AesSedai Q4_K_M | q8_0 | 80000 | 1000 | 22 | ik_llama.cpp |
| Qwen3.5-27B | PaMRxR Q8_K_L | bf16 | 200000 | 1950 | 25 | llama.cpp |
| Qwen3.5-35B-A3B | PaMRxR Q8_K_L | bf16 | 260000 | 4366 | 102 | llama.cpp |
With --split-mode layer things work well, especially pp, but tg is less impressive. With vLLM I got 50-60 t/s tg on the 27B, but with a worse quant, much worse prompt processing (~600 t/s), and abysmal startup time, so overall it wasn't really worth it.
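For reference, the layer-split launch looks roughly like this (model path, --n-gpu-layers, and the 60/40 split ratio are placeholders, not my exact values):

```bash
# Sketch of a layer-split launch; path and split ratio are illustrative.
# --split-mode layer keeps whole layers on one GPU, so only small activation
# tensors have to cross the slow PCIe3 x4 link.
./llama-server \
  --model ./Qwen3.5-27B-Q8_K_L.gguf \
  --n-gpu-layers 99 \
  --split-mode layer \
  --tensor-split 60,40 \
  --ctx-size 200000 \
  --cache-type-k bf16 --cache-type-v bf16
```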
I wonder what others with dual 3090s get with these or similar models, especially if you have better transfer speeds between the GPUs. I suspect an X570 motherboard with PCIe4 x8/x8 could improve tg, especially with --split-mode row / graph (see the sketch below). I just don't want to swap the board blindly, because everything is wired into a water cooling loop that took a lot of time to set up. NVLink is unfortunately not an option since the GPUs are different brands.
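The row-split variant I'd want to benchmark on an x8/x8 board would look like this (same placeholder path as above):

```bash
# --split-mode row shards each weight matrix across both GPUs, so every
# layer generates inter-GPU traffic, which is why it should scale with
# link bandwidth, unlike the layer split above.
./llama-server \
  --model ./Qwen3.5-27B-Q8_K_L.gguf \
  --n-gpu-layers 99 \
  --split-mode row \
  --main-gpu 0
```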
Side note: the Q8_K_L quants are my own, basically Q8_0 with a few tensors selectively overridden to BF16. They come out smaller than UD-Q8_K_XL while achieving better KLD. Credit to /u/TitwitMuffbiscuit and his kld-sweep tool, which makes it easy to compare the ppl/KLD of multiple quants.
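If anyone wants to reproduce the recipe, it's roughly this; the two overrides shown are just the usual suspects (output and embeddings), the actual set came out of the per-tensor sweep:

```bash
# Quantize to Q8_0 but keep a few sensitive tensors in BF16.
# Which tensors to override is model-specific; newer llama-quantize builds
# also accept per-tensor --tensor-type patterns for finer control.
./llama-quantize \
  --output-tensor-type bf16 \
  --token-embedding-type bf16 \
  Qwen3.5-27B-BF16.gguf Qwen3.5-27B-Q8_K_L.gguf Q8_0
```

For the KLD comparison itself, llama.cpp's perplexity tool can do it directly (I assume kld-sweep wraps something similar; file names here are placeholders):

```bash
# Dump reference logits from the unquantized model once...
./llama-perplexity -m Qwen3.5-27B-BF16.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin
# ...then score each candidate quant against them.
./llama-perplexity -m Qwen3.5-27B-Q8_K_L.gguf -f wiki.test.raw \
  --kl-divergence-base logits.bin --kl-divergence
```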



