Hey all,
Looking for some real-world advice on GPU choices for running the new dense models — mainly Qwen 3.5 27B and Gemma 4 31B.
What I’m targeting
- Context: 64K+ (ideally higher later)
- Speed: 30+ tok/s @ tg128 minimum
- Power: not critical, but lower is a bonus
From what I’ve seen, these dense models are way more demanding than MoE, since every weight gets read for every generated token.
Why not MoE?
I’m already running MoE just fine on P40s:
- Gemma 4 26B MoE
- ~32K ctx
- ~42+ tok/s @ tg128
So now I want to move to dense models for better quality / reasoning.
Budget
- ~2500 AUD (~$1800 USD)
- GPU only (already have CPU / RAM / board)
- Ignore PCIe lane limits for now
Options I’m considering
A. 2× 9070 XT (16GB)
B. 1× R9 9700 (32GB)
C. 2× 7900 XTX (24GB)
D. 1× RTX Pro 4000 (24GB)
N. 1× Intel Arc Pro B70 (32GB, maybe future option, but not now)
My current understanding (please correct me)
- 16GB cards → basically forced into pipeline parallel, so per-GPU compute matters a lot
- 2× 7900 XTX should have the best raw throughput
- RTX Pro 4000 maybe similar class, but VRAM limits context flexibility
- 32GB single card (R9 9700) is attractive for KV cache / long ctx, BUT:
  - perf ≈ 9070 XT?
  - price = ~2× 9070 XT + extra GPU…
- 2× 9070 XT might be best “budget parallel” option
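A rough way to sanity-check the throughput guesses above is a bandwidth-only ceiling for token generation. This is a minimal sketch, assuming single-stream generation with a layer (pipeline) split where each GPU reads its resident weights once per token and the GPUs take turns. It ignores compute, KV cache reads, and inter-GPU transfers, so real numbers will land below it. The function name and every GB and GB/s figure are placeholders for illustration, not measured specs for any card listed here.

```python
def tg_upper_bound(shards):
    """Bandwidth-bound tok/s ceiling for single-stream generation.

    shards: list of (gb_resident_on_gpu, gpu_bandwidth_gb_per_s) tuples,
            one per GPU in a layer (pipeline) split.
    Assumption: each GPU reads all of its resident weights once per token,
    and the GPUs work one after another rather than at the same time.
    """
    seconds_per_token = sum(gb / bw for gb, bw in shards)
    return 1.0 / seconds_per_token


# Placeholder numbers only: a 20 GB quantized dense model, 600 GB/s cards.
one_card = tg_upper_bound([(20.0, 600.0)])
two_cards = tg_upper_bound([(10.0, 600.0), (10.0, 600.0)])

print(f"one card : {one_card:.1f} tok/s ceiling")
print(f"two cards: {two_cards:.1f} tok/s ceiling (layer split)")
```

Under these assumptions a second identical card in a layer split adds VRAM but does not raise the ceiling, because the total bytes read per token stay the same. A tensor/row split can behave differently, which is part of what I'm asking about.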
Concerns (based on what I’ve seen here)
- KV cache is brutal on Gemma 4 31B: “massive KV cache… biggest drawback”
- Even people with large VRAM struggle with higher quants / context
- 24GB seems like the minimum viable tier for 31B dense
- Long context scaling is still very hardware-sensitive
- Multi-GPU scaling (esp PCIe) seems very inconsistent depending on backend
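To put numbers on the KV cache worry, here is a small estimator. It is a sketch of the standard upper bound for a model that uses full attention on every layer: 2 (K and V) × layers × KV heads × head dim × context length × bytes per element. The layer count, KV head count, and head dim below are hypothetical placeholders, not the real Gemma 4 31B or Qwen 3.5 27B configs, so swap in the values from the model's config file. Models with sliding-window or other reduced-cache layers should come in below this bound.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Upper-bound KV cache size, assuming full attention on every layer."""
    # The leading 2 covers one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem


# HYPOTHETICAL architecture numbers, for illustration only.
cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128)

for ctx in (32_768, 65_536, 131_072):
    gib = kv_cache_bytes(ctx_len=ctx, **cfg) / 2**30
    print(f"{ctx:>7} tokens: {gib:.1f} GiB at f16")
```

Passing `bytes_per_elem=1` gives a rough figure for an 8-bit KV cache, ignoring any per-block scale overhead the backend adds.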
What I want to know
If you’ve actually run Qwen 3.5 27B / Gemma 4 31B (dense):
- What GPU are you using?
- What real tok/s are you getting (esp @ 64K+)?
- Does multi-GPU actually scale well or just look good on paper?
- Is 32GB single GPU > dual 16/24GB in practice?
- Any regrets / “don’t buy this” advice?
Bonus question
If you had ~$1800 today, would you:
- go multi-GPU AMD (cheap + raw compute)
- or single high-VRAM card (simpler + better ctx)
Appreciate any real benchmarks / configs 🙏


