GPU advice for Qwen 3.5 27B / Gemma 4 31B (dense) — aiming for 64K ctx, 30+ t/s

Reddit r/LocalLLaMA / 4/16/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • The post asks for practical GPU recommendations to run dense 27B/31B models (Qwen 3.5 27B and Gemma 4 31B) with 64K+ context and at least ~30 tok/s at tg128.
  • The author contrasts dense vs MoE performance, noting that they currently run MoE models comfortably on P40 GPUs but find dense models “way more demanding.”
  • They outline a shortlist of GPU configurations (dual 16GB 9070 XT, single 32GB R9 9700, dual 24GB 7900 XTX, single 24GB RTX Pro 4000, and an optional Arc Pro B70) and highlight constraints around VRAM, KV cache, and multi-GPU scaling.
  • Key concerns include the “brutal” KV cache for Gemma 4 31B, uncertainty about long-context scaling, and inconsistent multi-GPU behavior depending on the software/backend.
  • The post requests real-world benchmarks (tok/s at 64K+), guidance on whether 32GB single-GPU outperforms dual smaller-VRAM setups, and advice on what not to buy given an ~$1800 USD budget.

Hey all,

Looking for some real-world advice on GPU choices for running the new dense models — mainly Qwen 3.5 27B and Gemma 4 31B.

What I’m targeting

  • Context: 64K+ (ideally higher later)
  • Speed: 30+ tok/s @ tg128 minimum
  • Power: not critical, but lower is a bonus

From what I’ve seen, these dense models are way more demanding than MoE.
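
A rough sanity check on why: single-stream decode is mostly memory-bandwidth bound, and a dense model has to stream all of its weights every token, while an MoE only streams the active experts. Here's a back-of-envelope sketch, assuming ~4.5 bits/weight for a Q4-class quant; the bandwidth numbers are the cards' published specs (worth double-checking), and the R9 9700 figure is assumed to match the 9070 XT since they share a die:

```python
# Rough decode-speed ceiling: generation is memory-bandwidth bound, so
# tok/s <= bandwidth / bytes streamed per token. Dense models stream
# every weight each token; MoE only streams the active experts.

def decode_ceiling_tps(params_b: float, bytes_per_weight: float, bw_gbs: float) -> float:
    """Upper-bound tok/s, ignoring KV-cache reads and kernel overhead."""
    weight_bytes = params_b * 1e9 * bytes_per_weight
    return bw_gbs * 1e9 / weight_bytes

# Dense 27B at ~4.5 bits/weight (Q4_K_M-ish) ~= 15 GB streamed per token.
for name, bw in [("9070 XT", 645), ("7900 XTX", 960), ("R9 9700 (assumed)", 645)]:
    print(f"{name}: ~{decode_ceiling_tps(27, 4.5 / 8, bw):.0f} tok/s ceiling")
```

That puts 30+ tok/s within reach on paper (~43 on a 9070 XT-class card, ~63 on a 7900 XTX), but KV-cache reads at 64K eat into the ceiling, which is exactly why I want real numbers.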

Why not MoE?

I’m already running MoE just fine on P40s:

  • Gemma 4 26B MoE
  • ~32K ctx
  • ~42+ tok/s @ tg128

So now I want to move to dense models for better quality / reasoning.

Budget

  • ~2500 AUD (~$1800 USD)
  • GPU only (already have CPU / RAM / board)
  • Ignore PCIe lane limits for now

Options I’m considering

A. 2× 9070 XT (16GB)
B. 1× R9 9700 (32GB)
C. 2× 7900 XTX (24GB)
D. 1× RTX Pro 4000 (24GB)

N. 1× Intel Arc Pro B70 (32GB, maybe future option, but not now)

My current understanding (please correct me)

  • 16GB cards → basically forced into pipeline parallel, so per-GPU compute matters a lot (rough dual-GPU split sketch after this list)
  • 2× 7900 XTX should have the best raw throughput
  • RTX Pro 4000 maybe similar class, but VRAM limits context flexibility
  • 32GB single card (R9 9700) is attractive for KV cache / long ctx, BUT:
    • perf ≈ 9070 XT?
    • price ≈ 2× a 9070 XT, i.e. the same money could buy a whole extra GPU…
  • 2× 9070 XT might be best “budget parallel” option
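
To make the pipeline-parallel point concrete, here's a minimal llama-cpp-python sketch of a layer split across two cards. The model filename is hypothetical and the exact kwargs can vary by version, so treat this as illustrative, not a tuned config:

```python
# Minimal sketch: split a dense model's layers across two GPUs with
# llama-cpp-python (requires a CUDA/ROCm/Vulkan build of llama.cpp).
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-27b-q4_k_m.gguf",  # hypothetical quant filename
    n_ctx=65536,              # the 64K target; the KV cache must fit too
    n_gpu_layers=-1,          # offload all layers
    tensor_split=[0.5, 0.5],  # ~even layer split across the two cards
    flash_attn=True,          # trims attention memory where supported
)

out = llm("Why are dense models slower than MoE?", max_tokens=64)
print(out["choices"][0]["text"])
```

With the default layer split the GPUs mostly take turns rather than working simultaneously, so two 16GB cards buy capacity more than 2× throughput; that's why per-GPU compute matters so much in option A.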

Concerns (based on what I’ve seen here)

  • KV cache is brutal on Gemma 4 31B (the “massive KV cache… biggest drawback”, as others here put it); rough math after this list
  • Even people with large VRAM struggle with higher quants / context
  • 24GB seems like the minimum viable tier for 31B dense
  • Long context scaling is still very hardware-sensitive
  • Multi-GPU scaling (esp PCIe) seems very inconsistent depending on backend
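
On the KV cache point, the arithmetic is easy to run yourself. A sketch with assumed dimensions; neither Qwen 3.5 27B nor Gemma 4 31B has a config I can quote, so the layer/head counts below are placeholders for a ~30B-class dense model with GQA:

```python
# KV cache size for a dense transformer with grouped-query attention:
# 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes per element.
# Model dims here are ASSUMED placeholders, not published configs.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 2**30

ctx = 64 * 1024
print(f"fp16 KV @ 64K: {kv_cache_gib(60, 8, 128, ctx, 2):.1f} GiB")  # ~15.0
print(f"q8   KV @ 64K: {kv_cache_gib(60, 8, 128, ctx, 1):.1f} GiB")  # ~7.5
```

If Gemma 4 31B uses more KV heads (or full MHA on some layers), that number balloons fast, which would explain why 24GB feels like the floor once weights and a 64K KV cache have to coexist.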

What I want to know

If you’ve actually run Qwen 3.5 27B / Gemma 4 31B (dense):

  • What GPU are you using?
  • What real tok/s are you getting (esp. @ 64K+)?
  • Does multi-GPU actually scale well or just look good on paper?
  • Is 32GB single GPU > dual 16/24GB in practice?
  • Any regrets / “don’t buy this” advice?

Bonus question

If you had ~$1800 today, would you:

  • go multi-GPU AMD (cheap + raw compute)
  • or single high-VRAM card (simpler + better ctx)

Appreciate any real benchmarks / configs 🙏

submitted by /u/Fit-Courage5400