Gemma 4 on LocalAI: Vulkan vs ROCm

Hey everyone! 👋 Just finished running a bunch of benchmarks on the new Gemma 4 models using LocalAI and figured I'd share the results. I was curious how the Vulkan and ROCm backends stack up against each other, and how the 26B MoE (only ~4B active params) compares to the full 31B dense model in practice.

Three model variants, each tested on both Vulkan and ROCm:
Context depths tested: 0, 4K, 8K, 16K, 32K, 65K, and 100K tokens.

System Environment

Lemonade Version: 10.1.0

```text
vulkan : 'b8681'
rocm   : 'b1232'
cpu    : 'b8681'
```

The results

1. Gemma 4 26B-A4B — APEX Balanced (mudler)

(See charts 1 & 2.) This one's the star of the show. On token generation, Vulkan consistently beats ROCm by about 5–15%, starting around ~49 t/s at zero context and gracefully degrading to ~32 t/s at 100K. The gap closes at very long contexts, though, with both backends landing in roughly the same place.

Prompt processing is more interesting: ROCm actually spikes higher at low context (peaking near ~990 t/s at 4K!) while Vulkan holds steadier. They converge around 32K and beyond, with ROCm slightly ahead at 100K. Honestly, either backend works great here: Vulkan if you care about generation speed, ROCm if you're doing a lot of long-prompt ingestion.

2. Gemma 4 26B-A4B — Q5_K_XL GGUF (unsloth)

(See charts 3 & 4.) Pretty similar story to the APEX quant, but a few t/s slower on generation (~40 t/s baseline vs ~49 for APEX). The two backends are basically neck and neck on generation once you ignore the weird Vulkan spike at 4K context (that ~170 t/s outlier is almost certainly a measurement artifact; everything around it is ~40 t/s). On prompt processing, ROCm takes a clear lead at shorter contexts, hitting ~1075 t/s at 4K compared to Vulkan's ~900 t/s. They converge again past 32K.

3. Gemma 4 31B Dense — Q5_K_XL GGUF (unsloth)

(See charts 5 & 6.) And here's where things get... humbling. The dense 31B model generates at ~8–9 t/s. That's it. Compare that to the MoE's 40–49 t/s and you really feel the difference: every single parameter fires on every token, so there's no free lunch. Vulkan has a tiny edge on generation speed (~0.3–0.5 t/s faster), but it couldn't even complete the 65K and 100K context tests — likely it ran out of memory or timed out.

Prompt processing is where ROCm absolutely dominates this model: ~264 t/s vs ~174 t/s at 4K context, and the gap only grows. At 32K, ROCm is doing ~153 t/s while Vulkan crawls at ~64 t/s. Not even close. If you're running the 31B dense model, ROCm is the way to go. But honestly... maybe just run the MoE instead? 😅
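The MoE-vs-dense gap tracks the usual back-of-envelope reasoning: token generation is roughly memory-bandwidth bound, so speed scales inversely with the parameters touched per token. A quick sketch of that arithmetic (the bandwidth-bound assumption and the midpoint figures are mine, not something the benchmark measured directly):

```python
# Rough model: generation t/s ~ bandwidth / params-read-per-token,
# so the expected MoE speedup is dense params / active params.
dense_params_b = 31.0  # dense: all 31B params fire on every token
moe_active_b = 4.0     # MoE: only ~4B active params fire per token

theoretical_speedup = dense_params_b / moe_active_b

# Observed midpoints from the runs above: ~44.5 t/s (MoE) vs ~8.5 t/s (dense).
observed_speedup = 44.5 / 8.5

print(f"theoretical ~{theoretical_speedup:.1f}x, observed ~{observed_speedup:.1f}x")
```

The observed ~5x is below the theoretical ~7.8x, which is about what you'd expect once attention/KV-cache traffic and overhead that doesn't shrink with active params are factored in.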
Big picture:
For day-to-day use, the 26B-A4B MoE on Vulkan is my pick: fast, responsive, and it handles 100K context without breaking a sweat.

Benchmarks done with llama-benchy. Happy to share raw numbers if anyone wants them. Let me know if you've seen different results on your hardware!
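Until I post the raw numbers, here's roughly how I tabulate the generation results per context depth. Note that only the Vulkan endpoints below match the figures above; the ROCm values are illustrative placeholders derived from the ~5–15% gap, not actual measurements:

```python
# Generation throughput (t/s) by context depth for the 26B-A4B MoE.
# ROCm entries are illustrative placeholders, not raw benchmark data.
results = {
    0:       {"vulkan": 49.0, "rocm": 44.0},
    32_768:  {"vulkan": 38.0, "rocm": 36.0},
    100_000: {"vulkan": 32.0, "rocm": 31.0},
}

def faster_backend(per_backend):
    """Return (backend_name, t/s) for the higher-throughput backend."""
    return max(per_backend.items(), key=lambda kv: kv[1])

for ctx, per_backend in sorted(results.items()):
    name, tps = faster_backend(per_backend)
    print(f"{ctx:>7,} ctx: {name} @ {tps:.1f} t/s")
```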
Reddit r/LocalLLaMA / 4/8/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The post benchmarks three Gemma 4 variants (26B MoE with two quant formats and a 31B dense model) running on LocalAI, comparing Vulkan vs ROCm performance across multiple context lengths up to 100K tokens.
- For the 26B MoE model with APEX Balanced quantization, Vulkan delivers consistently faster token generation than ROCm (roughly a 5–15% lead) at shorter/medium contexts, while the gap narrows at very long contexts.
- Prompt processing behaves differently by backend: ROCm can show higher throughput at low context sizes (peaking near 4K tokens), whereas Vulkan is steadier; the results converge at larger contexts with ROCm slightly ahead at ~100K.
- The benchmarks use llama-benchy with prefix caching and generation-latency mode, and conclude that both backends work well overall: choose Vulkan for generation speed and ROCm for heavier long-prompt ingestion.
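All the throughput figures in these points reduce to tokens divided by wall-clock time. A minimal sketch of that measurement, with a sleep stub standing in for the model call (llama-benchy's actual internals aren't shown in the post):

```python
import time

def measure_tps(generate, n_tokens):
    """Time a generation call and return tokens per second."""
    start = time.perf_counter()
    generate(n_tokens)  # stand-in for the real decode loop
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub: pretend generating 100 tokens takes ~50 ms (~2000 t/s ceiling).
tps = measure_tps(lambda n: time.sleep(0.05), 100)
print(f"~{tps:.0f} t/s")
```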