Gemma 4 on LocalAI: Vulkan vs ROCm

Hey everyone! 👋 Just finished running a bunch of benchmarks on the new Gemma 4 models using LocalAI and figured I'd share the results. I was curious how the Vulkan and ROCm backends stack up against each other, and how the 26B MoE (only ~4B active params) compares to the full 31B dense model in practice.

Three model variants, each tested on both Vulkan and ROCm:
Context depths tested: 0, 4K, 8K, 16K, 32K, 65K, and 100K tokens.

System Environment

Lemonade Version: 10.1.0

```text
vulkan : 'b8681'
rocm   : 'b1232'
cpu    : 'b8681'
```

The results

1. Gemma 4 26B-A4B — APEX Balanced (mudler)

(See charts 1 & 2.) This one's the star of the show. On token generation, Vulkan consistently beats ROCm by about 5–15%, starting around ~49 t/s at zero context and gracefully degrading to ~32 t/s at 100K. The gap closes at very long contexts, though, with both backends landing in roughly the same place.

Prompt processing is more interesting: ROCm actually spikes higher at low context (peaking near ~990 t/s at 4K!) while Vulkan holds steadier. They converge around 32K and beyond, with ROCm slightly ahead at 100K. Honestly, either backend works great here: Vulkan if you care about generation speed, ROCm if you're doing a lot of long-prompt ingestion.

2. Gemma 4 26B-A4B — Q5_K_XL GGUF (unsloth)

(See charts 3 & 4.) Pretty similar story to the APEX quant, but a few t/s slower on generation (~40 t/s baseline vs ~49 for APEX). The two backends are basically neck and neck on generation once you ignore the weird Vulkan spike at 4K context (that ~170 t/s outlier is almost certainly a measurement artifact; everything around it is ~40 t/s). On prompt processing, ROCm takes a clear lead at shorter contexts, hitting ~1075 t/s at 4K compared to Vulkan's ~900 t/s. They converge again past 32K.

3. Gemma 4 31B Dense — Q5_K_XL GGUF (unsloth)

(See charts 5 & 6.) And here's where things get... humbling. The dense 31B model generates at ~8–9 t/s. That's it. Compare that to the MoE's 40–49 t/s and you really feel the difference: every single parameter fires on every token, so there's no free lunch. Vulkan has a tiny edge on generation speed (~0.3–0.5 t/s faster), but it couldn't even complete the 65K and 100K context tests — likely it ran out of memory or timed out.

Prompt processing is where ROCm absolutely dominates this model: ~264 t/s vs ~174 t/s at 4K context, and the gap only grows. At 32K, ROCm is doing ~153 t/s while Vulkan crawls at ~64 t/s. Not even close. If you're running the 31B dense model, ROCm is the way to go. But honestly... maybe just run the MoE instead? 😅
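The MoE-vs-dense gap tracks the usual back-of-envelope reasoning: token generation is roughly memory-bandwidth bound, so speed scales inversely with the parameters touched per token. A quick sketch of that arithmetic (the bandwidth-bound assumption and the midpoint figures are mine, not something the benchmark measured directly):

```python
# Rough model: generation t/s ~ bandwidth / params-read-per-token,
# so the expected MoE speedup is dense params / active params.
dense_params_b = 31.0  # dense: all 31B params fire on every token
moe_active_b = 4.0     # MoE: only ~4B active params fire per token

theoretical_speedup = dense_params_b / moe_active_b

# Observed midpoints from the runs above: ~44.5 t/s (MoE) vs ~8.5 t/s (dense).
observed_speedup = 44.5 / 8.5

print(f"theoretical ~{theoretical_speedup:.1f}x, observed ~{observed_speedup:.1f}x")
```

The observed ~5x is below the theoretical ~7.8x, which is about what you'd expect once attention/KV-cache traffic and overhead that doesn't shrink with active params are factored in.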
Big picture:
For day-to-day use, the 26B-A4B MoE on Vulkan is my pick: fast, responsive, and it handles 100K context without breaking a sweat.

Benchmarks done with llama-benchy. Happy to share raw numbers if anyone wants them. Let me know if you've seen different results on your hardware!
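Until I post the raw numbers, here's roughly how I tabulate the generation results per context depth. Note that only the Vulkan endpoints below match the figures above; the ROCm values are illustrative placeholders derived from the ~5–15% gap, not actual measurements:

```python
# Generation throughput (t/s) by context depth for the 26B-A4B MoE.
# ROCm entries are illustrative placeholders, not raw benchmark data.
results = {
    0:       {"vulkan": 49.0, "rocm": 44.0},
    32_768:  {"vulkan": 38.0, "rocm": 36.0},
    100_000: {"vulkan": 32.0, "rocm": 31.0},
}

def faster_backend(per_backend):
    """Return (backend_name, t/s) for the higher-throughput backend."""
    return max(per_backend.items(), key=lambda kv: kv[1])

for ctx, per_backend in sorted(results.items()):
    name, tps = faster_backend(per_backend)
    print(f"{ctx:>7,} ctx: {name} @ {tps:.1f} t/s")
```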
Reddit r/LocalLLaMA / 4/8/2026
💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage · Models & Research
Key Points
- The post benchmarks three Gemma 4 variants (26B MoE with two quant formats and a 31B dense model) running on LocalAI, comparing Vulkan vs ROCm performance across multiple context lengths up to 100K tokens.
- For the 26B MoE model with APEX Balanced quantization, Vulkan delivers consistently faster token generation than ROCm (roughly a 5–15% lead) at shorter/medium contexts, while the gap narrows at very long contexts.
- Prompt processing behaves differently by backend: ROCm can show higher throughput at low context sizes (peaking near 4K tokens), whereas Vulkan is steadier; the results converge at larger contexts with ROCm slightly ahead at ~100K.
- The benchmarks use llama-benchy with prefix caching and generation-latency mode, and conclude that both backends work well overall: choose Vulkan for generation speed and ROCm for heavier long-prompt ingestion.
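All the throughput figures in these points reduce to tokens divided by wall-clock time. A minimal sketch of that measurement, with a sleep stub standing in for the model call (llama-benchy's actual internals aren't shown in the post):

```python
import time

def measure_tps(generate, n_tokens):
    """Time a generation call and return tokens per second."""
    start = time.perf_counter()
    generate(n_tokens)  # stand-in for the real decode loop
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Stub: pretend generating 100 tokens takes ~50 ms (~2000 t/s ceiling).
tps = measure_tps(lambda n: time.sleep(0.05), 100)
print(f"~{tps:.0f} t/s")
```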