Gemma 4 QAT Q4_0 Bench on Strix Halo
Reddit r/LocalLLaMA / 6/6/2026
Key Points
- The article benchmarks Google’s official Gemma 4 QAT Q4_0 GGUF models when run locally on a Strix Halo APU using llama.cpp with Vulkan/RADV.
- QAT (quantization-aware training) trains the model with quantization in the loop, so it preserves more of its original behavior at Q4 precision than post-training quantization alone.
- The post details the host setup used for Vulkan/RADV inference: AMD Ryzen AI Max+ 395, 128GB unified memory, a Linux Mint/Ubuntu base, Linux kernel 6.17, Mesa/RADV, and an Atomic llama.cpp TurboQuant fork.
- Multiple QAT model variants are tested, including Gemma 4 12B, 26B-A4B, and 31B at Q4_0, and their on-disk GGUF sizes are reported.
- For MTP (multi-token prediction) assistant heads, the author reports that converting Google's matching QAT assistant sources into Atomic/llama.cpp-compatible GGUF heads yields stronger acceptance than reusing non-QAT assistant heads, and notes that the converted heads need extra GGUF metadata.
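
To make the QAT point concrete, here is a minimal Python sketch of the Q4_0 round trip as the format is commonly implemented in llama.cpp: blocks of 32 weights share one scale, and each weight is stored as a 4-bit code. The function names are hypothetical, the scale is kept as a Python float rather than rounded to fp16, and nothing here reflects Google's actual QAT recipe, which the summary does not describe. The idea is only that QAT exposes the model to this kind of rounding error during training so it can adapt to it.

```python
import random

BLOCK_SIZE = 32  # Q4_0 groups weights into blocks of 32
# One 2-byte fp16 scale plus 32 four-bit codes (16 bytes) per block.
BITS_PER_WEIGHT = (2 + BLOCK_SIZE // 2) * 8 / BLOCK_SIZE


def quantize_block(block):
    """Quantize one block of 32 floats to (scale, 4-bit codes), Q4_0-style."""
    assert len(block) == BLOCK_SIZE
    # Use the largest-magnitude element, keeping its sign, to set the scale.
    extreme = max(block, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        # All-zero block: every weight maps to the midpoint code.
        return 0.0, [8] * BLOCK_SIZE
    # Shift by 8 so codes land in 0..15, round to nearest, clamp the top end.
    codes = [min(15, int(x / scale + 8.5)) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a scale and its 4-bit codes."""
    return [(c - 8) * scale for c in codes]


def fake_quantize(weights):
    """Quantize then dequantize, block by block (hypothetical helper).

    This is the kind of round trip a QAT forward pass would simulate;
    the real training setup is an assumption here, not taken from the post.
    """
    assert len(weights) % BLOCK_SIZE == 0
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(2 * BLOCK_SIZE)]
    approx = fake_quantize(weights)
    worst = max(abs(a - b) for a, b in zip(weights, approx))
    print("bits per weight:", BITS_PER_WEIGHT)
    print("worst round-trip error:", worst)
```

The bits-per-weight figure only describes tensors stored as Q4_0. A GGUF file may keep some tensors at higher precision, so the on-disk sizes reported in the post should not be expected to match this figure exactly.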