Gemma 4 QAT Q4_0 Bench on Strix Halo

Reddit r/LocalLLaMA / 6/6/2026

Key Points

The article benchmarks Google’s official Gemma 4 QAT Q4_0 GGUF models when run locally on a Strix Halo APU using llama.cpp with Vulkan/RADV.
QAT (quantization-aware training) exposes the model to quantization during training, so a Q4 model preserves more of the original behavior than one produced by post-training quantization alone.
The benchmark setup is specified in detail: AMD Ryzen AI Max+ 395 with 128GB unified memory, a Linux Mint/Ubuntu base on Linux kernel 6.17, Mesa/RADV, and an Atomic llama.cpp TurboQuant fork for Vulkan/RADV inference.
Multiple QAT model variants are tested, including Gemma 4 12B, 26B-A4B, and 31B at Q4_0, and their on-disk GGUF sizes are reported.
For “MTP assistant heads,” the author finds that converting Google's matching QAT assistant sources into Atomic/llama.cpp-compatible GGUF heads yields stronger acceptance than reusing non-QAT assistant heads, and notes that the converted heads require extra GGUF metadata.
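
To make the Q4_0 and QAT points above more concrete, here is a minimal Python sketch, not taken from the post. It relies on the ggml Q4_0 layout (blocks of 32 weights, each stored as one fp16 scale plus 32 four-bit codes, i.e. 4.5 bits per weight). The helper names `estimate_q4_0_gb` and `fake_quantize_block` are hypothetical, the parameter counts are nominal values read off the model names, and the rounding is a simplified symmetric scheme rather than ggml's exact one. The size estimate is only a rough floor, since real GGUF files usually keep some tensors at higher precision; it does not reproduce the sizes reported in the post.

```python
import random

BLOCK_SIZE = 32            # weights per Q4_0 block in ggml
BYTES_PER_BLOCK = 2 + 16   # one fp16 scale + 32 four-bit codes


def bits_per_weight() -> float:
    """Storage cost of Q4_0 in bits per weight (18 bytes per 32 weights)."""
    return BYTES_PER_BLOCK * 8 / BLOCK_SIZE


def estimate_q4_0_gb(n_params: float) -> float:
    """Rough size in GB (10^9 bytes) if every weight were stored as Q4_0.

    Assumption: ignores metadata and tensors kept at higher precision.
    """
    return n_params * bits_per_weight() / 8 / 1e9


def fake_quantize_block(block):
    """Quantize-then-dequantize one block with a shared scale.

    Simplified symmetric 4-bit scheme with integer codes in [-7, 7];
    this is an illustration, not ggml's exact Q4_0 rounding.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 7.0
    return [max(-7, min(7, round(x / scale))) * scale for x in block]


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names (assumption).
    for name, n in [("12B", 12e9), ("26B-A4B", 26e9), ("31B", 31e9)]:
        print(f"{name}: about {estimate_q4_0_gb(n):.2f} GB at pure Q4_0")

    rng = random.Random(0)
    block = [rng.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]
    restored = fake_quantize_block(block)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"worst round-trip error in one block: {worst:.4f}")
```

QAT generally applies a quantize-dequantize step of this kind in the forward pass during training, so the weights settle into values that survive the rounding; post-training quantization applies the rounding only after training has finished.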