Gemma 4 QAT Q4_0 Bench on Strix Halo

Reddit r/LocalLLaMA / 6/6/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The article benchmarks Google’s official Gemma 4 QAT Q4_0 GGUF models when run locally on a Strix Halo APU using llama.cpp with Vulkan/RADV.
  • QAT (quantization-aware training) is used to preserve more of a model’s original behavior in a low-precision Q4 setting compared with post-training quantization alone.
  • The benchmark setup lists the host hardware and software used for Vulkan/RADV inference: AMD Ryzen AI Max+ 395, 128GB unified memory, Linux Mint (Ubuntu base), Linux kernel 6.17, Mesa/RADV, and an Atomic llama.cpp TurboQuant fork.
  • Multiple QAT model variants are tested, including Gemma 4 12B, 26B-A4B, and 31B at Q4_0, and their on-disk GGUF sizes are reported.
  • For MTP (multi-token prediction) assistant heads, the author finds that converting Google's matching QAT assistant sources into Atomic/llama.cpp-compatible GGUF heads yields stronger acceptance than reusing non-QAT assistant heads, and notes that the converted heads need extra GGUF metadata.
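
To make the Q4_0 point above concrete, here is a minimal sketch of the block rounding that QAT trains a model to tolerate. It is not taken from the post. It follows the Q4_0 layout as commonly described for llama.cpp (32 weights per block, one 16-bit scale, 4-bit codes), and the function names are made up for illustration. The scale is kept as a Python float here, whereas a real GGUF file stores it in half precision.

```python
BLOCK_SIZE = 32  # weights per Q4_0 block (assumed from the llama.cpp layout)


def quantize_q4_0_block(weights):
    """Map 32 floats to one scale plus 32 four-bit codes in 0..15."""
    assert len(weights) == BLOCK_SIZE
    # The weight with the largest magnitude (sign kept) defines the scale,
    # so that this weight lands exactly on code 0, i.e. -8 * scale.
    extreme = max(weights, key=abs)
    scale = extreme / -8.0
    if scale == 0.0:
        return 0.0, [8] * BLOCK_SIZE  # all-zero block: code 8 decodes to 0
    # w / scale lies in [-8, 8]; adding 8.5 and truncating rounds to nearest,
    # and the clamp folds the single out-of-range value 16 back to 15.
    codes = [min(15, max(0, int(w / scale + 8.5))) for w in weights]
    return scale, codes


def dequantize_q4_0_block(scale, codes):
    """Reconstruct approximate weights from a scale and 4-bit codes."""
    return [(c - 8) * scale for c in codes]


def q4_0_bits_per_weight():
    """32 codes of 4 bits plus one 16-bit scale, averaged over the block."""
    return (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE


def estimate_q4_0_bytes(n_weights):
    """Rough size of n_weights stored purely as Q4_0 blocks."""
    return n_weights * q4_0_bits_per_weight() / 8


if __name__ == "__main__":
    block = [(i - 16) / 40.0 for i in range(BLOCK_SIZE)]
    scale, codes = quantize_q4_0_block(block)
    restored = dequantize_q4_0_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("scale:", scale)
    print("worst round-trip error:", worst)
    print("bits per weight:", q4_0_bits_per_weight())
```

The 4.5 bits per weight figure gives only a rough guide to on-disk size. Real GGUF files usually keep some tensors at other precisions and carry metadata, so the sizes reported in the post should be taken from the post itself, not from this estimate.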
