I got 3× faster HFQ4 prefill on Strix Halo in hipfire with an opt-in MMQ path

Reddit r/LocalLLaMA / 4/28/2026


Key Points

A contributor added an experimental, opt-in MMQ-style prefill path for HFQ4-G256 in hipfire, focused on RDNA GPUs used for LLM inference.
The new approach pre-quantizes prefill activations into a Q8_1 MMQ layout and computes using i8 WMMA over 128×128 tiles with LDS staging, shifting prefill work into more GPU-friendly tiled matrix-matrix kernels.
On a Strix Halo system (gfx1151), enabling `HIPFIRE_MMQ=1` boosts longer-prefill throughput from roughly 310–340 tok/s to about 1140–1260 tok/s (about 3.5–3.9× in the pp 1024 and pp 2048 benchmark rows).
The MMQ path is not enabled by default and is limited to specific RDNA3/RDNA3.5 GPU targets (gfx1100, gfx1101, gfx1102, gfx1103, gfx1150, gfx1151), with benchmark results shown for Qwen3.5 9B HFQ4/MQ4.
The author is seeking independent validation from other AMD users and notes the implementation is similar in shape to llama.cpp’s AMD MMQ prompt-processing path.

I recently contributed an experimental HFQ4-G256 MMQ prefill path to hipfire, an RDNA-focused LLM inference engine.

Disclaimer: I authored the PR, so this is partly a contribution note, but I am mainly looking for independent validation from other AMD users.

Before this PR, HFQ4 prefill in hipfire went through a more generic, slower path. On my Strix Halo system, prompt processing was clearly the bottleneck: longer prefills ran at roughly 310–340 tok/s.

The new path adds an opt-in MMQ-style prefill implementation. In this context, MMQ means a specialized quantized matrix-multiplication path: instead of treating prefill like a less optimized sequence of operations, it packs the work into tiled matrix-matrix kernels that are better suited for GPU execution. The implementation pre-quantizes prefill activations into a Q8_1 MMQ layout and uses i8 WMMA over 128×128 output/batch tiles with LDS staging.
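To make the pre-quantization step concrete, here is a minimal CPU sketch in Python. It assumes a ggml-style Q8_1 block (32 int8 values plus a scale `d` and a scaled sum `s`) and a hypothetical affine 4-bit weight form (`w = w_scale * q + w_min`). Neither is confirmed to match hipfire's Q8_1 MMQ layout or the real HFQ4-G256 encoding, and the function names are made up for illustration. The point is only that, once activations are int8, the inner product becomes an integer accumulation with one floating-point correction per block.

```python
# Sketch only: block size 32 and the (d, s) pair follow ggml-style Q8_1.
# hipfire's actual Q8_1 MMQ layout and HFQ4-G256 weight format may differ.

def quantize_q8_1(xs, block=32):
    """Quantize floats into int8 blocks; returns one (d, s, qs) tuple per block."""
    assert len(xs) % block == 0
    out = []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0 if amax > 0 else 1.0  # per-block scale
        qs = [max(-127, min(127, round(v / d))) for v in chunk]
        s = d * sum(qs)  # approximates the sum of the original activations
        out.append((d, s, qs))
    return out


def dot_q4_q8(w_q, w_scale, w_min, act_blocks):
    """Dot product of hypothetical affine 4-bit weights with Q8_1 activations."""
    total = 0.0
    pos = 0
    for d, s, qs in act_blocks:
        acc = 0  # integer accumulator, the part a WMMA unit would compute
        for q_a in qs:
            acc += w_q[pos] * q_a
            pos += 1
        # sum((w_scale*q + w_min) * (d*q_a)) = w_scale*d*acc + w_min*s
        total += w_scale * d * acc + w_min * s
    return total
```

The stored sum `s` is what lets the weight offset be applied once per block instead of once per element.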

After enabling it with:

HIPFIRE_MMQ=1

I see longer-prefill throughput of roughly 1140–1260 tok/s on Strix Halo / gfx1151.
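For anyone who wants to script an A/B comparison, a small Python helper along these lines may be convenient. It is a sketch: `run_with_mmq` is a hypothetical name, and the benchmark command is left to the caller because hipfire's CLI is not shown in this post. The "off" case removes the variable entirely rather than setting it to `0`, since how other values are interpreted is not stated here.

```python
import os
import subprocess
import sys


def run_with_mmq(cmd, enabled):
    """Run cmd with HIPFIRE_MMQ=1 set (enabled) or removed from the environment."""
    env = dict(os.environ)
    if enabled:
        env["HIPFIRE_MMQ"] = "1"
    else:
        env.pop("HIPFIRE_MMQ", None)  # "off" means unset, not "0" (assumption)
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    # Stand-in command that only echoes the variable; replace it with your
    # own benchmark invocation.
    probe = [sys.executable, "-c",
             "import os; print(os.environ.get('HIPFIRE_MMQ', 'unset'))"]
    for flag in (False, True):
        print(flag, run_with_mmq(probe, flag).stdout.strip())
```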

What changed:

Adds an opt-in HIPFIRE_MMQ=1 path for HFQ4-G256 prefill.
Targets RDNA3 / RDNA3.5 for now: gfx1100, gfx1101, gfx1102, gfx1103, gfx1150, gfx1151.
Pre-quantizes prefill activations into a Q8_1 MMQ layout.
Uses i8 WMMA over 128×128 output/batch tiles with LDS staging.
Similar in shape to llama.cpp’s AMD MMQ prompt-processing path.
Not enabled by default.
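The tiling idea in the list above can be illustrated with a plain Python loop nest. This is a CPU sketch of the loop structure only, not the kernel: the real path uses i8 WMMA instructions and LDS, while here "staging" is just a slice copy. The function name and the `tile` and `k_step` parameters are hypothetical.

```python
def tiled_matmul_i8(a, b, tile=128, k_step=32):
    """C = A @ B for integer matrices, one (tile x tile) output block at a time.

    a: M x K list of int rows, b: K x N list of int rows.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # output tile rows
        for j0 in range(0, n, tile):      # output tile columns
            i1, j1 = min(i0 + tile, m), min(j0 + tile, n)
            for k0 in range(0, k, k_step):
                k1 = min(k0 + k_step, k)
                # Stage the K-slice for this tile (stand-in for a copy into LDS).
                a_stage = [row[k0:k1] for row in a[i0:i1]]
                b_stage = [row[j0:j1] for row in b[k0:k1]]
                for i, a_row in enumerate(a_stage):
                    c_row = c[i0 + i]
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_stage[kk]):
                            c_row[j0 + j] += a_val * b_val  # integer accumulate
    return c
```

Each staged slice is reused by every output element in the tile, which is why larger tiles reduce memory traffic relative to a row-at-a-time path.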

Benchmark: Qwen3.5 9B HFQ4/MQ4 on Strix Halo / gfx1151

KV mode	pp (prompt tokens)	MMQ off (tok/s)	MMQ on (tok/s)	Speedup
q8	256	363.1	1127.6	3.11x
q8	512	352.0	1179.8	3.35x
q8	1024	328.9	1222.7	3.72x
q8	2048	318.2	1168.5	3.67x
asym4	256	368.6	1108.8	3.01x
asym4	512	360.7	1173.3	3.25x
asym4	1024	333.9	1223.0	3.66x
asym4	2048	312.3	1151.7	3.69x
asym3	256	361.4	1124.5	3.11x
asym3	512	359.8	1187.3	3.30x
asym3	1024	329.9	1259.1	3.82x
asym3	2048	314.1	1216.5	3.87x
asym2	256	374.0	1116.2	2.98x
asym2	512	356.6	1173.2	3.29x
asym2	1024	340.1	1208.5	3.55x
asym2	2048	311.4	1142.9	3.67x

So on longer prefills (pp 1024 and 2048), this moved my Strix Halo results from roughly 311–340 tok/s to 1143–1259 tok/s.
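The speedup column and the ranges quoted above can be recomputed from the table with a few lines of Python. The numbers are copied from the table. Treating "longer prefills" as pp ≥ 1024 is my reading of the summary, not a definition from the benchmark itself.

```python
# (kv_mode, pp, mmq_off_tok_s, mmq_on_tok_s), copied from the benchmark table
ROWS = [
    ("q8", 256, 363.1, 1127.6), ("q8", 512, 352.0, 1179.8),
    ("q8", 1024, 328.9, 1222.7), ("q8", 2048, 318.2, 1168.5),
    ("asym4", 256, 368.6, 1108.8), ("asym4", 512, 360.7, 1173.3),
    ("asym4", 1024, 333.9, 1223.0), ("asym4", 2048, 312.3, 1151.7),
    ("asym3", 256, 361.4, 1124.5), ("asym3", 512, 359.8, 1187.3),
    ("asym3", 1024, 329.9, 1259.1), ("asym3", 2048, 314.1, 1216.5),
    ("asym2", 256, 374.0, 1116.2), ("asym2", 512, 356.6, 1173.2),
    ("asym2", 1024, 340.1, 1208.5), ("asym2", 2048, 311.4, 1142.9),
]

speedups = [on / off for _, _, off, on in ROWS]

# "Longer prefills" taken as pp >= 1024 (assumption).
long_rows = [r for r in ROWS if r[1] >= 1024]
off_range = (min(r[2] for r in long_rows), max(r[2] for r in long_rows))
on_range = (min(r[3] for r in long_rows), max(r[3] for r in long_rows))

for (kv, pp, _, _), s in zip(ROWS, speedups):
    print(f"{kv:>6} pp{pp:<5} {s:.2f}x")
print("MMQ off, long prefills:", off_range)
print("MMQ on,  long prefills:", on_range)
```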

Correctness validation so far:

batched prefill compared against sequential token-by-token forward pass
final prefill top token match
selected-logit drift within tolerance
next decode step after prefill also checked, to catch KV-cache write problems
tested across q8, asym4, asym3, asym2 KV modes
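The checks listed above roughly follow the pattern below, shown as a standalone Python sketch. The function names, the dictionary keys, and the `atol=0.05` tolerance are hypothetical; the actual tolerance and test harness in the PR are not described in this post.

```python
def argmax(xs):
    """Index of the largest value."""
    return max(range(len(xs)), key=xs.__getitem__)


def compare_logits(ref, test, selected=None, atol=0.05):
    """Compare reference (sequential) logits against batched-prefill logits."""
    idx = list(selected) if selected is not None else list(range(len(ref)))
    max_drift = max(abs(ref[i] - test[i]) for i in idx)
    top_match = argmax(ref) == argmax(test)
    return {"top_match": top_match, "max_drift": max_drift,
            "ok": top_match and max_drift <= atol}


def check_run(ref, test, selected=None, atol=0.05):
    """Check both the final prefill logits and the next decode step.

    ref/test are dicts with "prefill" and "decode" logit lists. The decode
    comparison is there to surface KV-cache write problems that the prefill
    output alone would not show.
    """
    return {
        stage: compare_logits(ref[stage], test[stage], selected, atol)
        for stage in ("prefill", "decode")
    }
```

A run would count as passing only if both stages report `ok` for every KV mode under test.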

Caveats:

validated by me mainly on one Strix Halo / gfx1151 system
the path is experimental
it is not enabled by default
I would not call this the final/canonical MMQ implementation yet
more coherence and long-context testing would be useful

The maintainer also tested the merged path on gfx1100 and reported that HIPFIRE_MMQ=1 runs cleanly there, with a smaller but still positive result: +19.8% on a 4B model at pp256.

What I would especially like to check now is whether this implementation generalizes well across other AMD GPUs and APUs, or whether the current tuning is mostly favorable to Strix Halo / gfx1151.

The basic correctness checks pass, but I am not yet fully confident that the KV-cache behavior is completely bulletproof. Subtle KV-cache issues might only appear in longer real workloads, so I would especially appreciate validation on long-context and multi-turn runs.

I would be very interested in results from people with:

7900 XTX / gfx1100
other RDNA3 cards
Strix Halo / gfx1151
RDNA3.5 APUs, and other AMD hardware
long-context agentic workloads where prefill matters more than short chat decode

PR: https://github.com/Kaden-Schutt/hipfire/pull/73

submitted by /u/Own_Suspect5343
