I was investigating why I was not seeing the speed I would expect from quantized models (i.e., they are smaller, so they should be much faster than non-quantized ones) and found this bug report for MLX: https://github.com/ml-explore/mlx/issues/3251
If you know anyone over at Apple, can you get them to prioritize this fix? It will help all AWQ and GPTQ quants.
If you are using models labeled "4-bit INT4", they likely use the mixed 32/64 group sizes that this bug identified.
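If you want to check whether your own model is affected, here is a minimal sketch. It assumes a locally downloaded mlx-lm style model whose config.json carries a "quantization" section (the model path is hypothetical); it just prints the group size per layer so you can spot a 32/64 mix:

```python
import json
from pathlib import Path

# Hypothetical path to a quantized MLX model directory; point this at your own.
model_dir = Path("~/models/my-4bit-model").expanduser()

config = json.loads((model_dir / "config.json").read_text())
quant = config.get("quantization", {})

# Global defaults, e.g. {"group_size": 64, "bits": 4}
print("defaults:", {k: v for k, v in quant.items() if not isinstance(v, dict)})

# Per-layer overrides, where a mixed 32/64 grouping would show up
for name, params in quant.items():
    if isinstance(params, dict):
        print(f"{name}: group_size={params.get('group_size')}, bits={params.get('bits')}")
```

If that prints more than one group size across layers, you are on the mixed-grouping path the issue describes.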