I was investigating why I was not seeing the speed I would expect from quantized models (i.e., they are smaller, so they should be much faster than non-quantized ones) and found this bug report for MLX: https://github.com/ml-explore/mlx/issues/3251
If you know anyone over at Apple, can you get them to prioritize this fix? It will help all AWQ and GPTQ quants.
If you are using models labeled "4-bit INT4", they likely use the mixed 32/64 group sizes that this bug identified.
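If you want to check whether your own model is affected, here is a minimal sketch. It assumes a locally downloaded mlx-lm style model whose config.json carries a "quantization" section (the model path is hypothetical); it just prints the group size per layer so you can spot a 32/64 mix:

```python
import json
from pathlib import Path

# Hypothetical path to a quantized MLX model directory; point this at your own.
model_dir = Path("~/models/my-4bit-model").expanduser()

config = json.loads((model_dir / "config.json").read_text())
quant = config.get("quantization", {})

# Global defaults, e.g. {"group_size": 64, "bits": 4}
print("defaults:", {k: v for k, v in quant.items() if not isinstance(v, dict)})

# Per-layer overrides, where a mixed 32/64 grouping would show up
for name, params in quant.items():
    if isinstance(params, dict):
        print(f"{name}: group_size={params.get('group_size')}, bits={params.get('bits')}")
```

If that prints more than one group size across layers, you are on the mixed-grouping path the issue describes.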