Two weeks ago I posted here that MLX was slower than GGUF on my M1 Max. You gave feedback and pointed out that I had picked possibly the worst model for MLX: broken prompt caching (mlx-lm#903), hybrid attention MLX can't optimize, and bf16 on a chip without native bf16 support. So I went and tested almost all of your hints and recommendations. After the fp16 conversion, most scenarios show only single-digit percentage differences. But it's still not a "just use MLX" decision. Here is Qwen3 30B-A3B effective tok/s (higher is better):
Generation speed is basically tied with this model: 58 tok/s GGUF vs 55–56 tok/s MLX. The "57 vs 29" gap from Part 1 was the model, not the engine. The interesting finding: the runtime matters more than the engine.
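A quick sketch of why "effective tok/s" can separate engines whose raw generation speeds are nearly tied. The exact definition the post uses isn't spelled out here; this assumes the common one, generated tokens divided by total wall time including prefill, so a slow prefill drags the number down even when decode speed is fine:

```python
def effective_tps(n_prompt_tokens: int, n_gen_tokens: int,
                  prefill_tps: float, decode_tps: float) -> float:
    """Effective tok/s: generated tokens over total wall time.

    Prefill time counts against the result, so an engine with a
    slower prefill scores worse even at an identical decode speed.
    """
    prefill_s = n_prompt_tokens / prefill_tps
    decode_s = n_gen_tokens / decode_tps
    return n_gen_tokens / (prefill_s + decode_s)

# Hypothetical numbers: a 4000-token prompt, 500 generated tokens.
# Engine A and B decode at nearly the same speed (58 vs 56 tok/s),
# but B prefills half as fast.
a = effective_tps(4000, 500, prefill_tps=600.0, decode_tps=58.0)
b = effective_tps(4000, 500, prefill_tps=300.0, decode_tps=56.0)
# a is roughly 33 tok/s, b roughly 22 tok/s: a visible gap despite
# the near-tied decode numbers.
```

This is why long-context and multi-turn scenarios can still favor one engine even when the headline generation numbers look like a tie.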
LM Studio adds no overhead compared to raw llama.cpp; I verified this by compiling llama.cpp with Metal support myself. On the MLX side, oMLX is 2.2x faster than LM Studio MLX on multi-turn. But I also tested Gemma 12B, where LM Studio's caching works fine, and there oMLX and LM Studio MLX produce similar numbers. So oMLX fixes caching problems, not MLX performance in general. It's still the best MLX runtime, though.

bf16 fix for anyone on M1/M2: the fp16 conversion takes under a minute, costs no quality, and recovers 40–70% of the prefill penalty. M3+ has native bf16, so this doesn't apply there.

What I came across during research is the MLX quant quality concern: MLX 4-bit and GGUF Q4_K_M are not the same thing, despite both saying "4-bit." GGUF K-quants allocate more bits to sensitive layers, while MLX applies a uniform bit depth. The llama.cpp project measured a 4.7x perplexity difference between uniform Q4_0 and Q4_K_M on a 7B model. I haven't tested this myself yet; it would be interesting to see whether it shows up in real output quality with the models I benchmarked. There is some movement in that area, though: JANG-Q is working on bringing adaptive quantization to MLX.

Where I landed:
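The uniform-vs-adaptive quantization gap described above can be illustrated with a toy numpy experiment (my own sketch, not llama.cpp's or MLX's actual quantization code): a layer containing a few large-magnitude weights loses far more accuracy at a uniform 4 bits than a well-behaved layer, and spending 6 bits on it recovers most of the loss. That is the intuition behind K-quants giving sensitive layers a deeper bit depth:

```python
import numpy as np

def quant_roundtrip(w: np.ndarray, bits: int, group: int = 32) -> np.ndarray:
    """Toy symmetric per-group round-to-nearest quantization."""
    w = w.reshape(-1, group)
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                      # avoid division by zero
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return (q * scale).reshape(-1)

def mse(w: np.ndarray, bits: int) -> float:
    return float(np.mean((w - quant_roundtrip(w, bits)) ** 2))

rng = np.random.default_rng(0)
smooth = rng.normal(0, 1, 4096)   # well-behaved "layer"
outlier = smooth.copy()
outlier[::256] = 25.0             # "sensitive layer": sparse large weights

err_smooth4 = mse(smooth, 4)      # small: scales fit the data well
err_outlier4 = mse(outlier, 4)    # much larger: outliers blow up the scales
err_outlier6 = mse(outlier, 6)    # most of the loss recovered at 6 bits
```

The numbers are synthetic, but the mechanism matches the Q4_0 vs Q4_K_M comparison: with a shared per-group scale, one outlier stretches the quantization step for all its neighbors, and uniform 4-bit has no way to compensate.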
Still looking for M2 and M4 data. Benchmark yourself if you feel like it: contribute results as a pull request and I'll add your hardware, or just use the suite to test your own use case. There's no obligation to contribute, though; a comment with your results and findings, if you happen to run something, would be great. Now enough benchmarking and back to solving actual problems :) Thoughts on this journey? More tips & tricks? Also happy to discuss over the channel linked in my profile. Full writeup with all charts and some research data: famstack.dev/guides/mlx-vs-gguf-part-2-isolating-variables
GGUF (llama.cpp) vs MLX Round 2: Your feedback tested, two models, five runtimes. Ollama adds overhead. My conclusion. Thoughts?
Reddit r/LocalLLaMA / 3/26/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage
Key Points
- The author revisits an earlier benchmark where MLX was slower than GGUF on an M1 Max, then reruns tests after addressing MLX issues (prompt caching limitations, attention optimizations, and bf16 incompatibility) and recompiling llama.cpp to reduce overhead concerns.
- Using two QAT/quantized models (Gemma 12B QAT and Qwen3 30B-A3B) across five runtimes, the results show that after switching bf16 to fp16, most scenarios differ only by single-digit percentages.
- For Qwen3 30B-A3B, GGUF Q4_K_M often delivers the highest effective tokens per second in tasks like creative writing, document classification, and multi-turn ops-agent prompting.
- Generation throughput between GGUF and MLX is largely comparable for the tested setup (e.g., ~58 tok/s GGUF vs ~55–56 tok/s MLX), but prefill/long-context stress still favors GGUF in the presented numbers.
- A key takeaway is that runtime choice can dominate engine-level performance differences (LM Studio vs compiled llama.cpp), and the author concludes there isn’t a simple “always use MLX” rule.
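The bf16-to-fp16 switch mentioned in the points above is just a dtype recast, and a small numpy sketch shows why it is effectively lossless for typical LLM weights (this is my own illustration of the mechanics, not the actual mlx-lm conversion command). bf16 is float32 with the low 16 mantissa bits dropped, so reinterpreting the bit pattern as float32 is exact; the subsequent fp16 cast can only lose values outside fp16's range (|x| > 65504), which trained weights essentially never reach:

```python
import numpy as np

def bf16_bits_to_fp16(bits: np.ndarray) -> np.ndarray:
    """Convert raw bf16 bit patterns (uint16) to fp16.

    Shifting the 16 bf16 bits into the high half of a uint32 and
    viewing it as float32 is an exact reinterpretation, because bf16
    shares float32's sign and 8 exponent bits. The astype to fp16 is
    the only lossy step, and only for values beyond fp16's range.
    """
    as_f32 = (bits.astype(np.uint32) << 16).view(np.float32)
    return as_f32.astype(np.float16)

# Round trip: encode some float32 "weights" as bf16 bit patterns
# (truncation, the simplest bf16 encoding), then recast to fp16.
w = np.array([0.0125, -1.5, 3.0, -0.004], dtype=np.float32)
bf16_bits = (w.view(np.uint32) >> 16).astype(np.uint16)
fp16 = bf16_bits_to_fp16(bf16_bits)
# fp16 matches w to within bf16's own precision (~0.4% relative).
```

This is consistent with the "no quality loss" claim: the fp16 copy carries at least as much mantissa precision as the bf16 original. The caveat is dynamic range, so activations or optimizer states (not an issue for inference-only weight conversion) could overflow where weights do not.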
Related Articles
Voxtral TTS: A frontier, open-weights text-to-speech model that's fast, instantly adaptable, and produces lifelike speech for voice agents.
Mistral AI Blog
Why I Switched from Cloud AI to a Dedicated AI Box (And Why You Should Too)
Dev.to
Anyone who has any common sense knows that AI agents in marketing just don’t exist.
Dev.to
How to Use MiMo V2 API for Free in 2026: Complete Guide
Dev.to
The Agent Memory Problem Nobody Solves: A Practical Architecture for Persistent Context
Dev.to