Disclaimer: I am fairly new to running local LLMs, but I like to know, measure, and build things. I kept seeing "use MLX on Mac, it's 2x faster" everywhere, so I loaded Qwen3.5-35B-A3B onto the used M1 Max 64GB I bought and timed actual tasks. GGUF was faster at document classification and MLX was not much faster in multi-turn agent conversations. That sent me down a rabbit hole.

The tok/s number everyone quotes only measures generation (tokens produced one at a time). It ignores prefill (processing the entire input before the first token appears). Prefill scales with context size; generation doesn't. At 8.5K tokens of context, prefill was 94% of MLX's total response time. That's super misleading: even though your counter says "fast", it's slow in practice.
The table shows that prefill dominates and that effective tokens per second (the throughput the user actually experiences) plummets as context grows. And 8K is not even that big. So the hyped 60-200 tok/s numbers flying around are quite far from the end-user experience.

Where MLX still wins: long output with short context. For creative, single-prompt inference it's very fast. In day-to-day workloads, though, like an 8-turn agent conversation with 300-400-token replies, results swing back and forth. MLX wins most turns because its 2x generation speed compensates for slower prefill when there's enough output; GGUF takes turn 6, MLX takes turn 8. At those output lengths it's basically a coin flip that depends on how much the model writes per turn. GGUF again wins for long input prompts and shorter outputs, like my document classification use case.

Did a full write-up, if anyone is interested. Setup: Mac Studio M1 Max, 64 GB. LM Studio 0.4.5. Qwen3.5-35B-A3B, MLX 4-bit vs GGUF Q4_K_M. Warm model, temperature 0.6, thinking mode off.

I only have M1 Max data. M2 through M5 have higher memory bandwidth, which should directly improve prefill. Curious whether the gap narrows or widens on newer silicon. What am I missing? I found some tuning parameters to try for optimizing prefill (see repo), so I will give it another round with those and also compare LM Studio with Ollama and bare llama.cpp. Benchmark yourself! It would be great to get more numbers down the road with the scenarios I set up.
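The effective-throughput argument boils down to a simple latency model: total response time is prefill time (proportional to context length) plus generation time (proportional to output length), and effective tok/s is output tokens divided by that total. A minimal sketch, where the prefill and generation speeds are illustrative assumptions rather than the measured numbers from the post:

```python
# Back-of-envelope model of "effective tokens/s".
# prefill_tps and gen_tps below are made-up illustrative values,
# not benchmark results.

def effective_tps(context_tokens, output_tokens, prefill_tps, gen_tps):
    """Output tokens divided by total wall-clock time (prefill + generation)."""
    total_s = context_tokens / prefill_tps + output_tokens / gen_tps
    return output_tokens / total_s

# Short context: generation dominates, so effective tps stays near gen_tps.
print(round(effective_tps(500, 400, prefill_tps=250, gen_tps=60), 1))   # 46.2

# Long context: prefill dominates and effective tps collapses,
# no matter how fast raw generation is.
print(round(effective_tps(8500, 400, prefill_tps=250, gen_tps=60), 1))  # 9.8
```

With identical generation speed, merely growing the context from 500 to 8,500 tokens drags effective throughput from ~46 to ~10 tok/s, which is the same qualitative pattern as the table: the raw generation counter never changes, but the experienced speed falls off a cliff.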
MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads, and effective tokens/s tells a very different story. What am I missing? Help me out with benchmarks and M2 through M5 comparisons.
Reddit r/LocalLLaMA / 3/13/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- The article argues that raw tokens-per-second benchmarks for MLX vs llama.cpp are misleading because they ignore prefill time and context size, which dominate real-world performance.
- Prefill time scales with context length; at around 8.5k tokens, prefill accounted for about 94% of MLX's total response time, so tok/s metrics don't reflect user experience.
- The 'effective tokens per second' metric shows MLX's advantage diminishes as context grows, with measurements across 655 to 8,496 tokens of context revealing that experienced throughput falls well below the advertised generation speed.
- The author notes MLX may still have strengths on long outputs, but overall end-user experience is driven by prefill and context size, making marketing tok/s figures misleading.
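The back-and-forth turn winners in the multi-turn scenario follow directly from the same latency model: a backend with faster generation but slower prefill wins early turns (short context, output-dominated), then loses as accumulated context makes prefill dominate. A hypothetical sketch with made-up speeds, labeled A ("MLX-like": slower prefill, 2x generation) and B ("GGUF-like": the reverse); none of these numbers come from the post:

```python
# Hypothetical per-turn latency comparison for a growing conversation.
# All speeds are illustrative assumptions, not benchmark data.

def turn_latency(context_tokens, output_tokens, prefill_tps, gen_tps):
    """Seconds for one turn: prefill the whole context, then generate."""
    return context_tokens / prefill_tps + output_tokens / gen_tps

context = 600  # initial system prompt + first user message (tokens)
for turn in range(1, 9):
    out = 300 + (turn % 2) * 100  # replies alternate between 400 and 300 tokens
    a = turn_latency(context, out, prefill_tps=200, gen_tps=80)  # "MLX-like"
    b = turn_latency(context, out, prefill_tps=400, gen_tps=40)  # "GGUF-like"
    print(f"turn {turn}: winner {'A' if a < b else 'B'}")
    context += out + 150  # reply plus next user message join the context
```

Under these assumptions A wins the first three turns and B wins the rest: once the conversation history is long enough, the prefill penalty paid on every turn outweighs the 2x generation advantage. Small shifts in reply length move the crossover turn, which is why the post sees the winner flip back and forth.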
