Disclaimer: I am fairly new to running local LLMs, but I like to know, measure, and build things. I kept seeing "use MLX on Mac, it's 2x faster" everywhere, so I loaded Qwen3.5-35B-A3B onto the used M1 Max 64GB I bought and timed actual tasks. GGUF was faster at document classification and MLX was not much faster in multi-turn agent conversations. That sent me down a rabbit hole.

The tok/s number everyone quotes only measures generation (tokens produced one at a time). It ignores prefill (processing the entire input before the first token appears). Prefill scales with context size; generation doesn't. At 8.5K tokens of context, prefill was 94% of MLX's total response time. That's super misleading: even though your counter says "fast", it's slow in practice.
The table shows that prefill dominates and that effective tokens per second (the throughput the user actually experiences) plummets as context grows. And 8K is not even that big. So the hyped 60-200 tok/s numbers flying around are quite far from the end-user experience.

Where MLX still wins: long output with short context. For creative, single-prompt inference it's very fast. In day-to-day workloads, though, like an 8-turn agent conversation with 300-400-token replies, results swing back and forth. MLX wins most turns because its 2x generation speed compensates for slower prefill when there's enough output; GGUF takes turn 6, MLX takes turn 8. At those output lengths it's basically a coin flip that depends on how much the model writes per turn. GGUF again wins for long input prompts and shorter outputs, like my document classification use case.

Did a full write-up, if anyone is interested. Setup: Mac Studio M1 Max, 64 GB. LM Studio 0.4.5. Qwen3.5-35B-A3B, MLX 4-bit vs GGUF Q4_K_M. Warm model, temperature 0.6, thinking mode off.

I only have M1 Max data. M2 through M5 have higher memory bandwidth, which should directly improve prefill. Curious whether the gap narrows or widens on newer silicon. What am I missing? I found some tuning parameters to try for optimizing prefill (see repo), so I will give it another round with those and also compare LM Studio with Ollama and bare llama.cpp. Benchmark yourself! It would be great to get more numbers down the road with the scenarios I set up.
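The effective-throughput argument boils down to a simple latency model: total response time is prefill time (proportional to context length) plus generation time (proportional to output length), and effective tok/s is output tokens divided by that total. A minimal sketch, where the prefill and generation speeds are illustrative assumptions rather than the measured numbers from the post:

```python
# Back-of-envelope model of "effective tokens/s".
# prefill_tps and gen_tps below are made-up illustrative values,
# not benchmark results.

def effective_tps(context_tokens, output_tokens, prefill_tps, gen_tps):
    """Output tokens divided by total wall-clock time (prefill + generation)."""
    total_s = context_tokens / prefill_tps + output_tokens / gen_tps
    return output_tokens / total_s

# Short context: generation dominates, so effective tps stays near gen_tps.
print(round(effective_tps(500, 400, prefill_tps=250, gen_tps=60), 1))   # 46.2

# Long context: prefill dominates and effective tps collapses,
# no matter how fast raw generation is.
print(round(effective_tps(8500, 400, prefill_tps=250, gen_tps=60), 1))  # 9.8
```

With identical generation speed, merely growing the context from 500 to 8,500 tokens drags effective throughput from ~46 to ~10 tok/s, which is the same qualitative pattern as the table: the raw generation counter never changes, but the experienced speed falls off a cliff.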
MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads, and effective tokens/s tells a very different story. What am I missing? Help me out with benchmarks and M2 through M5 comparisons.
Reddit r/LocalLLaMA / 3/13/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- The article argues that raw tokens-per-second benchmarks for MLX vs llama.cpp are misleading because they ignore prefill time and context size, which dominate real-world performance.
- Prefill time scales with context length; at around 8.5k tokens, prefill accounted for about 94% of MLX's total response time, so tok/s metrics don't reflect user experience.
- The 'effective tokens per second' metric shows MLX's advantage diminishes as context grows, with measurements across 655 to 8,496 tokens of context revealing that experienced throughput falls well below the advertised generation speed.
- The author notes MLX may still have strengths on long outputs, but overall end-user experience is driven by prefill and context size, making marketing tok/s figures misleading.
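The back-and-forth turn winners in the multi-turn scenario follow directly from the same latency model: a backend with faster generation but slower prefill wins early turns (short context, output-dominated), then loses as accumulated context makes prefill dominate. A hypothetical sketch with made-up speeds, labeled A ("MLX-like": slower prefill, 2x generation) and B ("GGUF-like": the reverse); none of these numbers come from the post:

```python
# Hypothetical per-turn latency comparison for a growing conversation.
# All speeds are illustrative assumptions, not benchmark data.

def turn_latency(context_tokens, output_tokens, prefill_tps, gen_tps):
    """Seconds for one turn: prefill the whole context, then generate."""
    return context_tokens / prefill_tps + output_tokens / gen_tps

context = 600  # initial system prompt + first user message (tokens)
for turn in range(1, 9):
    out = 300 + (turn % 2) * 100  # replies alternate between 400 and 300 tokens
    a = turn_latency(context, out, prefill_tps=200, gen_tps=80)  # "MLX-like"
    b = turn_latency(context, out, prefill_tps=400, gen_tps=40)  # "GGUF-like"
    print(f"turn {turn}: winner {'A' if a < b else 'B'}")
    context += out + 150  # reply plus next user message join the context
```

Under these assumptions A wins the first three turns and B wins the rest: once the conversation history is long enough, the prefill penalty paid on every turn outweighs the 2x generation advantage. Small shifts in reply length move the crossover turn, which is why the post sees the winner flip back and forth.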
