Qwen3.5 MLX vs GGUF Performance on Mac Studio M3 Ultra 512GB

Reddit r/LocalLLaMA / 3/18/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post reports on testing Qwen3.5 MLX vs GGUF performance on a Mac Studio M3 Ultra and finds that prompt processing with MLX is slow for real-world tasks involving multiple files and debugging.
  • The author observes that MLX token generation decreases as the context window grows, whereas llama.cpp maintains stable generation with larger contexts.
  • The author claims that unsloth/qwen3.5 models offer much faster prompt processing than MLX at large contexts, with the gap widening as the context size increases.
  • They describe a fast workflow using OpenCode + llama.cpp + Qwen3.5 (35B for speed / 122B for quality) and recommend Mac users upgrade to the latest llama.cpp version to try this setup, sharing a sample llama-server invocation.

I got into the LLM world not long ago, and the first thing I did was buy a Mac Studio M3 Ultra with 512GB (thank god I managed to buy it before the configuration was no longer available).
As soon as I got it, I rushed to install OpenCode and the just-released Qwen3.5 series, with all the amazing hype around it.
I ran several real-world tasks that require architecture, coding, and debugging.

As a newbie, I read that MLX models are optimized for the Apple silicon chip and was promised the wonderful benefits of the silicon architecture.

Disappointing point: as soon as I got to work on real-world tasks that require multiple files, debugging sessions, and MCP calls, the prompt processing became unbearably slow.
Many hours of sitting in front of the monitor, watching the "prompt processing %" in the LM Studio server log slowly crawl to 100%.

This got me to the point where I honestly thought local agentic coding is not realistic on a Mac and that it should be run on a 4x 6000 Pro setup.

The other day I ran into a reddit post saying Mac users should update llama.cpp for the Qwen3.5 benefits, and I was thinking to myself, "llama? why? isn't MLX the best option for Mac?" Well, apparently not!

Prompt processing with the unsloth/qwen3.5 models is way, way better than MLX at large context, and the bigger the context, the bigger the gap gets.
Token generation? Unlike llama.cpp, which keeps TG stable, on MLX the TG decreases as the context window grows.
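If you want to check PP/TG numbers like these yourself in a repeatable way, llama.cpp ships a llama-bench tool that reports both at several prompt depths. A minimal sketch, assuming a placeholder model path (pass your real GGUF path instead) — -p is prompt tokens (comma-separated values run multiple depths), -n is generated tokens, and the -fa flag here follows llama-bench's 0/1 convention, which may differ between builds:

```shell
#!/bin/sh
# Sketch: measure prompt processing (pp) and token generation (tg) at
# several context depths with llama.cpp's bundled llama-bench.
MODEL="${MODEL:-/path/to/qwen3.5.gguf}"   # placeholder path, override via env

if [ -f "$MODEL" ]; then
  # One pp and tg row per prompt depth; compare how pp t/s holds up as depth grows.
  ./llama-bench -m "$MODEL" -p 2048,16384,65536 -n 128 -fa 1
else
  echo "model not found: $MODEL (set MODEL to a GGUF path to run the benchmark)"
fi
```

Running the same depths against the MLX build (e.g. via LM Studio's server log timings) is what makes the gap visible side by side.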

Additionally, the prompt cache just feels like working technology on llama.cpp. I managed to set up a fast working workflow with OpenCode + llama.cpp + Qwen3.5 35B (for speed) / 122B (for quality), and it felt smooth.
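On the prompt-cache point: when talking to llama-server's native /completion endpoint directly, a request can opt into prompt caching explicitly. A hedged sketch — the field names follow my reading of the llama.cpp server docs, and the prompt text is a placeholder; the script only prints the request body, with the actual curl call left as a comment since it assumes a running server:

```shell
#!/bin/sh
# Sketch: ask llama-server to keep the processed prompt in its KV cache, so the
# next request sharing the same prefix (typical in agentic loops) skips
# re-processing it.
body=$(cat <<'EOF'
{
  "prompt": "You are a coding assistant. Placeholder task text goes here.",
  "n_predict": 256,
  "cache_prompt": true
}
EOF
)
echo "$body"
# To send it against the server started above (assumes it is running):
#   curl -s http://127.0.0.1:8080/completion \
#     -H 'Content-Type: application/json' -d "$body"
```

Agentic clients like OpenCode resend a mostly-identical prefix on every turn, which is why cache reuse matters so much for this workflow.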

Why did I make this post?
1. To share the findings: if you are a Mac user, you should build the latest llama.cpp version and give it a try.
2. I'm a newbie and I could be completely wrong; if anyone has a correction for my situation, I would love to hear your advice.

llama-server command:

./llama-server \
  -m 'path to model' \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  -ngl all \
  -np 1 \
  -c 120000 \
  -b 2048 \
  -ub 2048 \
  -t 24 \
  -fa on \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --reasoning auto
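One thing worth knowing about -c 120000: the server reserves KV cache for the whole context, and that memory grows linearly with context length. Rough back-of-envelope shell arithmetic — the layer count, KV head count, head size, and fp16 cache here are placeholder assumptions for illustration, not Qwen3.5's actual config:

```shell
#!/bin/sh
# KV cache bytes ≈ 2 (K and V) * ctx * layers * kv_heads * head_dim * bytes_per_element
CTX=120000; LAYERS=64; KV_HEADS=8; HEAD_DIM=128; BYTES=2   # placeholder dims, fp16 cache

kv=$((2 * CTX * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
gib=$((kv / 1024 / 1024 / 1024))
echo "$gib GiB"
```

Even with made-up dimensions, the point stands: a six-figure context claims tens of GiB on top of the weights, which is exactly where a 512GB machine earns its keep.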

any type of advice/information would be awesome for me and for many.

submitted by /u/BitXorBit