Qwen3.5 MLX vs GGUF Performance on Mac Studio M3 Ultra 512GB

Reddit r/LocalLLaMA / 3/18/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post reports on testing Qwen3.5 MLX vs GGUF performance on a Mac Studio M3 Ultra and finds that prompt processing with MLX is slow for real-world tasks involving multiple files and debugging.
  • The author observes that MLX token generation decreases as the context window grows, whereas llama.cpp maintains stable generation with larger contexts.
  • The author claims that unsloth/qwen3.5 models offer much faster prompt processing than MLX at large contexts, with the gap widening as the context size increases.
  • They describe a fast workflow using OpenCode + llama.cpp + Qwen3.5 (35B for speed / 122B for quality) and recommend Mac users upgrade to the latest llama.cpp version to try this setup, sharing a sample llama-server invocation.

I got into the LLM world not long ago, and the first thing I did was buy a Mac Studio M3 Ultra with 512GB (thank god I managed to buy it before the configuration was no longer available).
As soon as I got it, I rushed to install OpenCode and the just-released Qwen3.5 series, with all the amazing hype around it.
I ran several real-world tasks that require architecture, coding, and debugging.

As a newbie, I read that MLX models are optimized for the Apple silicon chip and was promised the wonderful benefits of the silicon architecture.

Disappointing point: as soon as I got to work on real-world tasks that require multiple files, debugging sessions, and MCP calls, the prompt processing became unbearably slow.
Many hours of sitting in front of the monitor, watching the "prompt processing %" in the LM Studio server log slowly crawl to 100%.

This got me to the point where I honestly thought local agentic coding is not realistic on a Mac and that it should be run on a 4x 6000 Pro setup.

The other day I ran into a reddit post saying Mac users should update llama.cpp for the Qwen3.5 benefits, and I was thinking to myself, "llama? why? isn't MLX the best option for Mac?" Well, apparently not!

Prompt processing with the unsloth/qwen3.5 models is way, way better than MLX at large context, and the bigger the context, the bigger the gap gets.
Token generation? Unlike llama.cpp, which keeps TG stable, on MLX the TG decreases as the context window grows.
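If you want to check PP/TG numbers like these yourself in a repeatable way, llama.cpp ships a llama-bench tool that reports both at several prompt depths. A minimal sketch, assuming a placeholder model path (pass your real GGUF path instead) — -p is prompt tokens (comma-separated values run multiple depths), -n is generated tokens, and the -fa flag here follows llama-bench's 0/1 convention, which may differ between builds:

```shell
#!/bin/sh
# Sketch: measure prompt processing (pp) and token generation (tg) at
# several context depths with llama.cpp's bundled llama-bench.
MODEL="${MODEL:-/path/to/qwen3.5.gguf}"   # placeholder path, override via env

if [ -f "$MODEL" ]; then
  # One pp and tg row per prompt depth; compare how pp t/s holds up as depth grows.
  ./llama-bench -m "$MODEL" -p 2048,16384,65536 -n 128 -fa 1
else
  echo "model not found: $MODEL (set MODEL to a GGUF path to run the benchmark)"
fi
```

Running the same depths against the MLX build (e.g. via LM Studio's server log timings) is what makes the gap visible side by side.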

Additionally, the prompt cache just feels like working technology on llama.cpp. I managed to set up a fast working workflow with OpenCode + llama.cpp + Qwen3.5 35B (for speed) / 122B (for quality), and it felt smooth.
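On the prompt-cache point: when talking to llama-server's native /completion endpoint directly, a request can opt into prompt caching explicitly. A hedged sketch — the field names follow my reading of the llama.cpp server docs, and the prompt text is a placeholder; the script only prints the request body, with the actual curl call left as a comment since it assumes a running server:

```shell
#!/bin/sh
# Sketch: ask llama-server to keep the processed prompt in its KV cache, so the
# next request sharing the same prefix (typical in agentic loops) skips
# re-processing it.
body=$(cat <<'EOF'
{
  "prompt": "You are a coding assistant. Placeholder task text goes here.",
  "n_predict": 256,
  "cache_prompt": true
}
EOF
)
echo "$body"
# To send it against the server started above (assumes it is running):
#   curl -s http://127.0.0.1:8080/completion \
#     -H 'Content-Type: application/json' -d "$body"
```

Agentic clients like OpenCode resend a mostly-identical prefix on every turn, which is why cache reuse matters so much for this workflow.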

Why did I make this post?
1. To share the findings: if you are a Mac user, you should build the latest llama.cpp version and give it a try.
2. I'm a newbie and I could be completely wrong; if anyone has a correction for my situation, I would love to hear your advice.

llama-server command:

./llama-server \
  -m 'path to model' \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  -ngl all \
  -np 1 \
  -c 120000 \
  -b 2048 \
  -ub 2048 \
  -t 24 \
  -fa on \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --reasoning auto
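One thing worth knowing about -c 120000: the server reserves KV cache for the whole context, and that memory grows linearly with context length. Rough back-of-envelope shell arithmetic — the layer count, KV head count, head size, and fp16 cache here are placeholder assumptions for illustration, not Qwen3.5's actual config:

```shell
#!/bin/sh
# KV cache bytes ≈ 2 (K and V) * ctx * layers * kv_heads * head_dim * bytes_per_element
CTX=120000; LAYERS=64; KV_HEADS=8; HEAD_DIM=128; BYTES=2   # placeholder dims, fp16 cache

kv=$((2 * CTX * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
gib=$((kv / 1024 / 1024 / 1024))
echo "$gib GiB"
```

Even with made-up dimensions, the point stands: a six-figure context claims tens of GiB on top of the weights, which is exactly where a 512GB machine earns its keep.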

any type of advice/information would be awesome for me and for many.

submitted by /u/BitXorBit