I got into the LLM world not long ago, and the first thing I did was buy a Mac Studio M3 Ultra with 512GB (thank god I managed to buy it before that configuration stopped being available).
As soon as I got it, I rushed to install OpenCode and the just-released Qwen3.5 series, with all the hype around it.
I ran several real-world tasks that require architecture, coding and debugging.
As a newbie, I had read that MLX models are optimized for Apple silicon chips and would give me all the wonderful benefits of the silicon architecture.
The disappointing part: as soon as I started working on real-world tasks that require multiple files, debugging sessions and MCP calls, prompt processing became unbearably slow.
I spent many hours sitting in front of the monitor, watching the LM Studio server log's "prompt processing %" crawl slowly to 100%.
It got to the point where I honestly thought local agentic coding was not realistic on a Mac and that it should be run on a 4 x 6000 Pro setup.
The other day I ran into a Reddit post saying Mac users should update llama.cpp for the Qwen3.5 benefits. I was thinking to myself, "llama? Why? Isn't MLX the best option for Mac?" Well, apparently not!
In my runs, prompt processing with the unsloth/qwen3.5 models on llama.cpp is way, way better than with MLX at large context, and the bigger the context gets, the bigger the gap.
Token generation? llama.cpp keeps a stable TG, while on MLX the TG decreases as the context window grows.
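If you want to put numbers on the prompt processing (PP) and token generation (TG) speeds instead of eyeballing the log, here is a minimal Python sketch. It assumes llama-server is running on the host and port from this post and that its `/completion` response carries a `timings` object with `prompt_n`, `prompt_ms`, `predicted_n` and `predicted_ms`. The helper names `summarize_timings` and `measure` are made up for this example, and it only measures the llama.cpp side, not LM Studio/MLX.

```python
import json
import urllib.request

# Assumption: llama-server is listening on the host/port used in this post.
SERVER_URL = "http://127.0.0.1:8080"


def summarize_timings(timings):
    """Return (prompt tokens/s, generated tokens/s) from a `timings` dict."""
    pp = timings["prompt_n"] / (timings["prompt_ms"] / 1000.0)
    tg = timings["predicted_n"] / (timings["predicted_ms"] / 1000.0)
    return pp, tg


def measure(prompt, n_predict=128, url=SERVER_URL):
    """POST one request to /completion and return its (PP, TG) speeds."""
    # cache_prompt is turned off so the whole prompt is processed and timed.
    body = json.dumps(
        {"prompt": prompt, "n_predict": n_predict, "cache_prompt": False}
    ).encode()
    req = urllib.request.Request(
        url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return summarize_timings(json.load(resp)["timings"])
```

With the server up, calling `measure(...)` with prompts of increasing length should show whether TG really stays flat as the context grows on your machine.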
Additionally, prompt caching just feels like working technology on llama.cpp. I managed to run a fast, working workflow with OpenCode + llama.cpp + Qwen3.5 35B (for speed) / 122B (for quality), and it felt smooth.
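One rough way to check that the prompt cache is actually kicking in: send a long prompt, then send the same prompt plus a short follow-up, and compare how many prompt tokens the server had to process each time. This is only a sketch under assumptions: it relies on the `/completion` endpoint accepting `cache_prompt` and reporting the processed prompt tokens as `timings.prompt_n`, and the helper names `processed_prompt_tokens` and `reprocessed_fraction` are invented for illustration.

```python
import json
import urllib.request

# Assumption: llama-server is listening on the host/port used in this post.
SERVER_URL = "http://127.0.0.1:8080"


def processed_prompt_tokens(prompt, url=SERVER_URL):
    """Send `prompt` with caching on; return the prompt tokens processed."""
    body = json.dumps(
        {"prompt": prompt, "n_predict": 1, "cache_prompt": True}
    ).encode()
    req = urllib.request.Request(
        url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["timings"]["prompt_n"]


def reprocessed_fraction(cold_prompt_n, warm_prompt_n):
    """Share of the first (cold) prompt's work that the second call repeated."""
    if cold_prompt_n <= 0:
        return 0.0
    return warm_prompt_n / cold_prompt_n
```

For example, `cold = processed_prompt_tokens(long_text)` followed by `warm = processed_prompt_tokens(long_text + "\nOne more question.")`: if the cache works, `reprocessed_fraction(cold, warm)` should come out close to 0. If the first call was itself served from an earlier cache, the ratio will be misleading.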
Why did I make this post?
1. To share the findings. If you are a Mac user, you should build the latest llama.cpp version and give it a try.
2. I'm a newbie and I could be completely wrong. If anyone has a correction for my situation, I would love to hear your advice.
llama-server command:

./llama-server \
  -m 'path to model' \
  --host 127.0.0.1 \
  --port 8080 \
  --jinja \
  -ngl all \
  -np 1 \
  -c 120000 \
  -b 2048 \
  -ub 2048 \
  -t 24 \
  -fa on \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --reasoning auto

Any type of advice/information would be awesome for me and for many.