tl;dr - For software development with Qwen3.6 27B: the 5090 gives you ~3x the speed of an M5 Max, letting you plow through code, while the M5 Max gives you ~4x the memory, letting you use higher-precision quants and bigger context. Which would you choose and why?
I've been researching this topic for a couple of weeks now, but I still can't fully decide one way or the other. I'm hoping to hear other people's opinions, ideally from people who have used this hardware for the type of work I plan to do.
I plan to use Qwen 3.6 27B for software development, ideally removing any reliance on cloud models beyond an occasional API call to Opus/GPT if I really can't figure something out. I have tried running it on an M4 Max MBP, and the code it generated was very good. In terms of speed... pretty bad. I asked it to implement one feature, and it took about an hour and 20 minutes to complete. Granted, this was with a GGUF model, llama-server without much tuning, on a massive repo with no scaffolding, but it's still a very long time to sit and wait.
Now, since the M5 Max would have enough RAM to load multiple models at once, I've thought about using the 27B in an orchestrator role that handles the high-level planning and spins up a 35B A3B subagent to handle the grunt work, e.g. exploring/searching the codebase, maybe even writing code. This would speed things up for sure, and would help keep the main agent's context clean. But I don't know how much it would affect overall output quality, since the 27B is better at writing code.
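For anyone curious what I mean, here's a rough sketch of the split I have in mind, assuming both models are served via llama-server's OpenAI-compatible chat endpoint on separate ports (the ports, prompts, and helper names here are illustrative, not a real framework):

```python
import json
import urllib.request

# Hypothetical local endpoints: 27B orchestrator on 8080, A3B subagent on 8081.
SUBAGENT_URL = "http://localhost:8081/v1/chat/completions"

def subagent_messages(task: str, snippet: str) -> list[dict]:
    """Build a fresh, minimal context for the fast subagent, so the
    orchestrator's own context stays clean."""
    return [
        {"role": "system",
         "content": "You are a code-exploration subagent. Answer concisely."},
        {"role": "user",
         "content": f"Task: {task}\n\nRelevant code:\n{snippet}"},
    ]

def delegate(task: str, snippet: str) -> str:
    """Send grunt work to the small model; return only its summary text."""
    req = urllib.request.Request(
        SUBAGENT_URL,
        data=json.dumps({"messages": subagent_messages(task, snippet)}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The orchestrator would then fold delegate()'s short answers back into its own planning context instead of raw file contents, which is where the context savings come from.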
The M5 Max gets you much better prompt processing (PP) speed than the M4 Max, and slightly better token generation. With newer techniques like MTP, and using MLX, speeds will be much better on the M5 Max than the M4 Max, and could even approach usable speeds for agentic development, though I'm not 100% sure they do. The 128GB of RAM gives me the freedom to use larger models if needed, but my main goal is code; anything else is secondary.
However, the 5090 will decimate the M5 Max in speed, and MTP would widen the gap even further. From my understanding, you could use KV cache offloading to simulate the orchestrator/explorer subagent context windows, effectively giving you the same thing. The only downside is that with 32GB of VRAM, you have to stick with Q4/Q5 and ~200k context (quite a bit less if you want vision, which I do - being able to paste screenshots of errors is a convenience I don't want to lose). Now, people say 128k context is enough, and if so this could be moot, but there's a mental difference between capping yourself at 128k for performance reasons and being physically unable to go beyond it. Who knows, maybe another project will involve ingesting copious amounts of files and genuinely require a bigger context window. I just don't know.
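To make the VRAM squeeze concrete, here's the back-of-envelope math I've been doing. The architecture numbers (layers, KV heads, head dim) are guesses for a hypothetical 27B dense model, not published specs, and the bits-per-weight figure is a rough Q4_K_M-style estimate:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> int:
    # K and V each store n_layers * n_kv_heads * head_dim elements per token.
    return int(2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens)

GIB = 2**30

# Assumed architecture: 48 layers, 8 KV heads (GQA), head_dim 128.
fp16_cache = kv_cache_bytes(200_000, 48, 8, 128, 2)  # ~36.6 GiB
q8_cache   = kv_cache_bytes(200_000, 48, 8, 128, 1)  # ~18.3 GiB

# Q4 weights at roughly 4.5 bits/param: ~14.1 GiB.
weights = int(27e9 * 4.5 / 8)

print(f"fp16 KV cache: {fp16_cache / GIB:.1f} GiB")
print(f"q8 KV cache:   {q8_cache / GIB:.1f} GiB")
print(f"Q4 weights:    {weights / GIB:.1f} GiB")
```

Under these assumptions, an fp16 KV cache alone overflows 32GB, so hitting ~200k context on a 5090 means quantizing the cache (q8 or below) on top of the Q4/Q5 weights, which matches what I've read.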
I'll take price out of the equation, partly because the 5090 would also need additional hardware to support it. I don't mind if it's headless and running Linux to maximize available VRAM. I also don't particularly care about portability - either device will sit at home, running the LLM 24/7 for my other devices to remote into.
Now, I haven't tried either of these devices, and I can't easily get my hands on them to test. The 5090 especially, since it's final sale at every store around me, and an M5 Max at that spec would take weeks to ship. So I'd love to hear from those who've used one or both: which would you prefer, are there pros/cons I'm missing, is there some piece of info that would completely tilt it one way or the other?
Thanks for reading.



