[cupel] M5 Max 128GB: Qwen3.5-397B IQ2 @ 29 tokens per second

Reddit r/LocalLLaMA / 4/13/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • The post claims that running the very large Qwen3.5-397B model locally on an M5 Max 128GB MacBook has become feasible by combining Unsloth’s per-model adaptive layer quantization with an “importance matrix” (imatrix) approach.
  • It details the author’s process of building up from smaller Qwen models and organizing community feedback via Gemma, culminating in a tracked issue list on GitHub.
  • The author tests the specific Unsloth “UD” quantized variant (e.g., Qwen3.5-397B-A17B-UD-IQ2_XXS), explaining that different layers receive different quantization levels, with the most important layers rounded to reduce loss/error.
  • Practical measurements are provided, including model file size expectations versus observed total sizes on disk, and a command sequence (ll/gguf-dump) to inspect the quantization recipe used inside the GGUF files.
  • The author reports that the resulting quantized model comes to ~106GB on disk, demonstrates inspecting tensor quantization bit-widths and roles, and measures a throughput of ~29 tokens per second for the setup.

A year ago I would just read about the 397B league of models. Today I can run one on my laptop. The combination of an importance matrix (imatrix) with Unsloth's per-model adaptive layer quantization is what makes it all possible. But I didn't start with 397B; I started with 17 smaller models.

There was a lot of great feedback on the "M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king" discussion.

I used Gemma 4 to organize all the feedback into actions, and Gemma and I created a list of items to work through to address the feedback and the asks: https://github.com/tolitius/cupel/issues/1

One of the asks was to take "Qwen3.5-397B-A17B-UD-IQ2_XXS" for a spin on the M5 Max 128GB MacBook. These Unsloth ("UD") models are really interesting because different layers are quantized differently. On top of that, the most important ("I") weights are rounded to minimize their loss / error.

After downloading Qwen 397B, before doing anything else I wanted to understand what it is I am going to ask my laptop to swallow:

```shell
$ ll -h ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS
total 224361224
-rw-r--r--  1 user  staff   10M Apr 12 18:50 Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf
-rw-r--r--  1 user  staff   46G Apr 12 20:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00003-of-00004.gguf
-rw-r--r--  1 user  staff   14G Apr 12 20:57 Qwen3.5-397B-A17B-UD-IQ2_XXS-00004-of-00004.gguf
-rw-r--r--  1 user  staff   46G Apr 12 21:12 Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf
```

Now I knew it was ~106GB. The original 16-bit model is 807GB. If it were "just" quantized flat to 2 bits it would take (397B * 2 bits) / 8 = ~99GB, but I am looking at 106GB, so I wanted to look under the hood at the actual quantization recipe the Unsloth team followed:
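The size arithmetic is easy to sanity-check (decimal GB; on-disk figures land slightly off these back-of-the-envelope numbers because of metadata and per-block scales):

```shell
# Naive size estimates for a 397B-parameter model:
# 16-bit original vs. a flat 2-bit quantization.
# The UD mix lands above the flat 2-bit number because attention,
# shared-expert, and router tensors keep 4-8 (or even 32) bits.
awk 'BEGIN {
  params = 397e9
  printf "16-bit:     %.0f GB\n", params * 16 / 8 / 1e9
  printf "flat 2-bit: %.0f GB\n", params *  2 / 8 / 1e9
}'
```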

```shell
$ gguf-dump \
    ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00002-of-00004.gguf \
    2>&1 | head -200
```
| Tensor type | Quant | Bits | Role |
|---|---|---|---|
| ffn_gate_exps | IQ2_XXS | ~2.06 | 512 routed experts, gate (bulk of model) |
| ffn_up_exps | IQ2_XXS | ~2.06 | 512 routed experts, up (bulk of model) |
| ffn_down_exps | IQ2_S | ~2.31 | 512 routed experts, down (one step higher) |
| ffn_gate_shexp | Q5_K | 5.5 | shared expert, gate |
| ffn_up_shexp | Q5_K | 5.5 | shared expert, up |
| ffn_down_shexp | Q6_K | 6.56 | shared expert, down |
| attn_gate / attn_qkv | Q5_K | 5.5 | GatedDeltaNet attention (linear attn layers) |
| attn_q / attn_k / attn_v / attn_output | Q5_K | 5.5 | full attention layers (every 4th) |
| ssm_out | Q6_K | 6.56 | GatedDeltaNet output (most sensitive) |
| ssm_alpha / ssm_beta | Q8_0 | 8.0 | GatedDeltaNet gates |
| ssm_conv1d / ssm_a / ssm_dt / ssm_norm | F32 | 32 | small tensors, kept full precision |
| ffn_gate_inp (router) | F32 | 32 | MoE router weights |
| token_embd / output | Q4_K | 4.5 | embedding and lm_head |
| norms | F32 | 32 | all normalization weights |

Super interesting: the expert tensors (ffn_gate_exps, ffn_up_exps and ffn_down_exps) are quantized at ~2 bits, but the rest are kept at much higher precision. This is where the 7GB difference (99GB vs. 106GB) goes: 7GB of packed intelligence on top of the expert tensors.
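If you want the same kind of tally for any GGUF, you can count quant types straight from the dump; a rough sketch (the grep pattern and the assumption that the quant type appears as a token on each tensor line may need adjusting for your gguf-dump version — the printf below stands in for real dump output):

```shell
# Count how many tensors use each quant type in a gguf-dump listing.
# The sample lines stand in for real output of: gguf-dump model.gguf
printf '%s\n' \
  'blk.0.ffn_gate_exps.weight IQ2_XXS' \
  'blk.0.ffn_up_exps.weight IQ2_XXS' \
  'blk.0.ffn_down_exps.weight IQ2_S' \
  'blk.0.attn_q.weight Q5_K' \
| grep -oE 'IQ2_XXS|IQ2_S|Q[4-8]_K|Q8_0|F32' \
| sort | uniq -c | sort -rn
```

For a real model, replace the printf with `gguf-dump <file>.gguf 2>&1`.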

trial by fire

By trial and error I found that a 16K context is the sweet spot for the 128GB of unified memory, but the GPU wired-memory limit needs to be raised a little to fit it (it is around 96GB by default):

```shell
$ sudo sysctl iogpu.wired_limit_mb=122880
```
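For what it's worth, 122880 is just 120GB expressed in MB, i.e. wiring up to 120GB for the GPU and leaving roughly 8GB of the 128GB for macOS itself:

```shell
# iogpu.wired_limit_mb takes megabytes:
# 120 GB * 1024 MB/GB = 122880, leaving ~8 GB un-wired for the OS
echo $(( 120 * 1024 ))
# prints: 122880
```

Note this sysctl resets on reboot, so it is safe to experiment with.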

"llama.cpp" would be the best choice to run this model (since MLX does not quantize to IQ2_XXS):

```shell
$ llama-server \
    -m ~/.llama.cpp/models/Qwen3.5-397B-A17B-UD-IQ2_XXS/UD-IQ2_XXS/Qwen3.5-397B-A17B-UD-IQ2_XXS-00001-of-00004.gguf \
    --n-gpu-layers 99 \
    --ctx-size 16384 \
    --temp 1.0 --top-p 0.95 --top-k 20
```
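Once the server is up, a quick smoke test against llama-server's OpenAI-compatible endpoint (it listens on port 8080 by default; the prompt here is just an example):

```shell
# Minimal chat-completion request; llama-server serves an
# OpenAI-compatible API on http://127.0.0.1:8080 by default
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Summarize GGUF in one sentence."}], "max_tokens": 64}'
```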

My current use case, as I described in the previous reddit discussion, is finding the best model assembly to help me make sense of my kids' school work and progress. If anything is super messy in terms of organization, the variety of disconnected systems where the kids' data lives, and communication inconsistencies, it is US public schools. A small army of Claude Sonnets does it well'ish, but it is really expensive, hence the hope that "Qwen3.5 397B" could be just a drop-in replacement.

In order to make sense of which local models "do good" I used cupel: https://github.com/tolitius/cupel, and that is the next step: fire it up and test "Qwen3.5 397B" on multi-turn, tool use, etc. tasks:

https://preview.redd.it/hoy0uqr75yug1.png?width=2476&format=png&auto=webp&s=0caab1625168f52c74244175843644a600edcf28

And, after all the tests, I found "Qwen3.5 397B IQ2" to be.. amazing. Even at 2 bits, it is extremely intelligent, and is able to call tools, pass context between turns, organize a very messy set of tables into clean aggregates, etc.

It is on par with "Qwen 3.5 122B 4bit", but I suspect I need to work on more exquisite prompts to distill the difference.

What surprised me the most is the 29 tokens per second average generation speed:

```shell
prompt eval time =    269.46 ms /    33 tokens (  8.17 ms per token, 122.46 tokens per second)
       eval time =  79785.85 ms /  2458 tokens ( 32.46 ms per token,  30.81 tokens per second)
      total time =  80055.31 ms /  2491 tokens
slot release: id 1 | task 7953 | stop processing: n_tokens = 2490, truncated = 0
srv  update_slots: all slots are idle
```
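The numbers in that log are easy to cross-check; the generation speed is just tokens over eval time:

```shell
# 2458 tokens generated in 79785.85 ms of eval time
awk 'BEGIN { printf "%.2f tokens per second\n", 2458 / (79785.85 / 1000) }'
# prints: 30.81 tokens per second
```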

This is one of the examples from llama.cpp's logs. The prompt processing speed depends on batching and ranged from 80 tokens per second to 330 tokens per second.
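A rough back-of-the-envelope on why ~30 tokens per second is plausible at all: each generated token only touches the ~17B active parameters, and most of those live in the ~2-bit expert tensors. The ~2.6 bits/weight average below is my assumption for the active path, not a measured number:

```shell
# Bytes read per generated token, assuming ~17B active params at an
# average of ~2.6 bits/weight (assumption; shared-expert and attention
# tensors pull the average above the 2-bit expert floor)
awk 'BEGIN { printf "%.1f GB per token\n", 17e9 * 2.6 / 8 / 1e9 }'
# prints: 5.5 GB per token
```

At ~5.5 GB read per token, 30 tokens per second implies an effective memory read rate of roughly 165 GB/s, which is comfortably within unified-memory bandwidth, with headroom left for routing, KV cache access, and overhead.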

The disadvantages I can see so far:

  • Can't really run it efficiently in the assembly, since it is the only model that fits in memory; with the 122B (65GB) I can still run more models side by side
  • I don't expect it to handle large context well due to hardware memory limitation
  • Theoretically it would have a harder time with very specialized knowledge, where a specific expert is needed but its weights are "too crushed" to give a clean answer. But, just maybe, the "I" in "IQ2_XXS" makes sure the important weights stay very close to their original values
  • Under load I saw the speed drop from 30 to 17 tokens per second. I suspect it is caused by the prompt cache filling up and triggering evictions, but this needs more research

But.. 512 experts, 397B of stored knowledge, 17B active parameters per token and all that at 29 tokens per second on a laptop.

submitted by /u/tolitius