A year ago I would just read about the 397B league of models. Today I can run one on my laptop. The combination of an importance matrix (imatrix) with Unsloth's per-model adaptive layer quantization is what makes it all possible. But I didn't start with 397B, I started with 17 smaller models. There was a lot of great feedback in the "M5 Max 128GB, 17 models, 23 prompts: Qwen 3.5 122B is still a local king" discussion. I used Gemma 4 to organize all of it into actions, and Gemma and I created a list to work through to address the feedback and the asks: https://github.com/tolitius/cupel/issues/1 One of the asks was to take "

After downloading Qwen 397B, before doing anything else I wanted to understand what it is I am going to ask my laptop to swallow. Now I knew it is 106GB. The original 16-bit model is 807GB; if it were "just" quantized to 2 bits it would take (397B * 2 bits) / 8 = ~99 GB, but I am looking at 106GB, so I wanted to look under the hood at the actual quantization recipe the Unsloth team followed:
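The size arithmetic in the paragraph above can be sketched as a back-of-the-envelope check (this is not Unsloth's actual recipe; the 106 GB figure is simply the on-disk size reported in the post):

```python
# Back-of-the-envelope check of the sizes mentioned above.
params = 397e9                          # total parameters

fp16_gb = params * 2 / 1e9              # 2 bytes/weight -> ~794 GB
flat_2bit_gb = params * 2 / 8 / 1e9     # ~99 GB if every weight were 2-bit
observed_gb = 106                       # size of the quantized files on disk

# the ~7 GB gap means the average is above 2 bits/weight:
eff_bits = observed_gb * 1e9 * 8 / params

print(f"fp16: ~{fp16_gb:.0f} GB")
print(f"flat 2-bit: ~{flat_2bit_gb:.0f} GB")
print(f"effective: ~{eff_bits:.2f} bits/weight")
```

The ~2.14 bits/weight average is consistent with an adaptive scheme that keeps some tensors above 2 bits while pushing the bulk of the expert weights down to the IQ2 level.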
super interesting. the expert tensors ("

trial by fire

By trial and error I found that 16K of context would be the sweet spot for the 128GB of unified memory, but the GPU memory limit needs to be moved up a little to fit it (it is around 96GB by default): "

My current use case, as I described in the previous reddit discussion, is finding the best model assembly to help me make sense of my kids' school work and progress, because if anything is super messy in terms of organization, the variety of disconnected systems where the kids' data lives, and communication inconsistencies, it is US public schools. A small army of Claude Sonnets does it well'ish, but it is really expensive, hence "

In order to figure out which local models "do good" I used cupel: https://github.com/tolitius/cupel, and that is the next step: fire it up and test "

And, after all the tests, I found "

It is on par with "

What surprised me the most is the 29 tokens per second average generation speed; this is one of the examples from '

The disadvantages I can see so far:
But.. 512 experts, 397B of stored knowledge, 17B active parameters per token, and all of that at 29 tokens per second on a laptop.
[cupel] M5 Max 128GB: Qwen3.5-397B IQ2 @ 29 tokens per second
Reddit r/LocalLLaMA / 4/13/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The post claims that running the very large Qwen3.5-397B model locally on an M5 Max 128GB MacBook has become feasible by combining Unsloth’s per-model adaptive layer quantization with an “importance matrix” (imatrix) approach.
- It details the author’s process of building up from smaller Qwen models and organizing community feedback via Gemma, culminating in a tracked issue list on GitHub.
- The author tests the specific Unsloth “UD” quantized variant (e.g., Qwen3.5-397B-A17B-UD-IQ2_XXS), explaining that different layers receive different quantization levels, with the most important layers rounded to reduce loss/error.
- Practical measurements are provided, including model file size expectations versus observed total sizes on disk, and a command sequence (ll/gguf-dump) to inspect the quantization recipe used inside the GGUF files.
- The author reports that the quantized model comes to ~106GB on disk, demonstrates inspecting per-tensor quantization bit-widths and roles, and measures a throughput of ~29 tokens per second for the setup.
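The reported ~29 tokens per second can be sanity-checked against memory bandwidth: each generated token has to stream the 17B active parameters at the model's effective bit-width. This is a rough lower-bound estimate that ignores KV-cache reads, router overhead, and any expert reuse across tokens:

```python
# Rough lower bound on the memory bandwidth implied by 29 tok/s.
active_params = 17e9                     # active parameters per token
eff_bits = 106e9 * 8 / 397e9             # ~2.14 bits/weight from disk size

gb_per_token = active_params * eff_bits / 8 / 1e9   # ~4.5 GB of weight reads
tok_per_s = 29
required_gb_s = gb_per_token * tok_per_s            # ~132 GB/s sustained

print(f"~{gb_per_token:.1f} GB/token -> ~{required_gb_s:.0f} GB/s")
```

A sustained ~132 GB/s of weight reads is well within what Apple's recent unified-memory parts deliver, which is why a sparse MoE of this size remains usable where a dense 397B model at the same bit-width would not.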