Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM

Reddit r/LocalLLaMA / 4/25/2026


Key Points

  • The author reports successfully running Qwen 3.6 35B-A3B locally on an M2 MacBook Pro with 32GB RAM by using llama.cpp and carefully configuring the setup.
  • They provide a practical step-by-step HOW-TO, including building llama.cpp from source on macOS, setting the PATH, and installing required command-line developer tools.
  • The guide details how to download the correct GGUF model and the matching mmproj (vision adapter) from Hugging Face, then place both files into a local models directory.
  • A key constraint is memory: the author recommends closing most applications (including many browser tabs and possibly Chrome) because Chrome’s RAM usage can prevent the model from fitting reliably.
  • They emphasize the result is a snapshot in time, suggesting revised instructions may follow as their environment and llama.cpp improve.

TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac.

So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my personal take on how well it actually works.

This is a snapshot in time. I'll keep posting revised versions as my setup improves.

HOW-TO

* We're going to use llama.cpp to run the model locally. But these models are really new, and bugs are constantly being fixed, so we need to build llama.cpp from source. This is easier than it sounds.

If you have never done it, install the macOS command line developer tools:

xcode-select --install 

Now you can build llama.cpp:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(sysctl -n hw.logicalcpu)
export PATH="$HOME/llama.cpp/build/bin:$PATH"

* Add that export line to .bashrc or .zshrc so you have access to it every time.
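To confirm the build landed where that export line expects, here's a quick sanity check (it assumes the default build directory from the steps above):

```shell
# Sanity check: are the freshly built binaries visible on PATH?
# Assumes the default build location from the clone/cmake steps above.
export PATH="$HOME/llama.cpp/build/bin:$PATH"
for bin in llama-cli llama-server; do
  if command -v "$bin" >/dev/null 2>&1; then
    echo "$bin: found"
  else
    echo "$bin: not on PATH yet -- check the build output"
  fi
done
```

If either binary reports "not on PATH yet", re-check that the cmake build finished without errors before moving on.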

* Download the model itself. I prefer to just download these directly:

* Create a models/unsloth subdirectory within your home directory (the commands later in this guide expect the files in ~/models/unsloth).

* Go to https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF

* Click UD-IQ4_XS

* Click Download

* Move the downloaded file to ~/models/unsloth

* Go to https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/mmproj-BF16.gguf to download the matching vision adapter

* Click Download (it's there, look closer)

* Move that file into ~/models/unsloth too
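If you'd rather script the downloads than click through the web UI, the standard Hugging Face "resolve" URL pattern should work. The repo and filenames below are the same ones used later in this guide; the URL pattern itself is my assumption about how you'd fetch them directly:

```shell
# Build direct-download URLs for the model and vision adapter using
# the standard Hugging Face resolve-URL form.
REPO="unsloth/Qwen3.6-35B-A3B-GGUF"
MODEL="Qwen3.6-35B-A3B-UD-IQ4_XS.gguf"
MMPROJ="mmproj-BF16.gguf"
for f in "$MODEL" "$MMPROJ"; do
  echo "https://huggingface.co/$REPO/resolve/main/$f"
done
# These are very large downloads -- uncomment when you're ready:
# mkdir -p ~/models/unsloth
# curl -L -o ~/models/unsloth/"$MODEL"  "https://huggingface.co/$REPO/resolve/main/$MODEL"
# curl -L -o ~/models/unsloth/"$MMPROJ" "https://huggingface.co/$REPO/resolve/main/$MMPROJ"
```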

* CLOSE ALL YOUR APPS except Chrome and Terminal. Yes, including VS Code. Close as many browser tabs as you can. For long overnight sessions, close Chrome too. Understand that Chrome uses a lot of RAM, and wasted RAM is the enemy. This model just... barely... fits.

* Test it:

llama-cli -m ~/models/unsloth/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/unsloth/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 

I'll explain why I used each of these options later.

This will launch a simple chat interface, running entirely on your own machine.

Your first query will take a long time! But as long as you don't leave it idle, later responses will start much faster. llama.cpp is designed to stand down and return resources to the system when you're not using it.

* Now add aliases to your .bashrc or .zshrc so you can run either the chat interface or an OpenAI-compatible API server at any time:

alias qwen-server='llama-server -m ~/models/unsloth/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/unsloth/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 --host 127.0.0.1 --port 8899'
alias qwen-chat='llama-cli -m ~/models/unsloth/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/unsloth/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1'

* Run source ~/.bashrc (or source ~/.zshrc), or open a new terminal, so you can start using these aliases right away.

* Start qwen-server.

* In a new terminal window, install opencode. The quickest way to get the latest release is:

curl -fsSL https://opencode.ai/install | bash 

Again, things are changing fast, so the latest release is a good idea. If you want to install by other means or make sure I'm not giving you weird advice, just check out the opencode site.

* I think I had to manually add opencode to my PATH by adding this line to .bashrc or .zshrc:

export PATH=/Users/boutell/.opencode/bin:$PATH 

* Configure opencode to talk to your local model.

Create ~/.config/opencode/opencode.json and populate it:

{
  "$schema": "https://opencode.ai/config.json",
  "tools": { "task": false },
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": { "baseURL": "http://127.0.0.1:8899/v1" },
      "models": {
        "Qwen3.6-35B-A3B-UD-IQ4_XS": {
          "name": "Qwen3.6-35B-A3B-UD-IQ4_XS",
          "limit": { "context": 131072, "output": 49152 },
          "attachment": true,
          "modalities": { "input": ["text", "image"], "output": ["text"] }
        }
      }
    }
  }
}

I'll explain each setting later.
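Before launching opencode, you can sanity-check the server with a raw OpenAI-style request. The URL, port, and model name mirror the config above; the payload shape is just the standard chat-completions format, nothing opencode-specific:

```shell
# A minimal chat-completions request for the local server. The JSON body
# is validated locally first; the curl line assumes qwen-server is running.
PAYLOAD='{"model":"Qwen3.6-35B-A3B-UD-IQ4_XS","messages":[{"role":"user","content":"hello"}],"max_tokens":32}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload: valid JSON"
# With qwen-server running:
# curl -s http://127.0.0.1:8899/v1/chat/completions \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```

If the curl request comes back with a completion, opencode should have no trouble talking to the same endpoint.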

* Now cd into one of your projects and run opencode:

opencode 

* As soon as the opencode UI comes up, CHOOSE THE RIGHT MODEL. Do NOT spend half an hour working with the free default cloud model by mistake. Not that I know anyone who did that. Um.

Specifically, choose this model:

Qwen3.6-35B-A3B-UD-IQ4_XS

If you don't see it, you probably didn't configure opencode.json correctly.

* Say "hello" and wait for a response (again, the first may be very slow, later responses are faster).

* You're all set! Work with opencode much as you would with Claude Code.

THINGS THAT GO WRONG

* If you forget and waste a lot of RAM on Electron apps or even browser tabs, everything will be very slow, or llama-server will crash with out-of-memory errors.

* Once in a while it'll print some XML-flavored thinking trace and just... stop. You can prompt it to continue. This is most likely qwen flubbing the tool call and opencode not having code to gracefully recognize that flavor of response and try again.

"WHY DID YOU CHOOSE THAT QUANTIZED MODEL?"

Macs are incredible because they have unified RAM. Both the CPU and the GPU can see 100% of it. But, 32GB RAM is just super, super tight for these models. It's a miracle they fit at all. You simply must choose a quantized model, even though that means trading off some intelligence and accuracy.

The full-size model would never fit. So first I tried Q4_K_M, which is mentioned in most guides. And that technically fit, but I didn't have enough memory left over for an adequate context size.

The UD-IQ4_XS ("extra small") model gets us back several additional GB of RAM, and we need every one of 'em.
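The savings are easy to ballpark. The bits-per-weight figures below are approximate community numbers for these quant types, not official specs:

```shell
# Rough file-size math: size_GB = params * bits_per_weight / 8 / 1e9
# The bpw values (4.80 and 4.25) are approximations, not official figures.
awk 'BEGIN {
  params = 35e9                                  # 35B total parameters
  printf "Q4_K_M: ~%.1f GB\n", params * 4.80 / 8 / 1e9   # prints Q4_K_M: ~21.0 GB
  printf "IQ4_XS: ~%.1f GB\n", params * 4.25 / 8 / 1e9   # prints IQ4_XS: ~18.6 GB
}'
```

A couple of GB either way, those ballpark numbers line up with the "several additional GB" that make the difference on a 32GB machine.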

"WHY ARE YOU USING EACH OF THOSE OPTIONS?"

That command again:

llama-server -m ~/models/unsloth/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf --mmproj ~/models/unsloth/mmproj-BF16.gguf -c 131072 --batch-size 256 -ngl 99 -np 1 --host 127.0.0.1 --port 8899 

* -m picks the model, of course.

* --mmproj picks the "vision projector" file. You need this if you want to be able to paste screenshots into opencode. With this feature opencode can also potentially take screenshots with playwright and look at them to debug issues.

* -c 131072 sets the context size to 128K. This model goes up to 256K, but memory is just too tight on this machine for that. However, Qwen says you shouldn't go below 128K or the model will get confused. So that is my compromise.

* --batch-size 256 helps limit the system requirements for vision. You can skip it if you leave out --mmproj and the projector file.

* -ngl 99 loads all model layers into VRAM (unified RAM, in the case of a Mac) for best performance.

* -np 1 ensures llama.cpp doesn't try to handle more than one request simultaneously. It will queue them instead. This is important when memory and context are both tight. You might experiment with "-np 2" but I wouldn't go higher.

* --host 127.0.0.1 allows connections only from your own computer.

* --port 8899 selects a port not usually taken by some other service. Just make sure opencode.json matches.
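To see why -c is the option that hurts, here's the usual f16 KV-cache formula. The layer and head counts below are illustrative placeholders, NOT Qwen 3.6's published architecture:

```shell
# An f16 KV cache costs roughly:
#   2 (K and V) * layers * kv_heads * head_dim * ctx * 2 bytes
# layers/kv_heads/head_dim here are placeholders, not Qwen's real numbers.
awk 'BEGIN {
  layers = 48; kv_heads = 4; head_dim = 128; bytes = 2
  for (ctx = 65536; ctx <= 262144; ctx *= 2)
    printf "ctx %6d: ~%.1f GB\n", ctx, 2 * layers * kv_heads * head_dim * ctx * bytes / 1e9
}'
```

With those placeholder numbers, 128K costs about 12.9 GB and 256K about 25.8 GB, so it's easy to see why 256K can't fit next to an ~18 GB model in 32GB of unified RAM.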

"WHY DO YOU USE THESE OPENCODE SETTINGS?"

Most of that is clearly just pointing to the right place (the right API URL with the right port, the right model name).

These settings are more interesting:

"limit": { "context": 131072, "output": 49152 },
"attachment": true,
"modalities": { "input": ["text", "image"], "output": ["text"] }

limit is telling opencode what the context size is and how big a single response from qwen might be, so it can figure out when to compact the session. With a small context window, compaction is obviously mandatory, and if it doesn't happen soon enough, the session fails. I found that without setting a high value for output, the model frequently ran out of context and gave up. Setting output to 49152 solves this.

attachment and modalities are just declaring what this model supports. Without these, plus the mmproj option, opencode won't be able to read your pasted screenshots or look at images created by playwright during testing. If you don't care about image support, you can skip these.
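This is my reading of how the two limits interact, not documented opencode behavior: compaction has to happen before the history plus a worst-case reply would overflow the window, so the output setting effectively shrinks the usable history budget:

```shell
# My reading: opencode must compact before
# (history + worst-case reply) would overflow the context window.
awk 'BEGIN {
  context = 131072; output = 49152
  printf "history budget before compaction: %d tokens\n", context - output   # prints 81920
}'
```

In other words, a big output reservation costs you usable history, but it's what stopped the model from running out of context mid-response.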

"WHY DON'T YOU JUST..."

* Use Claude Code? I had problems due to its lack of optimization for small context windows, and long-running tasks that complete large projects independently matter to me. So no Claude Code.

* Use pi.dev? Yeah I know: it's even better for limited context windows. And saving context is always the dream. It's next on my list.

* Provide a web search tool to the agent? Also on my list.

* Use mlx? The gap between llama.cpp and mlx is getting pretty small, especially if you only have an M2. Also, things tend to get solved for mlx later, and I'm working with Qwen 3.6, which is very new. It might be a little faster, but it won't solve any fundamental problems for me.

GREAT! BUT... HOW GOOD IS IT?

Well...

I've given it two real-world, fair challenges from my actual recent work. These are things that Claude Code was able to complete with Opus 4.6. And from recent experience, I think it would have worked back as far as Opus 4.5. The famous November release. The day a lot of experienced developers like me stopped typing code and started directing Claude Code instead.

One is a pretty simple web app for creating greeting cards. I asked it to find an old bug I'd been too lazy to figure out. The bug had to do with a discrepancy in the positioning of images on the card between the web-based, CSS-driven editor and the pdfkit-based PDF support.

The other is adding SQLite support as an alternative database backend for ApostropheCMS, which defaults to MongoDB.

Now, you would think the first take would be a lot easier. But this model just can't quite wrap its head around the geometry of it. It often names the actual problem (which I know, because Opus already nailed it), but then flails wildly with the implementation. Multiple times now, it has created an implementation that causes the size of the editor to strobe vigorously between two sizes... yes it was painful (but funny). Just once, it kinda fixed it, but added an extra visible space at the bottom of the images and couldn't get rid of it.

So I went on to the second problem. And that, too, was a disappointment at first.

Qwen went through a similar chain of reasoning to Opus: catalog the existing uses of MongoDB's Node.js API in ApostropheCMS, then create an emulation with the same API.

But the first implementation failed to use real JSONB operations, even though I told it to. It would fetch the entire database, then filter documents in RAM. Um... no.

Qwen also flailed trying to get all of the ApostropheCMS unit tests to pass... or really any of them. It would try to trace where various properties came from, but always get stuck, and it started to modify the CMS code itself. Oh HELL no.

I instructed Qwen to NEVER touch the unit tests or the application code, but only the adapter code itself, because if it passes with mongodb, it can pass with an acceptable emulation. Qwen accepted that direction but still couldn't track down the issues.

Honestly the codebase was probably just too much to fathom in this limited context window, although Claude did fine with just twice as much context (256K).

So I gave Qwen a hint, something Opus figured out on its own: start by writing your own test suite for the mongodb API operations, and make sure both adapters pass it. Obviously, if mongodb doesn't pass, you botched the tests themselves.

And... that worked a lot better. Qwen built a real adapter using real JSONB operations. There is a decent little test suite and those tests do pass with both sqlite and real mongodb.

So now I've asked it to go back to iterating on passing the actual ApostropheCMS tests. These are mocha tests too, but they are much closer to functional tests than unit tests because they exercise much of the system. My theory is that, now that the simple stuff has been debugged, Qwen will have more luck tracking down issues at this level of integration.

Or it may just be overwhelmed. We'll see.

So... is it useful?

For some tasks, I'd say yes.

My second task is actually a classic win for AI coding agents: the adapter pattern. "Here's a thing that works, and a huge test suite. Build a compatible thing that passes the same test suite. You're not done until the tests all pass."

And I think Qwen did OK on it, eventually. It required more guidance than Claude Code, but I would still choose it over grinding out that much MongoDB-like query logic by hand.

But my first task was a stumper and shows Qwen can still get stuck in thinking loops, at least at this quantization and context size (I need to be fair here).

Edit: dealing with my second test at its full scale is still a challenge too. An exchange I just had, in the middle of a long autonomous run. I reiterated what I want, but I may find myself back in the same place:

https://preview.redd.it/6jkn4u8okcxg1.png?width=2032&format=png&auto=webp&s=1a9b8e6d56195c41fab2bfbb78b79d71ebfdccb6

My next steps

* Try pi.

* Try providing a web search tool, for reading documentation.

* Try using cloud-hosted Qwen 3.6 35B A3B, without quantization, in order to see what I could get from better but still realistic home hardware.

As we watch the AI financing bubble start to shrink, my wife and I are both asking questions like "can we run this at home? If not, are there other sustainably affordable options?"

It's already cool and useful that my Mac can do this. But running on a dedicated box with a little more RAM (OK, twice as much) and a stronger GPU, it might make the leap from "cool and useful" to routinely offloading some of our tasks from expensive cloud AI providers. My task is to find out if it's good enough to justify the cost... especially when cheap cloud API options like DeepSeek 4 also exist.

Thanks

To the many people who have replied to my past posts with advice: thanks! You helped steer me in the right direction.

submitted by /u/boutell