TL;DR: I finally have this working and doing real work within the tight specs of my 32GB RAM Mac. So for those who would like to fly like Julien Chaumond, here's an updated HOW-TO, an explanation of why I did everything I did, and my personal take on how well it actually works. This is a snapshot in time. I'll keep posting revised versions as my setup improves.

HOW-TO

* We're going to use llama.cpp to run the model locally. But these models are really new and bugs are constantly being fixed, so we need to build llama.cpp from source. This is easier than it sounds. If you have never done it, install the macOS command line developer tools. Now you can build llama.cpp.
* Add that to your PATH.
* Download the model itself. I prefer to just download these directly:
  * Create a local models directory.
  * Go to https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
  * Click UD-IQ4_XS
  * Click Download
  * Move the downloaded file to your local models directory.
  * Go to https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/blob/main/mmproj-BF16.gguf to download the matching vision adapter
  * Click Download (it's there, look closer)
  * Move that file into the same directory.
* CLOSE ALL YOUR APPS except Chrome and Terminal. Yes, including vscode. Close as many browser tabs as you can. For long overnight sessions, close Chrome too. Understand that Chrome uses a lot of RAM, and wasted RAM is the enemy. This model just... barely... fits.
* Test it. I'll explain why I used each of these options later. This will launch a simple chat interface, running entirely on your own machine. Your first query will take a long time! But as long as you don't leave it idle, later responses will start much faster. llama.cpp is designed to stand down and return resources to the system when you're not using it.
* Now add aliases to your .bashrc or .zshrc so you can run either the chat interface or an OpenAI-compatible API server at any time.
* Run
* Start
* In a new terminal window, install opencode.
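My actual build-and-launch commands didn't survive the copy-paste into this post, so here's a generic sketch of the llama.cpp steps before we move on to opencode. The repo URL and cmake invocation are the standard ones from the llama.cpp README; the model filenames, context size, and port are illustrative assumptions, not my exact invocation:

```shell
# Generic sketch, NOT the exact commands from the post. File names, the
# context size, and the port are assumptions.
xcode-select --install                     # macOS command line developer tools
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# Binaries land in build/bin. llama-server exposes an OpenAI-compatible API
# that opencode can talk to later.
./build/bin/llama-server \
  -m ~/models/Qwen3.6-35B-A3B-UD-IQ4_XS.gguf \
  --mmproj ~/models/mmproj-BF16.gguf \
  -c 131072 \
  -ngl 99 \
  --port 8080
```

For the chat interface instead of the server, the same flags work with `llama-cli` in place of `llama-server`.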
The quickest way to get the latest release is: (Again, things are changing fast, so the latest release is a good idea. If you want to install by other means, or make sure I'm not giving you weird advice, just check out the opencode site.)
* I think I had to manually add
* Configure opencode to talk to your local model. Create
  I'll explain each setting later.
* Now
* As soon as the opencode UI comes up, CHOOSE THE RIGHT MODEL. Do NOT spend half an hour working with the free default cloud model by mistake. Not that I know anyone who did that. Um. Specifically, choose this model:
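My config file also got eaten by the formatting here, so this is a minimal sketch of what a project-level opencode.json pointing at a local OpenAI-compatible server generally looks like. Every name and number below is my assumption except `output: 49152`, which is the one value discussed later in this post; the real schema is documented on the opencode site:

```shell
# Hypothetical reconstruction, NOT the author's actual config. The provider
# id, model id, baseURL port, and context limit are assumptions;
# output=49152 is the one value the post confirms.
cat > opencode.json <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llamacpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": { "baseURL": "http://127.0.0.1:8080/v1" },
      "models": {
        "qwen3.6-35b-a3b": {
          "name": "Qwen 3.6 35B A3B (local)",
          "limit": { "context": 131072, "output": 49152 }
        }
      }
    }
  }
}
EOF
```

opencode also reads a global config from its config directory; a project-local file like this keeps the experiment contained.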
If you don't see it, you probably didn't configure
* Say "hello" and wait for a response (again, the first may be very slow; later responses are faster).
* You're all set! Work with

THINGS THAT GO WRONG

* If you forget and waste a lot of RAM on electron apps or even browser tabs, it'll be very slow, or
* Once in a while it'll print some XML-flavored thinking trace and just... stop. You can prompt it to continue. This is most likely qwen flubbing the tool call, and opencode not having code to gracefully recognize that flavor of response and try again.

"WHY DID YOU CHOOSE THAT QUANTIZED MODEL?"

Macs are incredible because they have unified RAM: both the CPU and the GPU can see 100% of it. But 32GB of RAM is just super, super tight for these models. It's a miracle they fit at all. You simply must choose a quantized model, even though that means trading off some intelligence and accuracy. The full-size model would never fit. So first I tried Q4_K_M, which is mentioned in most guides. That technically fit, but I didn't have enough memory left over for an adequate context size. The IQ4_XS (Extra Small) quant gets back several additional GB of RAM, and we need every one of 'em.

"WHY ARE YOU USING EACH OF THOSE OPTIONS?"

That command again:

*
*
*
*
*
*
*
*

"WHY DO YOU USE THESE OPENCODE SETTINGS?"

Most of that is clearly just pointing to the right place (the right API URL with the right port, the right model name). These settings are more interesting: limit tells opencode what the context size is and how big a single response from qwen might be, so it can figure out when to compact the session. With a small context window, compaction is obviously mandatory, and if it doesn't happen soon enough, the session fails. I found that without setting a high value for output, the model frequently ran out of context and gave up. Setting output to 49152 solves this.
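As a sanity check on the quantization tradeoff above, weight memory can be ballparked as parameters × bits-per-weight ÷ 8. The 35B parameter count comes from the model name; the bits-per-weight figures are rough community numbers I'm assuming, and the real GGUF files (plus the KV cache your context window needs) add more on top, so treat the file sizes on the Hugging Face page as the ground truth:

```shell
# Back-of-envelope: weight memory ≈ params × bits-per-weight / 8.
# The bpw values below are assumptions for illustration, not measured.
params=35000000000
for entry in "Q4_K_M 4.85" "IQ4_XS 4.25"; do
  set -- $entry   # split into quant name ($1) and bits per weight ($2)
  awk -v p="$params" -v b="$2" -v n="$1" \
    'BEGIN { printf "%-8s ~%.1f GB of weights\n", n, p * b / 8 / 1e9 }'
done
```

Under these assumptions, Q4_K_M lands around 21 GB and IQ4_XS around 18.6 GB of weights alone, which is consistent with "several additional GB" of breathing room for context on a 32GB machine.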
"WHY DON'T YOU JUST..." * Use Claude Code? I had problems due to a lack of optimization for small context windows. Long-running tasks that complete large projects independently matter for me, so no Claude Code. * Use pi.dev? Yeah I know: it's even better for limited context windows. And saving context is always the dream. It's next on my list. * Provide a web search tool to the agent? Also on my list. * Use GREAT! BUT... HOW GOOD IS IT? Well... I've given it two real world, fair challenges from my actual recent work. These are things that Claude Code was able to complete with Opus 4.6. And from recent experience, I think it would have worked back as far as Opus 4.5. The famous November release. The day a lot of experienced developers like me stopped typing code and started directing Claude Code instead. One is a pretty simple web app for creating greeting cards. I asked it to find an old bug I'd been too lazy to figure out. The bug had to do with a discrepancy in the positioning of images on the card between the web-based, CSS-driven editor and the pdfkit-based PDF support. The other is adding SQLite support as an alternative database backend for ApostropheCMS, which defaults to MongoDB. Now, you would think the first take would be a lot easier. But this model just can't quite wrap its head around the geometry of it. It often names the actual problem (which I know, because Opus already nailed it), but then flails wildly with the implementation. Multiple times now, it has created an implementation that causes the size of the editor to strobe vigorously between two sizes... yes it was painful (but funny). Just once, it kinda fixed it, but added an extra visible space at the bottom of the images and couldn't get rid of it. So I went on to the second problem. And that, too, was a disappoint at first. Qwen went through a similar chain of reasoning to Opus: catalog the existing uses of mongodb's Node.js API in ApostropheCMS, create an emulation with the same API. 
But the first implementation failed to use real JSONB operations, even though I told it to. It would fetch the entire database, then filter documents in RAM. Um... no.

Qwen also flailed trying to get all of the ApostropheCMS unit tests to pass... or really any of them. It would try to trace where various properties came from, but always got stuck, and it started to modify the CMS code itself. Oh HELL no. I instructed Qwen to NEVER touch the unit tests or the application code, but only the adapter code itself, because if it passes with mongodb, it can pass with an acceptable emulation. Qwen accepted that direction but still couldn't track down the issues. Honestly, the codebase was probably just too much to fathom in this limited context window, although Claude did fine with just twice as much context (256K).

So I gave Qwen a hint, something Opus figured out on its own: start by writing your own test suite for the mongodb API operations, and make sure both adapters pass it. Obviously, if mongodb doesn't pass, you botched the tests themselves. And... that worked a lot better. Qwen built a real adapter using real JSONB operations. There is a decent little test suite, and those tests do pass with both sqlite and real mongodb.

So now I've asked it to go back to iterating on passing the actual apostrophecms tests. These are mocha tests too, but they are much closer to functional tests than unit tests, because they exercise much of the system. My theory is that, now that the simple stuff has been debugged, Qwen will have more luck tracking down issues at this level of integration. Or it may just be overwhelmed. We'll see.

So... is it useful? For some tasks, I'd say yes. My second task is actually a classic win for AI coding agents: the adapter pattern. "Here's a thing that works, and a huge test suite. Build a compatible thing that passes the same test suite. You're not done until the tests all pass." And I think Qwen did OK on it, eventually.
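The hint that unblocked Qwen (one shared conformance suite, run against every backend) boils down to a loop. This is only a sketch of the shape of that workflow; the function body, the adapter names, and any env-var or path names are stand-ins, not my actual mocha setup:

```shell
# Sketch of the shared-conformance-suite idea: the SAME tests run once per
# backend, so a failure under mongodb means the tests themselves are wrong,
# while a failure under sqlite means only the new adapter is wrong.
run_suite() {
  # Stand-in for something like: DB_ADAPTER="$1" npx mocha test/db-compat/
  # (DB_ADAPTER and that test path are hypothetical names.)
  echo "compat suite passed against $1"
}

for adapter in mongodb sqlite; do
  run_suite "$adapter" || exit 1
done
```

The `|| exit 1` matters: the run stops at the first backend that fails, which is exactly the "you're not done until the tests all pass" contract.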
It required more guidance than Claude Code, but I would still choose it over grinding out that much MongoDB-like query logic by hand. But my first task was a stumper, and it shows Qwen can still get stuck in thinking loops, at least at this quantization and context size (I need to be fair here). Edit: dealing with my second test at its full scale is still a challenge too. An exchange I just had, in the middle of a long autonomous run. I reiterated what I want, but I may find myself back in the same place:

My next steps

* Try pi.
* Try providing a web search tool, for reading documentation.
* Try using cloud-hosted Qwen 3.6 35B A3B, without quantization, to see what I could get from better but still realistic home hardware.

As we watch the AI financing bubble start to shrink, my wife and I are both asking questions like "can we run this at home? If not, are there other sustainably affordable options?" It's already cool and useful that my Mac can do this. But running on a dedicated box with a little more RAM (OK, twice as much) and a stronger GPU, it might make the leap from "cool and useful" to routinely offloading some of our tasks from expensive cloud AI providers. My task is to find out if it's good enough to justify the cost... especially when cheap cloud API options like DeepSeek 4 also exist.

Thanks

To the many people who have replied to my past posts with advice: thanks! You helped point me in the right direction.
Field report: coding with Qwen 3.6 35B-A3B on an M2 Macbook Pro with 32GB RAM
Reddit r/LocalLLaMA / 4/25/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage
Key Points
- The author reports successfully running Qwen 3.6 35B-A3B locally on an M2 MacBook Pro with 32GB RAM by using llama.cpp and carefully configuring the setup.
- They provide a practical step-by-step HOW-TO, including building llama.cpp from source on macOS, setting the PATH, and installing required command-line developer tools.
- The guide details how to download the correct GGUF model and the matching mmproj (vision adapter) from Hugging Face, then place both files into a local models directory.
- A key constraint is memory: the author recommends closing most applications (including many browser tabs and possibly Chrome) because Chrome’s RAM usage can prevent the model from fitting reliably.
- They emphasize the result is a snapshot in time, suggesting revised instructions may follow as their environment and llama.cpp improve.