Is anyone getting real coding work done with Qwen3.6-35B-A3B-UD-Q4_K_M on a 32GB Mac in opencode, claude code or similar?

Reddit r/LocalLLaMA / 4/20/2026

💬 Opinion · Developer Stack & Infrastructure · Signals & Early Trends · Tools & Practical Usage

Key Points

  • A user reports running the Qwen3.6-35B-A3B-UD-Q4_K_M model on a 32GB M2 Mac using llama.cpp and opencode, and has to cap the context window at 32,768 tokens to avoid llama-server crashes from memory exhaustion.
  • In a coding/debugging test that Claude Code previously completed with Opus 4.7, the system identifies the core bug but fails to progress to a reliable implementation because compaction discards too much of the needed context.
  • Disabling subagents helps it survive the first compaction pass, but during the second compaction pass it derails, mis-summarizes the prompt, and even misremembers details like the current working directory name.
  • The user concludes that while Qwen may be relatively efficient for RAM, the practical context limit on this setup is not sufficient for dependable real coding work, suggesting they need a more powerful machine for better results.
  • They ask whether others have achieved better coding outcomes under similar constraints and share their local configuration details for the model server and opencode setup.

I'm running Qwen3.6-35B-A3B-UD-Q4_K_M on an M2 Macbook Pro with 32GB of RAM. I'm using quite recent builds of llama.cpp and opencode.

To avoid llama-server crashing outright due to memory exhaustion, I have to set the context window to 32768 tokens. This turns out to be important.
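For a rough sense of why the context cap bites on 32GB, here's back-of-envelope KV-cache arithmetic. The model dimensions below are assumed placeholders for illustration, not Qwen3.6-35B-A3B's published specs:

```python
# Rough KV-cache sizing sketch. Layer/head dimensions are ASSUMED
# round numbers, not the actual Qwen3.6-35B-A3B architecture.
n_layers = 48        # assumed transformer layer count
n_kv_heads = 8       # assumed GQA key/value heads
head_dim = 128       # assumed per-head dimension
bytes_per_elem = 2   # f16 KV cache (llama.cpp default)
ctx = 32_768

# factor of 2 covers keys and values
kv_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * ctx
print(f"KV cache at {ctx} tokens: {kv_bytes / 2**30:.1f} GiB")
```

Under these assumed dimensions that's 6 GiB of cache on top of the ~20GB of Q4_K_M weights, which is why the context has to stay small on a 32GB machine that also runs the OS and opencode.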

As a hopefully reasonable test, I gave opencode a task that Claude Code was previously able to complete with Opus 4.7. The project isn't huge, but the task involves rooting around the front and back end of an application and figuring out a problem that did not jump out at me either (and I was the original developer, pre-AI).

The results are really tantalizing: I can see it has figured out the essentials of the bug. But before it can move on to implementation, compaction always seems to throw out way too much info.

If I disable the use of subagents, it usually survives the first compaction pass with its task somewhat intact, because I'm paying for one context, not two.

But when I get to the second compaction pass, it pretty much always loses its mind. The summary boils down to my original prompt, and it even misremembers the current working directory name (!), coming up with a variant of it that of course doesn't exist. After that it's effectively game over.

After reading a lot about how Qwen is actually better than most models with regard to RAM requirements, and about how most smaller models can't really code competently, I've come to the conclusion that (1) 32768 is the biggest context I can get away with in an adequately smart model, and (2) it just ain't enough. If I want to play this game, I need a more powerful rig.

Has anyone had better results under these or very similar constraints?

(Disclaimer: I'm not hating on Qwen, or Macs, or OpenCode. It's remarkable this stuff runs on my Mac at all. But I'd love to see it be just a little more useful in practice.)

Thanks!

Edit:

Here is my configuration.

My qwen-server alias:

alias qwen-server='llama-server -m ~/models/unsloth/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf -c 32768 -ngl 99 --host 0.0.0.0 --port 8080' 

My opencode config:

{
  "$schema": "https://opencode.ai/config.json",
  "tools": { "task": false },
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": { "baseURL": "http://127.0.0.1:8080/v1" },
      "models": {
        "Qwen3.6-35B-A3B-UD-Q4_K_M": { "name": "Qwen3.6-35B-A3B-UD-Q4_K_M" }
      }
    }
  }
}

M2 Macbook Pro, 32GB RAM.

Edit: Claude points out the official model card for this model says, "The model has a default context length of 262,144 tokens. If you encounter out-of-memory (OOM) errors, consider reducing the context window. However, because Qwen3.6 leverages extended context for complex tasks, we advise maintaining a context length of at least 128K tokens to preserve thinking capabilities."

So it's kinda right there on the label, "must be this tall to ride this ride." Maybe that's my answer.

(I also tried KV-cache quantization with -ctk q8_0 -ctv q8_0, but that immediately leads to opencode being unable to remember the current directory name accurately. Seriously, it starts misspelling it right away.)
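For what it's worth, the RAM savings from q8_0 KV quantization are real but bounded: llama.cpp's q8_0 format stores blocks of 32 int8 values plus one f16 scale per block, so it cuts cache size roughly in half versus f16. A quick sketch of the per-element arithmetic:

```python
# Per-element KV-cache size under q8_0 vs f16.
# llama.cpp's block_q8_0 = 32 int8 values + one 2-byte f16 scale.
block = 32
f16_bytes_per_elem = 2.0
q8_0_bytes_per_elem = (block * 1 + 2) / block   # 1.0625 bytes/element

savings = 1 - q8_0_bytes_per_elem / f16_bytes_per_elem
print(f"q8_0 KV cache: ~{q8_0_bytes_per_elem:.4f} B/elem, "
      f"~{savings:.0%} smaller than f16")
```

So at best it would have bought roughly double the context at the same memory budget, and in my test the quality hit made it a non-starter anyway.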

submitted by /u/boutell