Hi all,
I wanted to share a setup that’s working for me with Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB VRAM) + 96GB RAM.
This is not an interactive chat setup. I’m using it as a coding subagent inside an agentic pipeline, so some of the choices below are specific to that use case.
TL;DR
- Qwen3.6 35B A3B runs fine on 8GB VRAM + RAM as a coding subagent
- my real bug was not a crash: unlimited thinking consumed the whole max_tokens budget
- disabling thinking fixed it
- better fix: use per-request thinking_budget_tokens
- open question: best n-cpu-moe split on 8GB
Hardware / runtime
- GPU: RTX 4060 Laptop, 8GB VRAM
- RAM: 96GB DDR5
- Runtime: llama-server
- Model: Qwen3.6-35B-A3B GGUF
- Use case: coding subagent / structured pipeline work
Current server command
```
llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 50000 \
  -np 1 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v turbo2 \
  --no-mmap \
  --mlock \
  --ctx-checkpoints 1 \
  --cache-ram 0 \
  --jinja \
  --reasoning on \
  --reasoning-budget -1 \
  -b 2048 \
  -ub 2048
```

PowerShell env:
```
$env:LLAMA_SET_ROWS = "1"
$env:LLAMA_CHAT_TEMPLATE_KWARGS = '{"preserve_thinking":true}'
```

Notes on the non-obvious choices
- --n-cpu-moe 99: on 8GB VRAM, I'm currently pushing all MoE expert layers to CPU. This is partly based on my own constraints and partly on community tuning discussions, not on official guidance.
- -np 1: this is a single-user / single-agent setup, so I don't want extra slots wasting RAM.
- -b 2048 -ub 2048: in my tests this gave noticeably better prefill on prompts above ~2K tokens than the lower defaults.
- LLAMA_SET_ROWS=1: community tip, easy to try, seems worth keeping.
- preserve_thinking: true: I'm using this because Qwen3.6 explicitly supports it, and for agent workflows it helps keep prior reasoning in the cache instead of re-deriving everything every turn.
Important distinction: official vs empirical
A few things here are officially documented for Qwen3.6:
- enable_thinking
- preserve_thinking
- thinking mode being on by default
- recommended sampling presets for coding / thinking / non-thinking use
Other parts of this config are just my current best empirical setup or community-derived tuning, especially around MoE placement, KV config, and batch / ubatch choices.
So I’m posting this as “working setup + observations”, not as a universal best config.
The trap I ran into: thinking can eat the whole output budget
What initially looked like a weird bug turned out to be a budgeting issue.
I’m calling llama-server through the OpenAI-compatible API with chat.completions.create, and I was setting max_tokens per request.
With:
- --reasoning on
- --reasoning-budget -1
- moderately large prompts
- coding tasks that invite long internal reasoning
…the model could spend the entire output budget on thinking and return no useful visible answer.
In practice I saw cases like this:
| max_tokens | thinking | finish_reason | visible code output | elapsed |
|---|---|---|---|---|
| 6000 | ON | length | empty / unusable | ~190s |
| 10000 | ON | length | empty / unusable | ~330s |
| 5000 | OFF | stop | ~3750 tokens of clean code | ~126s |
So for some coding tasks, the model wasn’t “failing” in the classic sense. It was just burning the whole budget on reasoning.
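In my pipeline I now treat this combination as a distinct failure mode rather than a generic error. A minimal sketch of the check (the response shape follows the OpenAI-compatible chat completion format; the helper name is mine):

```python
def thinking_ate_budget(resp: dict) -> bool:
    """Heuristic: the request hit max_tokens and produced no visible answer,
    which with reasoning enabled usually means thinking consumed the budget."""
    choice = resp["choices"][0]
    hit_length = choice.get("finish_reason") == "length"
    visible = (choice["message"].get("content") or "").strip()
    return hit_length and not visible
```

When this fires, my pipeline retries with thinking disabled or with a bounded budget instead of blindly raising max_tokens, since the table above shows raising it from 6000 to 10000 just burned more time.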
The useful part: there is a per-request fix
I originally thought reasoning budget might only be controllable server-side.
But llama-server supports a per-request field:
```
{ "thinking_budget_tokens": 1500 }
```

As I understand it, this works if you did not already fix the reasoning budget via CLI.
So the cleaner approach for my use case is probably:
- don’t hardcode a global reasoning budget if I want request-level control
- disable thinking for straightforward refactors
- use bounded thinking for tasks that genuinely benefit from it
My current rule of thumb
Right now I’m leaning toward:
| Task type | Thinking | My current view |
|---|---|---|
| Clear refactor from precise spec | OFF | better throughput, less token waste |
| Moderately ambiguous coding | ON, but bounded | probably best with request-level budget |
| Architecture / design tradeoffs | ON | worth the cost |
| Fixed-schema extraction / structured transforms | OFF | schema does most of the work |
One more thing: soft-switching thinking
For Qwen3.6, I would not rely on /think or /nothink style prompting as if it were the official control surface.
The documented path is chat_template_kwargs, especially enable_thinking: false when you want non-thinking mode.
So my current plan is to switch modes that way instead of prompt-hacking it.
What I’d love feedback on
- --n-cpu-moe on 8GB VRAM: has anyone found a better split than "just shove everything to CPU" on this class of hardware?
- -b / -ub tuning for very long prompts: 2048 looks good for me so far, but I'd love data points from people pushing 50K+ context regularly.
- KV cache config with Qwen3.6 in practice: I'm using turbo2 right now based on community findings and testing. Curious what others ended up with.
- Thinking policy for agentic coding: if you use Qwen3.6 locally as a coding worker, when do you keep thinking on vs force it off?
Happy to share more details if useful. This is part of a local knowledge-compiler / project-memory pipeline, so I care a lot more about reliable structured output than about chat UX.