Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into

Reddit r/LocalLLaMA / 4/21/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research

Key Points

  • A user shares a llama-server configuration that allows running Qwen3.6-35B-A3B (GGUF) as a coding subagent on a laptop with an RTX 4060 (8GB VRAM) and 96GB RAM.
  • They report that the main issue they encountered was not a crash, but “thinking” consuming the entire max_tokens budget; disabling thinking resolves the problem.
  • A more targeted fix is to use a per-request thinking_budget_tokens setting rather than relying on global thinking behavior.
  • The setup includes several non-obvious parameters (notably aggressive MoE layer placement to CPU via --n-cpu-moe 99, tuned batch/ubatch with -b/-ub at 2048, and preserve_thinking enabled) that improved prefill for longer prompts in their tests.
  • They end with an open question about the best n-cpu-moe split for 8GB VRAM, emphasizing that many choices are empirical/community-tuned rather than official guidance.

Hi all,

I wanted to share a setup that’s working for me with Qwen3.6-35B-A3B on a laptop RTX 4060 (8GB VRAM) + 96GB RAM.

This is not an interactive chat setup. I’m using it as a coding subagent inside an agentic pipeline, so some of the choices below are specific to that use case.

TL;DR

  • Qwen3.6 35B A3B runs fine on 8GB VRAM + RAM as a coding subagent
  • my real bug was not a crash: unlimited thinking consumed the whole max_tokens budget
  • disabling thinking fixed it
  • better fix: use per-request thinking_budget_tokens
  • open question: best n-cpu-moe split on 8GB

Hardware / runtime

  • GPU: RTX 4060 Laptop, 8GB VRAM
  • RAM: 96GB DDR5
  • Runtime: llama-server
  • Model: Qwen3.6-35B-A3B GGUF
  • Use case: coding subagent / structured pipeline work

Current server command

llama-server \
  -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  --n-cpu-moe 99 \
  -c 50000 \
  -np 1 \
  -fa on \
  --cache-type-k q8_0 \
  --cache-type-v turbo2 \
  --no-mmap \
  --mlock \
  --ctx-checkpoints 1 \
  --cache-ram 0 \
  --jinja \
  --reasoning on \
  --reasoning-budget -1 \
  -b 2048 \
  -ub 2048

PowerShell env:

$env:LLAMA_SET_ROWS = "1"
$env:LLAMA_CHAT_TEMPLATE_KWARGS = '{"preserve_thinking":true}'

Notes on the non-obvious choices

  • --n-cpu-moe 99: on 8GB VRAM, I’m currently pushing MoE layers to CPU. This is partly based on my own constraints and partly on community tuning discussions, not on official guidance.
  • -np 1: this is a single-user / single-agent setup, so I don’t want extra slots wasting RAM.
  • -b 2048 -ub 2048: in my tests this gave noticeably better prefill on prompts above ~2K tokens than lower defaults.
  • LLAMA_SET_ROWS=1: community tip, easy to try, seems worth keeping.
  • preserve_thinking: true: I’m using this because Qwen3.6 explicitly supports it, and for agent workflows it helps keep prior reasoning in cache instead of re-deriving everything every turn.

Important distinction: official vs empirical

A few things here are officially documented for Qwen3.6:

  • enable_thinking
  • preserve_thinking
  • thinking mode being on by default
  • recommended sampling presets for coding / thinking / non-thinking use

Other parts of this config are just my current best empirical setup or community-derived tuning, especially around MoE placement, KV config, and batch / ubatch choices.

So I’m posting this as “working setup + observations”, not as a universal best config.

The trap I ran into: thinking can eat the whole output budget

What initially looked like a weird bug turned out to be a budgeting issue.

I’m calling llama-server through the OpenAI-compatible API with chat.completions.create, and I was setting max_tokens per request.

With:

  • --reasoning on
  • --reasoning-budget -1
  • moderately large prompts
  • coding tasks that invite long internal reasoning

…the model could spend the entire output budget on thinking and return no useful visible answer.

In practice I saw cases like this:

| max_tokens | thinking | finish_reason | visible code output         | elapsed |
|-----------:|----------|---------------|-----------------------------|--------:|
| 6000       | ON       | length        | empty / unusable            | ~190s   |
| 10000      | ON       | length        | empty / unusable            | ~330s   |
| 5000       | OFF      | stop          | ~3750 tokens of clean code  | ~126s   |

So for some coding tasks, the model wasn’t “failing” in the classic sense. It was just burning the whole budget on reasoning.
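This failure mode is easy to detect programmatically. A minimal sketch, assuming the standard OpenAI chat-completions response shape (the `reasoning_content` field name is illustrative, not verified against a specific llama-server build):

```python
# Sketch: detect the "thinking ate the whole max_tokens budget" case from an
# OpenAI-compatible chat.completions response dict.

def thinking_ate_budget(response: dict) -> bool:
    """True when generation hit max_tokens and no visible content came back."""
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    return choice["finish_reason"] == "length" and not content

# Shapes mirroring the 6000-token and 5000-token runs in the table above.
bad = {
    "choices": [{
        "finish_reason": "length",
        "message": {"content": "", "reasoning_content": "...thousands of thinking tokens..."},
    }]
}
good = {
    "choices": [{
        "finish_reason": "stop",
        "message": {"content": "def refactored(): ..."},
    }]
}

assert thinking_ate_budget(bad)
assert not thinking_ate_budget(good)
```

In an agent loop, a `True` here is a signal to retry with thinking disabled or bounded rather than to keep raising max_tokens.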

The useful part: there is a per-request fix

I originally thought reasoning budget might only be controllable server-side.

But llama-server supports a per-request field:

{ "thinking_budget_tokens": 1500 } 

As I understand it, this only takes effect if you have not already pinned the reasoning budget server-side via the --reasoning-budget CLI flag.

So the cleaner approach for my use case is probably:

  • don’t hardcode a global reasoning budget if I want request-level control
  • disable thinking for straightforward refactors
  • use bounded thinking for tasks that genuinely benefit from it

My current rule of thumb

Right now I’m leaning toward:

| Task type                                       | Thinking        | My current view                          |
|-------------------------------------------------|-----------------|------------------------------------------|
| Clear refactor from precise spec                | OFF             | better throughput, less token waste      |
| Moderately ambiguous coding                     | ON, but bounded | probably best with request-level budget  |
| Architecture / design tradeoffs                 | ON              | worth the cost                           |
| Fixed-schema extraction / structured transforms | OFF             | schema does most of the work             |
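The table above translates into a small request policy. A sketch, where the budget numbers are illustrative and the task-type names are my own labels, not anything official:

```python
# Sketch: map task type -> per-request thinking settings.
# enable_thinking via chat_template_kwargs is the documented Qwen3.6 path;
# thinking_budget_tokens is the llama-server-specific field discussed above.

POLICY = {
    "refactor":   {"thinking": False, "budget": None},
    "ambiguous":  {"thinking": True,  "budget": 1500},  # bounded
    "design":     {"thinking": True,  "budget": None},  # unbounded, worth the cost
    "extraction": {"thinking": False, "budget": None},
}

def thinking_kwargs(task_type: str) -> dict:
    p = POLICY[task_type]
    kwargs: dict = {"chat_template_kwargs": {"enable_thinking": p["thinking"]}}
    if p["thinking"] and p["budget"] is not None:
        kwargs["thinking_budget_tokens"] = p["budget"]
    return kwargs
```

The agent then merges `thinking_kwargs(task_type)` into each request body instead of flipping a global server flag.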

One more thing: soft switching thinking

For Qwen3.6, I would not rely on /think or /nothink style prompting as if it were the official control surface.

The documented path is chat_template_kwargs, especially enable_thinking: false when you want non-thinking mode.

So my current plan is to switch modes that way instead of prompt-hacking it.
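Concretely, that means putting enable_thinking in the request body rather than in the prompt. A minimal sketch of a non-thinking request (llama-server started with --jinja should accept chat_template_kwargs in the JSON body, per the documented Qwen3.6 path):

```python
import json

# Minimal non-thinking request body for /v1/chat/completions,
# switching modes via chat_template_kwargs instead of /nothink prompting.
body = {
    "messages": [{"role": "user", "content": "Rename foo to bar across this file: ..."}],
    "max_tokens": 4000,
    "chat_template_kwargs": {"enable_thinking": False},
}

print(json.dumps(body, indent=2))
```

The same mechanism is what the PowerShell env var above uses for preserve_thinking, just set globally instead of per request.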

What I’d love feedback on

  1. --n-cpu-moe on 8GB VRAM Has anyone found a better split than “just shove everything to CPU” on this class of hardware?
  2. -b / -ub tuning for very long prompts 2048 looks good for me so far, but I’d love data points from people pushing 50K+ context regularly.
  3. KV config with Qwen3.6 in practice I’m using turbo2 right now based on community findings and testing. Curious what others ended up with.
  4. Thinking policy for agentic coding If you use Qwen3.6 locally as a coding worker, when do you keep thinking on vs force it off?

Happy to share more details if useful. This is part of a local knowledge-compiler / project-memory pipeline, so I care a lot more about reliable structured output than about chat UX.

submitted by /u/Antonio_Sammarzano