[7900XT] Qwen3.6 27B for OpenCode

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • The post seeks guidance on the best way to set up Qwen3.6 27B for OpenCode while working with limited VRAM on an AMD Radeon RX 7900 XT.
  • The author shares a specific llama-server launch configuration (sampling params, cache settings, flash-attn, and a very large context window of 65,536) that currently uses about 18.6/20 GB of VRAM.
  • They estimate roughly 0.5 GB of VRAM headroom remains, leaving room for minor tuning such as context/cache-related adjustments.
  • The author compares the option of using Qwen3.6 35B, noting MoE and possible KV-cache quantization differences, but concludes it likely offers little benefit for their stated goal versus 27B.
  • Overall, the discussion is centered on practical performance/quality tuning for running Qwen-class models locally under VRAM constraints.

I'm just looking for some advice on optimally setting up Qwen3.6 27B for OpenCode. VRAM is a bit scarce, but this is what I've ended up with so far:

llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \
  --port 8080 \
  --host 127.0.0.1 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --temperature 0.6 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --ctx-size 65536 \
  --chat-template-kwargs '{"preserve_thinking": true}'
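
In case it helps anyone, this is roughly how I'd point OpenCode at that server via its OpenAI-compatible /v1 endpoint. It's a sketch based on OpenCode's custom-provider config; the provider and model IDs ("llama-server", "qwen3.6-27b") are placeholders I made up, so check the exact schema against your OpenCode version:

# Sketch: write an OpenCode custom-provider config pointing at llama-server's
# OpenAI-compatible API. Provider/model IDs below are placeholders, not
# anything OpenCode ships with; verify the schema against your version.
mkdir -p ~/.config/opencode
cat > ~/.config/opencode/opencode.json <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama-server": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama.cpp (local)",
      "options": { "baseURL": "http://127.0.0.1:8080/v1" },
      "models": { "qwen3.6-27b": { "name": "Qwen3.6 27B IQ4_XS" } }
    }
  }
}
EOF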

With the launch command above, my VRAM usage is around 18.6/20 GB, so I could potentially stretch it by about 0.5 GB.
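
Most of the budget beyond the weights is KV cache. A back-of-envelope check shows why q8_0 cache matters at 64K context; the layer/head numbers below are illustrative placeholders, not the actual Qwen3.6 27B dims (read the real values from the GGUF metadata):

# Rough KV-cache sizing: 2 (K+V) * layers * kv_heads * head_dim * ctx * bytes/elem.
# LAYERS/KV_HEADS/HEAD_DIM are made-up placeholders, NOT the real Qwen3.6 27B
# config; q8_0 stores 34 bytes per 32-element block (~1.06 bytes/elem).
LAYERS=48; KV_HEADS=8; HEAD_DIM=128; CTX=65536
echo "f16 : $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * 2 / 1048576 )) MiB"
echo "q8_0: $(( 2 * LAYERS * KV_HEADS * HEAD_DIM * CTX * 34 / 32 / 1048576 )) MiB"

With those (made-up) dims that's ~12 GiB of cache at f16 versus ~6.5 GiB at q8_0, which is the kind of saving that makes the fit possible at this context size.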

Of course there is also Qwen3.6 35B, which thanks to MoE could fit without KV-cache quantization at Q4_K_M, or even Q4_K_XL or maybe Q5, but I don't think it would offer any benefit over 27B for this goal.
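
Either way, before wiring up OpenCode, a quick smoke test of the endpoint doesn't hurt; llama-server serves /v1/chat/completions on the same port (jq is just for readability):

# Smoke-test the OpenAI-compatible endpoint and print the reply text.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Reply with one word."}],"max_tokens":16}' \
  | jq -r '.choices[0].message.content'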

submitted by /u/Mordimer86