I'm looking for advice on optimally setting up Qwen3.6 27B for OpenCode. VRAM is a bit scarce, but this is what I've ended up with so far:
    llama-server --model models/Qwen3.6-27B-IQ4_XS.gguf \
      --port 8080 \
      --host 127.0.0.1 \
      --top-p 0.95 \
      --top-k 20 \
      --min-p 0.0 \
      --temperature 0.6 \
      --flash-attn on \
      --cache-type-k q8_0 \
      --cache-type-v q8_0 \
      --presence-penalty 0.0 \
      --repeat-penalty 1.0 \
      --ctx-size 65536 \
      --chat-template-kwargs '{"preserve_thinking": true}'

With this my VRAM usage is around 18.6/20 GB, so I could potentially stretch it by about 0.5 GB.
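For a feel of where that 18.6 GB goes, the KV-cache share can be estimated from the attention geometry. A minimal sketch; the layer/head numbers below are hypothetical placeholders, not Qwen3.6 27B's actual config, so substitute the real values from the GGUF metadata:

```python
# Rough KV-cache size estimate for a llama.cpp-style cache.
# All model dimensions here are HYPOTHETICAL placeholders, not the
# real Qwen3.6 27B configuration.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx: int, bytes_per_elem: float) -> float:
    """2x (K and V) per layer, per token, per KV head, per head dim."""
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

CTX = 65536
F16 = 2.0       # f16: 2 bytes per element
Q8_0 = 1.0625   # q8_0: ~8.5 bits per element incl. block scales

# Hypothetical geometry: 48 layers, 8 KV heads (GQA), head dim 128.
full = kv_cache_bytes(48, 8, 128, CTX, F16) / 2**30
quant = kv_cache_bytes(48, 8, 128, CTX, Q8_0) / 2**30
print(f"f16 KV cache:  {full:.2f} GiB")
print(f"q8_0 KV cache: {quant:.2f} GiB")
```

With those made-up dimensions, q8_0 roughly halves the cache versus f16, which is the whole reason the 64k context fits at all here.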
Of course there is also Qwen3.6 35B, which, thanks to MoE, could fit without KV-cache quantization at Q4_K_M, or even K_XL, or maybe even Q5, but I don't think it would be of benefit over 27B for this use case.
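As a rough sanity check on which quant fits, file size can be approximated as parameter count times bits per weight. The bpw figures below are ballpark averages for llama.cpp quant mixes (real files also keep some tensors at higher precision), so treat the results as estimates only:

```python
# Back-of-envelope GGUF size: params (billions) * bits-per-weight / 8.
# The bpw values are approximate averages for llama.cpp quant types.

def gguf_size_gb(params_billion: float, bpw: float) -> float:
    return params_billion * bpw / 8  # 1e9 params * bits -> GB

BPW = {"IQ4_XS": 4.25, "Q4_K_M": 4.85, "Q5_K_M": 5.5}  # approximate

for quant, bpw in BPW.items():
    print(f"27B @ {quant}: ~{gguf_size_gb(27, bpw):.1f} GB | "
          f"35B @ {quant}: ~{gguf_size_gb(35, bpw):.1f} GB")
```

For a MoE model the full file size still has to live somewhere, but only the active experts are needed per token, so recent llama.cpp builds let you push expert weights to system RAM (e.g. via `--n-cpu-moe`) while keeping attention and the KV cache on the GPU.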