Struggling with Qwen3.6 27B / 35B locally (3090): slow responses, breaking code; looking for better setup + auto model switching

Reddit r/LocalLLaMA / 5/5/2026

💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage

Key Points

  • A Reddit user with an RTX 3090 (24GB VRAM) on Windows 11 is testing Qwen 3.6 35B and 27B locally and reports severe usability issues: the 35B model is too slow for iterative work, while the 27B model is faster but often produces broken code.
  • They serve the models with llama.cpp's llama-server.exe (pointed at GGUF files in an LM Studio model directory), using different quantization and GPU-offload settings, and they suspect configuration flags, quant choices, and/or context length are contributing to the poor latency and reliability.
  • The user wants recommendations for better model+quant setups that run well on a 3090, specifically balancing response speed with coding accuracy/reliability.
  • They ask how to improve throughput (tokens/second), questioning whether their command-line flags are inappropriate and whether context size is set too high.
  • They also request options for “auto” model loading/routing—either switching models automatically per request or keeping multiple models warm to route between them without restarting the server.

Hey everyone,

I’ve been experimenting with running Qwen models locally on my setup:

GPU: RTX 3090 (24GB VRAM)

RAM: 64GB

CPU: Ryzen 5700X

OS: Windows 11

What I’m currently running

Qwen 3.6 35B (UD Q4_K_M)

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0 

Qwen 3.6 27B (UD Q4_K_XL)

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0 

My use case

  • Hermes agent (on Raspberry Pi 5) → Reddit scraping, job scraping, basic automation
  • Local coding (OpenCode / QwenCode) → small scripts, debugging, patching
  • Occasional infra setup via prompts

Issues I’m facing

  • 35B is too slow
    • Even simple tasks take way too long to respond. Feels unusable for anything iterative.
  • 27B is faster but unreliable
    • Code often breaks
    • Takes 20–30 mins even for simple tasks sometimes

What I’m looking for

  1. Better model + quant recommendations
    • Something that actually works well on a 3090
    • Good balance between speed + coding reliability
  2. Ways to improve throughput (t/s)
    • Are my flags bad?
    • Is the context size set too high?
    • Anything obvious I’m missing? (I’ve put a trimmed-down command I’m considering after this list)
  3. Auto model loading / routing (right now I have to):
    • Kill the server
    • Paste the new command
    • Reload the model
    • Is there a way to:
      • Auto-switch models based on the request?
      • Or keep multiple models warm and route between them? (rough sketch of what I mean after this list)
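
For point 2, this is the kind of trimmed-down command I was thinking of trying for the 27B, if the huge context turns out to be the main culprit. It only reuses flags from my commands above, with the context cut from 196k to 32k, so treat it as a guess rather than a tuned config:

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 32768 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --jinja --metrics --slots --port 8081 --host 0.0.0.0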
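
For point 3, the only workaround I’ve come up with so far is keeping both models resident on separate ports and pointing each client at the right one, roughly like this (again just a sketch: I doubt both fit fully in 24GB VRAM at once, so the second instance would probably need a lower -ngl or to spill into system RAM):

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 32768 -fa on --jinja --port 8081 --host 0.0.0.0

llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 40 -c 32768 -fa on --jinja --port 8082 --host 0.0.0.0

Then OpenCode/QwenCode would point at one port and the Hermes agent at the other. I’ve also seen llama-swap mentioned as a proxy that loads/unloads llama.cpp models automatically behind a single endpoint, but I haven’t tried it myself, so I can’t vouch for it.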

What’s your stack?

Thanks in advance for any suggestions or help, really appreciate it.

submitted by /u/Clean_Initial_9618