Hey everyone,
I’ve been experimenting with running Qwen models locally on my setup:
- GPU: RTX 3090 (24GB VRAM)
- RAM: 64GB
- CPU: Ryzen 5700X
- OS: Windows 11
**What I’m currently running**
**Qwen 3.6 35B (UD Q4_K_M)**
```
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -c 131072 -np 2 -fa on -ctk f16 -ctv f16 -b 2048 -ub 512 -t 8 --mlock -rea on --reasoning-budget 2048 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
```

**Qwen 3.6 27B (UD Q4_K_XL)**
```
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 196608 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --no-mmap -rea on --reasoning-budget -1 --reasoning-format deepseek --jinja --metrics --slots --port 8081 --host 0.0.0.0
```
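For reference, this is how I sanity-check a server once it's up: llama-server serves a /health endpoint, and with --metrics enabled it also exposes Prometheus-style counters (prompt/generated token totals) on /metrics.

```
curl http://localhost:8081/health
curl http://localhost:8081/metrics
```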
**My use case**
- Hermes agent (on a Raspberry Pi 5) → Reddit scraping, job scraping, basic automation
- Local coding (OpenCode / QwenCode) → small scripts, debugging, patching
- Occasional infra setup via prompts
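Since the server binds to 0.0.0.0, the Pi doesn't need anything special; the agent just calls the 3090 box over the LAN through the OpenAI-compatible endpoint. Something like this, where 192.168.1.50 stands in for the desktop's LAN IP:

```
curl http://192.168.1.50:8081/v1/chat/completions -H "Content-Type: application/json" -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Summarize this job listing: ...\"}]}"
```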
**Issues I’m facing**
- 35B is too slow
  - Even simple tasks take far too long to respond; it feels unusable for anything iterative.
- 27B is faster but unreliable
  - Code often breaks
  - Sometimes takes 20–30 minutes, even for simple tasks
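To put numbers on "too slow": llama-bench ships with the same llama.cpp binaries as llama-server and reports prompt processing and generation speed separately, so it should show whether the bottleneck is the model or my flags. A minimal run (same model path as above):

```
llama-bench.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 -p 512 -n 128 -t 8
```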
**What I’m looking for**
- Better model + quant recommendations
  - Something that actually runs well on a 3090
  - A good balance of speed and coding reliability
- Ways to improve throughput (t/s)
  - Are my flags bad?
  - Is my context size too high? (A trimmed-down command I'm considering is sketched below.)
  - Anything obvious I'm missing?
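On the context question, my working theory is that the 131k/196k contexts are eating VRAM that could otherwise hold model layers, so this is the trimmed baseline I'm considering (untested; same flags as above, just a much smaller context plus the quantized KV cache):

```
llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 -c 32768 -np 1 -fa on -ctk q8_0 -ctv q8_0 -b 2048 -ub 512 -t 8 --jinja --metrics --slots --port 8081 --host 0.0.0.0
```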
**Auto model loading / routing**

Right now, to switch models I have to:
- Kill the server
- Paste the new command
- Reload the model

Is there a way to:
- Auto-switch models based on the request?
- Or keep multiple models warm and route between them?
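The closest thing I've found so far is llama-swap, a small proxy that spawns and kills llama-server instances based on the "model" field in the request. A rough config sketch from memory; treat the field names as approximate and check its README for the exact schema:

```yaml
models:
  "qwen-35b":
    cmd: llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -ngl 99 --port ${PORT}
  "qwen-27b":
    cmd: llama-server.exe -m "C:\Users\Dino\.lmstudio\models\unsloth\Qwen3.6-27B-GGUF\Qwen3.6-27B-UD-Q4_K_XL.gguf" -ngl 99 --port ${PORT}
    ttl: 300  # assumed option: unload after 5 minutes idle
```

Clients would then point at llama-swap's port and set "model" in the request body, and it handles the kill/reload dance. Has anyone here run it on Windows?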
**What’s your stack?**
Thanks in advance for any suggestions or help; really appreciate it.