Been running local LLMs on a Strix Halo setup (Ryzen AI MAX+ 395, 128GB RAM, 96 GiB shared GPU memory via Vulkan/RADV) under Proxmox with LXC containers and llama-server. Wanted to share where I landed after way too much benchmarking.
THE OLD SETUP (3 text models)
- GLM-4.7-Flash: 30B MoE 3B active, 18GB, 72 tok/s — daily driver, email
- Qwen3.5-35B-A3B: 35B MoE 3B active, 20GB, 55 tok/s — reasoning/coding
- Qwen3-VL-8B: 8B dense, 6GB, 39 tok/s — vision/cameras
~44GB total. It worked, but routing requests across three models was annoying.
THE NEW SETUP (one model)
I ran a 7-model shootout (45 tests, judged by Claude Opus) and landed on:
- Qwen3.5-122B-A10B UD-IQ3_S (10B active, 44GB) — 27.4 tok/s, 440/500
- VL-8B stays separate (to avoid contention with the camera workload)
- Nomic-embed for RAG
~57GB total, 39GB headroom.
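For the RAG side, retrieval is just cosine similarity between the query embedding and stored document embeddings. A minimal sketch, assuming the vectors come from nomic-embed via llama-server's OpenAI-compatible /v1/embeddings endpoint; the 3-dimensional toy vectors below are hardcoded stand-ins, not real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    # Rank stored document vectors against the query vector, best first
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:k]]

# Toy stand-ins for real nomic-embed outputs
docs = {
    "recipe": [0.9, 0.1, 0.0],
    "tax":    [0.1, 0.9, 0.1],
    "camera": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], docs, k=1))  # → ['recipe']
```

With real embeddings you would store the vectors once and only embed the query at request time.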
WHAT IT RUNS:
- Email classification (15 min cron, <2s per email)
- Food app (recipes, meal plans, prep Gantt charts)
- Finance dashboard (tax, portfolio, spending)
- Camera person detection
- Open WebUI + SearXNG
- OpenCode, OpenClaw agent
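For anyone curious what the email classification looks like: it is a single chat-completions call against llama-server's OpenAI-compatible API. A rough sketch; the label set, model name, and port are placeholders for whatever you actually serve:

```python
import json
import urllib.request

LABELS = {"urgent", "action", "newsletter", "spam"}  # hypothetical label set

def build_payload(subject: str, body: str) -> dict:
    # Chat-completions payload for llama-server's OpenAI-compatible API
    return {
        "model": "qwen3.5-122b",   # placeholder; llama-server ignores/echoes this
        "max_tokens": 8,
        "temperature": 0,
        "messages": [
            {"role": "system",
             "content": "Classify the email as one of: urgent, action, "
                        "newsletter, spam. Reply with the label only."},
            {"role": "user", "content": f"Subject: {subject}\n\n{body}"},
        ],
    }

def parse_label(response: dict) -> str:
    # Pull the single-word label out of the completion; fall back to 'action'
    text = response["choices"][0]["message"]["content"].strip().lower()
    return text if text in LABELS else "action"

def classify(subject: str, body: str,
             url: str = "http://localhost:8080/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(subject, body)).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_label(json.load(resp))
```

Temperature 0 plus a tiny max_tokens keeps each classification fast and deterministic, which is what makes the <2s cron budget comfortable.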
SURPRISING FINDINGS:
- IQ3 scored essentially the same as Q4_K_M (440 vs 438) at half the VRAM, and ran faster
- GLM Flash produced 8 empty responses: the thinking phase consumed the whole max_tokens budget before any answer was emitted
- Dense 27B was 8 tok/s on Vulkan. MoE is the way to go.
- The 122B handles concurrency well: email classification stays under 2s even while a long generation is running
- Unsloth Dynamic quants work fine on Strix Halo
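On the concurrency point: llama-server's `-np`/`--parallel` flag splits the context into N slots, so a short request isn't queued behind a long one. A toy sketch of the client-side pattern; the sleep calls are dummy stand-ins for the actual HTTP requests:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(tasks):
    # Submit all tasks at once. With llama-server started with -np > 1,
    # each HTTP request would land in its own slot and run in parallel.
    with ThreadPoolExecutor(max_workers=len(tasks)) as ex:
        futures = [ex.submit(fn) for fn in tasks]
        return [f.result() for f in futures]

# Dummy stand-ins for a long generation and a quick email classification
long_gen = lambda: (time.sleep(0.3), "long")[1]
quick_cls = lambda: (time.sleep(0.3), "label")[1]

start = time.time()
results = run_concurrent([long_gen, quick_cls])
elapsed = time.time() - start
print(results)  # wall time ≈ the slowest task, not the sum of both
```

Run serially these two tasks would take ~0.6s; concurrently they finish in ~0.3s, which is the same effect that keeps the email cron under 2s during a long generation.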
QUESTIONS:
Should I look at Nemotron or other recent models?
Anyone else on Strix Halo / high-memory Vulkan running similar model lineup?
Is IQ3 really good enough long-term?