Been running local LLMs on a Strix Halo setup (Ryzen AI MAX+ 395, 128GB RAM, 96 GiB shared GPU memory via Vulkan/RADV) under Proxmox with LXC containers and llama-server. Wanted to share where I landed after way too much benchmarking.
THE OLD SETUP (3 text models)
- GLM-4.7-Flash: 30B MoE 3B active, 18GB, 72 tok/s — daily driver, email
- Qwen3.5-35B-A3B: 35B MoE 3B active, 20GB, 55 tok/s — reasoning/coding
- Qwen3-VL-8B: 8B dense, 6GB, 39 tok/s — vision/cameras
~44GB total. It worked, but routing requests across three models was annoying.
THE NEW SETUP (one model)
I ran a 7-model shootout (45 tests, judged by Claude Opus) and landed on:
- Qwen3.5-122B-A10B UD-IQ3_S (10B active, 44GB) — 27.4 tok/s, 440/500
- VL-8B stays separate (to avoid contention with the camera workload)
- Nomic-embed for RAG
~57GB total, 39GB headroom.
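For the RAG side, retrieval is just cosine similarity between the query embedding and stored document embeddings. A minimal sketch, assuming the vectors come from nomic-embed via llama-server's OpenAI-compatible /v1/embeddings endpoint; the 3-dimensional toy vectors below are hardcoded stand-ins, not real embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    # Rank stored document vectors against the query vector, best first
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc for doc, _ in scored[:k]]

# Toy stand-ins for real nomic-embed outputs
docs = {
    "recipe": [0.9, 0.1, 0.0],
    "tax":    [0.1, 0.9, 0.1],
    "camera": [0.0, 0.2, 0.9],
}
print(top_k([0.8, 0.2, 0.1], docs, k=1))  # → ['recipe']
```

With real embeddings you would store the vectors once and only embed the query at request time.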
WHAT IT RUNS:
- Email classification (15 min cron, <2s per email)
- Food app (recipes, meal plans, prep Gantt charts)
- Finance dashboard (tax, portfolio, spending)
- Camera person detection
- Open WebUI + SearXNG
- OpenCode, OpenClaw agent
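For anyone curious what the email classification looks like: it is a single chat-completions call against llama-server's OpenAI-compatible API. A rough sketch; the label set, model name, and port are placeholders for whatever you actually serve:

```python
import json
import urllib.request

LABELS = {"urgent", "action", "newsletter", "spam"}  # hypothetical label set

def build_payload(subject: str, body: str) -> dict:
    # Chat-completions payload for llama-server's OpenAI-compatible API
    return {
        "model": "qwen3.5-122b",   # placeholder; llama-server ignores/echoes this
        "max_tokens": 8,
        "temperature": 0,
        "messages": [
            {"role": "system",
             "content": "Classify the email as one of: urgent, action, "
                        "newsletter, spam. Reply with the label only."},
            {"role": "user", "content": f"Subject: {subject}\n\n{body}"},
        ],
    }

def parse_label(response: dict) -> str:
    # Pull the single-word label out of the completion; fall back to 'action'
    text = response["choices"][0]["message"]["content"].strip().lower()
    return text if text in LABELS else "action"

def classify(subject: str, body: str,
             url: str = "http://localhost:8080/v1/chat/completions") -> str:
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(subject, body)).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_label(json.load(resp))
```

Temperature 0 plus a tiny max_tokens keeps each classification fast and deterministic, which is what makes the <2s cron budget comfortable.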
SURPRISING FINDINGS:
- IQ3 scored essentially the same as Q4_K_M (440 vs 438) at half the VRAM, and ran faster
- GLM Flash produced 8 empty responses: the thinking phase consumed the whole max_tokens budget before any answer was emitted
- Dense 27B was 8 tok/s on Vulkan. MoE is the way to go.
- The 122B handles concurrency well: email classification stays under 2s even while a long generation is running
- Unsloth Dynamic quants work fine on Strix Halo
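On the concurrency point: llama-server's `-np`/`--parallel` flag splits the context into N slots, so a short request isn't queued behind a long one. A toy sketch of the client-side pattern; the sleep calls are dummy stand-ins for the actual HTTP requests:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_concurrent(tasks):
    # Submit all tasks at once. With llama-server started with -np > 1,
    # each HTTP request would land in its own slot and run in parallel.
    with ThreadPoolExecutor(max_workers=len(tasks)) as ex:
        futures = [ex.submit(fn) for fn in tasks]
        return [f.result() for f in futures]

# Dummy stand-ins for a long generation and a quick email classification
long_gen = lambda: (time.sleep(0.3), "long")[1]
quick_cls = lambda: (time.sleep(0.3), "label")[1]

start = time.time()
results = run_concurrent([long_gen, quick_cls])
elapsed = time.time() - start
print(results)  # wall time ≈ the slowest task, not the sum of both
```

Run serially these two tasks would take ~0.6s; concurrently they finish in ~0.3s, which is the same effect that keeps the email cron under 2s during a long generation.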
QUESTIONS:
Should I look at Nemotron or other recent models?
Anyone else on Strix Halo / high-memory Vulkan running similar model lineup?
Is IQ3 really good enough long-term?