45-test benchmark around my homelab use cases and testing 19 local LLMs (incl. Gemma 4 and Qwen 3.5) on a Strix Halo

Reddit r/LocalLLaMA / 4/4/2026


Key Points

  • The author built a custom 45-test benchmark tailored to their homelab workflows (email classification, vision-based camera alert descriptions, meal planning, finance analysis, and Home Assistant automation YAML generation) because standard public benchmarks didn’t predict reliability and structured output quality for their use cases.
  • Using an AMD Strix Halo system (Ryzen AI MAX+ 395) with 128GB RAM and a Vulkan/RADV setup via a llama-server Docker image, they evaluated 19 local LLMs across six model families.
  • The benchmark scores each response (0–10) by having Claude Opus 4.6 grade the full outputs against rubrics spanning 12 categories such as coding, homelab operations/debugging, and tool-calling tasks.
  • Gemma 4 26B-A4B ranked highest after the author fixed two separate bugs that initially caused the model to appear broken, highlighting how test implementation issues can skew comparisons.
  • The methodology is explicitly presented as hobby-grade rather than academically rigorous, but it is aimed at practical decision-making for which local model performs best for specific, recurring automation tasks.

Hardware: AMD Strix Halo (Ryzen AI MAX+ 395), 128GB RAM, 96GB shared VRAM, Vulkan/RADV, llama-server (kyuz0 Docker image)

Quick disclaimer: I'm not an ML researcher or a scientist. I work in tech and I'm fairly technical, but this is purely a hobby project. The methodology isn't rigorous by academic standards. I just wanted to figure out which model works best for my stuff. I posted some early results on Qwen, and some people asked me to share more detail on my specific tests and use cases.

TL;DR: I run local LLMs for async tasks in my homelab. Generic benchmarks weren't helping me pick models, so I wrote my own 45-test suite based on the things I actually use LLMs for. Tested 19 models across 6 families. Gemma 4 26B-A4B ended up on top, but only after fixing two separate bugs that made it look broken on first run.

Why local LLMs, and why I needed my own benchmark

I use Claude (Opus) for interactive coding and reasoning. But I also have a bunch of services running 24/7 that need a local model:

  • Email classification runs every 15 minutes, sorting 50+ emails into categories
  • Camera notifications use a vision model to describe what triggered a motion alert before pushing to my phone
  • Meal planning generates weekly plans with dietary constraints for two people
  • Finance analysis calculates tax scenarios and portfolio projections
  • Home Assistant automations get generated and validated as YAML

These don't need frontier quality. They need to be fast, reliable, and decent at structured output. MMLU scores and chatbot arena rankings don't tell me whether a model can write a valid Home Assistant automation or classify my Gmail correctly. So I wrote my own tests.

The test suite

45 tests across 12 categories. Each response scored 0-10 by Claude Opus 4.6 reading the full output against a rubric:

  • Coding (4 tests): Docker Compose, systemd services, Python scripts, code review
  • Homelab ops (6 tests): Memory analysis, OOM debugging, disk triage, network debug, log parsing
  • Tool calling (5 tests): Proxmox pct/qm commands, SSH chains, Docker ops, git workflows
  • Food/meal planning (6 tests): JSON meal plans, prep schedules, recipe scaling, shopping lists, nutrition
  • Finance (5 tests): Tax calculations, portfolio analysis, FIRE projections, tax-loss harvesting
  • Email classification (3 tests): Category assignment, ambiguous cases, unsubscribe decisions
  • Home Assistant (3 tests): Automation YAML, template sensors, conditions
  • Math (4 tests): Mortgage payoff, probability, number theory, tax optimization
  • Reasoning (3 tests): Energy bills, statistics, logic constraints
  • Instruction following (3 tests): Format compliance, JSON output, negative constraints
  • Long context (1 test): Extract facts from 8K-token infrastructure doc
  • Speed (2 tests): Time-to-first-token, sustained generation

9 of these are "critical" tests that get weighted 2x because they map to my most common use cases. Max score is 540.
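The weighting works out to 36 regular tests at up to 10 points plus 9 critical tests at up to 20, for the 540 maximum. A minimal sketch of that scoring scheme (the test names and criticality flags here are illustrative, not the author's actual suite):

```python
# Weighted scoring sketch: each test scores 0-10, critical tests count double.
# Test names and which tests are "critical" are made up for illustration.

def max_score(tests):
    """Maximum possible score for a suite of (name, is_critical) tests."""
    return sum(10 * (2 if critical else 1) for _, critical in tests)

def weighted_total(results):
    """results: list of (raw_score_0_to_10, is_critical) pairs."""
    return sum(score * (2 if critical else 1) for score, critical in results)

# 36 regular + 9 critical tests -> 540 max, matching the post.
suite = [(f"t{i}", False) for i in range(36)] + [(f"c{i}", True) for i in range(9)]
print(max_score(suite))  # 540
```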

Each test has a rubric that defines what a good answer looks like. For example, the memory analysis test requires the model to correctly identify that "available" memory (22G) is the real free metric, not "free" (5.7G), and that swap usage is non-critical. The tax calculation test checks that AGI, taxable income, and bracket math are all correct. After each model runs all 45 tests, Claude Opus acts as the judge using the same rubric, which lets me be consistent across all 19 models but obviously means the scores reflect one judge's interpretation. The rubrics and all raw responses are saved if anyone wants to cross-check.
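For anyone curious how LLM-as-judge scoring is typically wired up, here's a rough sketch of assembling a grading prompt from a rubric. The rubric text paraphrases the memory-analysis example above; the function names and prompt wording are my illustration, not the author's actual harness:

```python
# Sketch of an LLM-as-judge prompt builder. The rubric paraphrases the
# memory-analysis test from the post; everything else is illustrative.

MEMORY_RUBRIC = (
    "Award up to 10 points. The answer must identify 'available' (22G) as the "
    "real free-memory metric rather than 'free' (5.7G), and note that current "
    "swap usage is non-critical."
)

def build_judge_prompt(rubric: str, question: str, model_response: str) -> str:
    """Assemble the grading prompt sent to the judge model (e.g. Claude Opus)."""
    return (
        "You are grading a local LLM's answer against a rubric.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Model response:\n{model_response}\n\n"
        "Reply with a single integer score from 0 to 10."
    )

def parse_score(judge_reply: str) -> int:
    """Pull the first integer out of the judge's reply; empty replies score 0."""
    for token in judge_reply.split():
        if token.strip(".").isdigit():
            return max(0, min(10, int(token.strip("."))))
    return 0
```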

What I tested

19 model configurations across 6 families, all on Vulkan with llama-server:

Qwen family:

  • Qwen3.5-122B-A10B (10B active MoE) - was my production model until last month
  • Qwen3-Coder-Next 80B-A3B (3B active MoE) - current production model
  • Qwen3.5-35B-A3B (3B active MoE)
  • Multiple quant variants: Unsloth IQ3/IQ4/Q4/Q8 and ggml Q4

Gemma 4:

  • Gemma 4 26B-A4B (3.8B active MoE) - launched Apr 1
  • Gemma 4 E4B (4.5B dense) - tiny multimodal model
  • Multiple quants, both Unsloth and ggml

Others:

  • GPT-OSS 20B and 120B (OpenAI's open models) - incomplete runs, see note below
  • Nemotron Cascade-2 30B-A3B (NVIDIA, Mamba-2 hybrid)
  • GLM-4.7-Flash (Zhipu)
  • Mistral Small 4 119B (6.5B active MoE)

All tested with reasoning = off (more on why below).

https://preview.redd.it/7oahi27wh1tg1.png?width=2080&format=png&auto=webp&s=44333dad9680333d162065170571b3b37f614f49

Results

https://preview.redd.it/u06cdf6zh1tg1.png?width=1930&format=png&auto=webp&s=e249a2226cd25e1720c1ef13dc73da6a494bbabc

Top 5 by quality:

| Rank | Model | Score | tok/s | VRAM |
|------|-------|-------|-------|------|
| 1 | Gemma 4 26B UD-Q8_K_XL | 438/540 (81%) | 41 | 26G |
| 2 | Gemma 4 26B ggml Q8_0 | 435/540 (81%) | 43 | 26G |
| 3 | Qwen3.5-122B UD-IQ3_S | 432/540 (80%) | 27 | 44G |
| 4 | Gemma 4 26B UD-Q4_K_XL | 430/540 (80%) | 47 | 16G |
| 5 | Coder-Next ggml Q4_K_M | 428/540 (79%) | 52 | 46G |

Getting Gemma 4 to actually work

Gemma 4 launched on April 1. When I first loaded it, 11 out of 45 tests came back with empty responses. I thought the model was broken. It wasn't. There were two separate problems.

Problem 1: Thinking mode eats your tokens. Gemma 4's chat template turns on thinking by default. The model was burning all 2048 max tokens on internal thinking blocks and returning nothing visible. Adding reasoning = off to the llama-server config fixed it. The same thing happened with Qwen3.5 (32 out of 45 tests empty on the 122B). GPT-OSS* uses a "harmony" format with the same issue, and I never fully got that one working.
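For reference, on recent llama.cpp builds one way to disable thinking server-side is the reasoning-budget flag; flag names vary by build, so check `llama-server --help` on your version. The model filename below is just a placeholder:

```shell
# Illustrative launch; model filename and flag availability depend on your build.
# --reasoning-budget 0 disables thinking in llama.cpp builds that support it;
# if the flag isn't recognized, check `llama-server --help` for the equivalent.
llama-server -m gemma-4-26b-UD-Q8_K_XL.gguf --reasoning-budget 0 -c 8192 --port 8080
```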

Problem 2: Tokenizer bug. llama.cpp had a Gemma 4 tokenizer bug (PR #21343, merged Apr 3) that was silently mangling inputs on longer prompts. After pulling the updated Docker image, Gemma scores jumped 20-23 points across all variants.

https://preview.redd.it/e2dfgkz1i1tg1.png?width=1630&format=png&auto=webp&s=25df3ab37ff8df972a4d0be94f3693e4871bd1d8

Without both fixes, Gemma 4 scored below Coder-Next. With them, it took first place. If you tried Gemma 4 on launch day and it seemed bad, try again with updated llama.cpp and thinking disabled.

Quantization comparison

I tested 5 different quants of Gemma 4 26B to see how much bit depth matters:

https://preview.redd.it/yji3h6p5i1tg1.png?width=1931&format=png&auto=webp&s=52ed55b0d6f71b9c64f83690dce2d7ff937ccb4c

  • IQ3 at 11G gets 98% of Q4's quality, uses 35% less VRAM, and is 24% faster
  • Q8 scores the highest (438 vs 423-430) but needs 2.4x the VRAM of IQ3
  • Unsloth Dynamic quants scored 3-5 points higher than ggml-org at the same bit depth, though ggml was slightly faster

https://preview.redd.it/gko3zjk8i1tg1.png?width=1331&format=png&auto=webp&s=7301864760b34a647eab455c1ca5d4bc95017d70

On Coder-Next, ggml actually scored 2 points higher than Unsloth. There isn't a clear universal winner between quantizers. I'd say pick Unsloth for Gemma and ggml for Qwen, but the differences are small enough that it probably doesn't matter.

Things I didn't expect

MoE models are the only option on Vulkan. Everything with 3-10B active params runs at 40-60+ tok/s. Dense models above 9B are too slow to be practical. The Qwen3.5-27B (dense) ran at 6-8 tok/s in my March testing and timed out on most tests. If you're on an iGPU or APU with shared VRAM, don't bother with dense models.

Thinking mode will silently break your setup. Multiple model families (Gemma, Qwen3.5, GPT-OSS*) enable thinking by default in their chat templates. If you're using llama-server and getting empty or truncated responses, look for thinking = 1 in the server logs and add reasoning = off to your config. For some models this was the difference between scoring 0 and scoring 438.

Tokenizer bugs have more impact than quant choice. The Gemma tokenizer fix moved scores by 20+ points. Going from Q4 to Q8 only moved them by 8-15. Keep your llama.cpp build up to date, especially right after new model architectures drop.

GPT-OSS* doesn't work properly on llama-server. The harmony response format produces empty outputs on roughly 25% of prompts regardless of what reasoning settings I tried. The 120B was mostly usable (3 empty out of 45) but the 20B was not (12 empty). If someone has figured out how to fix this, let me know.

Nemotron Cascade-2 surprised me. 62 tok/s, 417/540, 24G VRAM, zero crashes. Back in March the Nemotron-3-Super would crash after 20 sequential requests. The Cascade-2 ran all 45 tests cleanly. Mamba-2 hybrid on Vulkan finally seems stable.

What I'm running now

Switching from Coder-Next to:

  • Primary: Gemma 4 26B-A4B UD-Q8_K_XL (26G) for quality-sensitive tasks
  • Fast secondary: Gemma 4 26B-A4B UD-IQ3_S (11G) for email classification and agent loops
  • Vision: keeping Qwen3-VL-8B for camera snapshots for now

The Q8 and IQ3 together use 37G of my 96G GTT. That leaves 59G for KV cache, which is more room than I've had with any previous config.

https://preview.redd.it/rovrjtcbi1tg1.png?width=1623&format=png&auto=webp&s=17930b4f86c1b02dba57e9ebdf4b51b6eb7267c7

Methodology

  • Temperature 0, max_tokens 2048 (4096 for sustained generation test)
  • One model loaded at a time, no multi-model serving during tests
  • Claude Opus 4.6 scored each response against the rubric
  • Empty responses (model generated tokens but visible output was blank) scored 0
  • GPT-OSS* scores have asterisks because they didn't complete all tests
  • Happy to share the test suite, rubrics, and raw JSON if anyone wants to run the same tests on their hardware
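As a rough illustration of those run settings, each test request to llama-server's OpenAI-compatible chat endpoint would look something like this (temperature and token limits per the list above; the helper function and URL are my sketch):

```python
import json

# Sketch of one benchmark request matching the stated methodology:
# temperature 0, max_tokens 2048 (4096 for the sustained-generation test).
def build_request(prompt: str, sustained: bool = False) -> dict:
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": 4096 if sustained else 2048,
    }

payload = build_request("Write a Home Assistant automation as YAML ...")
body = json.dumps(payload)  # POST this to http://localhost:8080/v1/chat/completions
```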
submitted by /u/MBAThrowawayFruit