Hardware: AMD Strix Halo (Ryzen AI MAX+ 395), 128GB RAM, 96GB shared VRAM, Vulkan/RADV, llama-server (kyuz0 Docker image)

Quick disclaimer: I'm not an ML researcher or a scientist. I work in tech and I'm fairly technical, but this is purely a hobby project. The methodology isn't rigorous by academic standards; I just wanted to figure out which model works best for my stuff. I posted some early results on Qwen, and a few people asked me to share more about my specific tests and use cases.

TL;DR: I run local LLMs for async tasks in my homelab. Generic benchmarks weren't helping me pick models, so I wrote my own 45-test suite based on the things I actually use LLMs for. I tested 19 models across 6 families. Gemma 4 26B-A4B ended up on top, but only after fixing two separate bugs that made it look broken on first run.

Why local LLMs, and why I needed my own benchmark

I use Claude (Opus) for interactive coding and reasoning. But I also have a bunch of services running 24/7 that need a local model:
These don't need frontier quality. They need to be fast, reliable, and decent at structured output. MMLU scores and chatbot arena rankings don't tell me whether a model can write a valid Home Assistant automation or classify my Gmail correctly. So I wrote my own tests.

The test suite

45 tests across 12 categories. Each response is scored 0-10 by Claude Opus 4.6 reading the full output against a rubric:
9 of these are "critical" tests that get weighted 2x because they map to my most common use cases. Max score is 540. Each test has a rubric that defines what a good answer looks like. For example, the memory analysis test requires the model to correctly identify that "available" memory (22G) is the real free metric, not "free" (5.7G), and that swap usage is non-critical. The tax calculation test checks that AGI, taxable income, and bracket math are all correct.

After each model runs all 45 tests, Claude Opus acts as the judge using the same rubrics, which keeps scoring consistent across all 19 models but obviously means the scores reflect one judge's interpretation. The rubrics and all raw responses are saved if anyone wants to cross-check.

What I tested

19 model configurations across 6 families, all on Vulkan with llama-server:

Qwen family:
Gemma 4:
Others:
All tested with the same settings.

Results

Top 5 by quality:
Getting Gemma 4 to actually work

Gemma 4 launched on April 1. When I first loaded it, 11 out of 45 tests came back with empty responses. I thought the model was broken. It wasn't. There were two separate problems.

Problem 1: Thinking mode eats your tokens. Gemma 4's chat template turns on thinking by default. The model was burning all 2048 max tokens on internal thinking blocks and returning nothing visible. Disabling thinking fixed it.

Problem 2: Tokenizer bug. llama.cpp had a Gemma 4 tokenizer bug (PR #21343, merged Apr 3) that was silently mangling inputs on longer prompts. After pulling the updated Docker image, Gemma scores jumped 20-23 points across all variants.

Without both fixes, Gemma 4 scored below Coder-Next. With them, it took first place. If you tried Gemma 4 on launch day and it seemed bad, try again with updated llama.cpp and thinking disabled.

Quantization comparison

I tested 5 different quants of Gemma 4 26B to see how much bit depth matters:
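The first failure mode is easy to spot from the client side: if a completion comes back empty with finish_reason "length", the whole token budget was spent inside a thinking block. A sketch against an OpenAI-compatible response (field names follow the standard chat-completions schema; the classification labels are my own):

```python
def diagnose_empty_response(choice: dict) -> str:
    """Classify one chat-completion choice from an OpenAI-compatible server.

    An empty message that stopped on the token limit usually means the
    model burned its whole budget inside an internal thinking block.
    """
    content = (choice.get("message", {}).get("content") or "").strip()
    finish = choice.get("finish_reason")
    if content:
        return "ok"
    if finish == "length":
        return "thinking_exhausted_budget"   # raise max_tokens or disable thinking
    return "empty_for_other_reason"          # e.g. template or format bug

# Shaped like the Gemma 4 failure described above:
bad = {"finish_reason": "length", "message": {"content": ""}}
print(diagnose_empty_response(bad))  # thinking_exhausted_budget
```

This won't tell you *which* fix applies, but it separates "ran out of tokens while thinking" from "template is broken", which map to the two problems above.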
On Coder-Next, ggml actually scored 2 points higher than Unsloth. There isn't a clear universal winner between quantizers. I'd say pick Unsloth for Gemma and ggml for Qwen, but the differences are small enough that it probably doesn't matter.

Things I didn't expect

MoE models are the only option on Vulkan. Everything with 3-10B active params runs at 40-60+ tok/s. Dense models above 9B are too slow to be practical. The Qwen3.5-27B (dense) ran at 6-8 tok/s in my March testing and timed out on most tests. If you're on an iGPU or APU with shared VRAM, don't bother with dense models.

Thinking mode will silently break your setup. Multiple model families (Gemma, Qwen3.5, GPT-OSS*) enable thinking by default in their chat templates. If you're using llama-server and getting empty or truncated responses, check whether the chat template is emitting thinking blocks.

Tokenizer bugs have more impact than quant choice. The Gemma tokenizer fix moved scores by 20+ points. Going from Q4 to Q8 only moved them by 8-15. Keep your llama.cpp build up to date, especially right after new model architectures drop.

GPT-OSS* doesn't work properly on llama-server. The harmony response format produces empty outputs on roughly 25% of prompts regardless of what reasoning settings I tried. The 120B was mostly usable (3 empty out of 45) but the 20B was not (12 empty). If someone has figured out how to fix this, let me know.

Nemotron Cascade-2 surprised me. 62 tok/s, 417/540, 24G VRAM, zero crashes. Back in March the Nemotron-3-Super would crash after 20 sequential requests. The Cascade-2 ran all 45 tests cleanly. Mamba-2 hybrid on Vulkan finally seems stable.

What I'm running now

Switching from Coder-Next to:
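For the thinking-mode failure, stripping reasoning blocks client-side is a cheap safety net for async pipelines. A sketch assuming the common `<think>…</think>` tag convention (tag names vary by model family, so check what your chat template actually emits):

```python
import re

# Matches one <think>...</think> block; some templates use different tag names.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_thinking(text: str) -> tuple[str, bool]:
    """Remove thinking blocks; flag responses that were thinking-only."""
    visible = THINK_RE.sub("", text).strip()
    was_all_thinking = bool(text.strip()) and not visible
    return visible, was_all_thinking

out, thinking_only = strip_thinking("<think>plan the YAML...</think>\nalias: porch_alert")
print(out)            # alias: porch_alert
print(thinking_only)  # False
```

The boolean flag is the useful part for an automation pipeline: a thinking-only response should be retried (or the model relaunched with thinking disabled) rather than passed downstream as an empty string.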
The Q8 and IQ3 together use 37G of my 96G GTT. That leaves 59G for KV cache, which is more room than I've had with any previous config.

Methodology
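The VRAM budget above is simple arithmetic, but it's the check worth scripting when swapping models on shared-memory hardware. A tiny sketch (the 27.5/9.5 split between the two quants is illustrative; only the 37G total and 96G GTT come from the post):

```python
def kv_headroom_gib(gtt_gib: float, model_sizes_gib: list[float]) -> float:
    """GiB left for KV cache after loading the given model files."""
    return gtt_gib - sum(model_sizes_gib)

# Q8 + IQ3 together total 37G on a 96G GTT, leaving 59G for KV cache.
print(kv_headroom_gib(96, [27.5, 9.5]))  # 59.0
```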
45-test benchmark around my homelab use cases and testing 19 local LLMs (incl. Gemma 4 and Qwen 3.5) on a Strix Halo
Reddit r/LocalLLaMA / 4/4/2026
💬 Opinion · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- The author built a custom 45-test benchmark tailored to their homelab workflows (email classification, vision-based camera alert descriptions, meal planning, finance analysis, and Home Assistant automation YAML generation) because standard public benchmarks didn’t predict reliability and structured output quality for their use cases.
- Using an AMD Strix Halo system (Ryzen AI MAX+ 395) with 128GB RAM and a Vulkan/RADV setup via a llama-server Docker image, they evaluated 19 local LLMs across six model families.
- The benchmark scores each response (0–10) by having Claude Opus 4.6 grade the full outputs against rubrics spanning 12 categories such as coding, homelab operations/debugging, and tool-calling tasks.
- Gemma 4 26B-A4B ranked highest after the author fixed two separate bugs that initially caused the model to appear broken, highlighting how test implementation issues can skew comparisons.
- The methodology is explicitly presented as hobby-grade rather than academically rigorous, but it is aimed at practical decision-making for which local model performs best for specific, recurring automation tasks.