The 4B class of 2026 (benchmark)
Reddit r/LocalLLaMA / 4/28/2026

Bench 2 from my 18 GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology). This week: the 4B class of 2026, every model released or actively current at the 3-4B size, head-to-head on the same task suite.

Lineup (sizes on disk):

- gemma4:e4b: 9.6 GB (Google, Apr 2 2026)
- qwen3.5:4b: 3.4 GB (Alibaba, Mar 1 2026)
- granite4:3b: 2.1 GB (IBM, Oct 2025)
- nemotron-3-nano:4b: 2.8 GB (NVIDIA, Mar 2026)
- phi4-mini:3.8b: 2.5 GB (Microsoft, late 2024)

39 tasks: 15 finance (P/E, NPV, CAGR, Sharpe), 15 reasoning (word problems, syllogisms, probability), 9 code (FizzBuzz-tier). 3 trials per (model × task), median aggregation. temp=0, seed=42, max_tokens=1024.

**Headline: Nemotron 3 Nano won and it's not close**

| model | overall | finance | reasoning | code |
|---|---|---|---|---|
| nemotron-3-nano:4b | 85% | 100% | 80% | 67% |
| phi4-mini:3.8b | 77% | 80% | 60% | 100% |
| gemma4:e4b | 62% | 60% | 60% | 67% |
| granite4:3b | 54% | 60% | 20% | 100% |
| qwen3.5:4b | 15% | 20% | 20% | 0% |

NVIDIA's nano is barely a month old and went 15-for-15 on finance. Looking at the responses (visible in the gist), it's a thinking model that shows its work. That's a 2.8 GB model on disk producing the right answer with the right intermediate steps. On finance specifically, it beat every larger model.

**Lab personalities are real at this size**

Look at the per-category lines for granite4:3b vs nemotron-3-nano:4b:

- granite: code 100%, reasoning 20%
- nemotron: code 67%, reasoning 80%

Two ~3-4 GB models with almost mirror-image profiles. Granite is a dedicated coder with weak reasoning. Nemotron is a dedicated reasoner with mediocre code. Both come from labs (IBM, NVIDIA) that don't position these as specialist models; they're marketed as general-purpose at this size. The marketing is wrong; the data shows clear specialization.

phi4-mini sits in between: 100% on code, 80% on finance, 60% on reasoning. The most balanced of the bunch, and the bang-for-GB winner at 30.8 accuracy points per GB on disk (77% overall over 2.5 GB).

**The Qwen 3.5 4b problem**

15% accuracy. 30 of 39 responses empty (avg response length: 21 chars out of a 1024-token budget). Same failure mode as Qwen3:4b in bench 1 four months ago: a thinking model that can't finish thinking inside a fixed budget that's reasonable for non-thinking models in the same weight class. One of the truncated responses gets as far as

$$PV = \frac{100{,}000}{(1 + 0.08)^5}$$

and runs out of budget before it can evaluate it (the finished calculation is in the appendix below). The model isn't broken; my budget gave thinking models 1024 tokens when they need 4096+ to finish. Granite finishes in ~75 tokens on average, Nemotron in ~170; Qwen 3.5 4b burns its full 914 tokens of visible-plus-hidden output and still doesn't finish.

This is now a pattern across two bench posts. The eval ecosystem has a thinking-model-in-a-fixed-budget problem, and I don't think the answer is "make the budget bigger": that punishes the non-thinkers with bloated runs and obscures what's actually being measured. I'm going to try per-model token budgets in bench 3 (rough sketch in the appendix). Open to better ideas; comment if you have them.

**Methodology + repo**

Apple M3 Pro, 18 GB, macOS 25.5, Ollama 0.21. temp=0, seed=42, max_tokens=1024 across all models (this is the design flaw above). 3 trials per task, median aggregation. All graders are deterministic regex/numeric/exec, no LLM-as-judge (trial-loop and grader sketches are in the appendix).

Repo: https://github.com/joshuahickscorp/bench2

Raw JSONL with full responses + per-token timings: https://gist.github.com/joshuahickscorp/1e8947e2f14dea0930f6f33d987c335e

**Up next**

Bench 3: lab personalities deep-dive. Should land in 3 days.
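**Appendix: sketches referenced above**

First, the trial loop. This is a minimal reconstruction from the settings quoted in Methodology (temp=0, seed=42, max_tokens=1024, 3 trials, median aggregation); the function names and structure are my guesses, not the repo's actual harness.

```python
# Minimal sketch of the trial loop, reconstructed from the Methodology
# settings. Function names and structure are guesses; see the repo for
# the real harness.
import statistics
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
OPTIONS = {"temperature": 0, "seed": 42, "num_predict": 1024}  # num_predict = max_tokens

def run_trial(model: str, prompt: str) -> str:
    r = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": OPTIONS,
    }, timeout=600)
    r.raise_for_status()
    return r.json()["response"]

def score_task(model: str, prompt: str, grade) -> float:
    # 3 trials per (model x task), median of the per-trial 0/1 grades
    return statistics.median(grade(run_trial(model, prompt)) for _ in range(3))
```

With temp=0 and a fixed seed the three trials should be near-identical, so the median mostly guards against serving-side nondeterminism.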
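Second, a grader in the spirit of "deterministic regex/numeric/exec, no LLM-as-judge". The regex and tolerance here are illustrative assumptions, not the repo's actual grading rules.

```python
# Hedged sketch of a deterministic numeric grader. The regex and the
# relative tolerance are illustrative guesses, not the repo's rules.
import re

NUMBER_RE = re.compile(r"-?\$?\d[\d,]*(?:\.\d+)?")

def grade_numeric(response: str, expected: float, rel_tol: float = 1e-3) -> int:
    """Return 1 if the last number in the response matches expected, else 0."""
    matches = NUMBER_RE.findall(response)
    if not matches:
        return 0  # empty/truncated responses (the Qwen failure mode) score 0
    value = float(matches[-1].replace("$", "").replace(",", ""))
    return int(abs(value - expected) <= rel_tol * max(abs(expected), 1.0))
```

Grading only the last number is a common convention for step-by-step outputs; it is also brutal to truncated responses, which is consistent with Qwen's 15%.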
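Third, what per-model token budgets in bench 3 might look like. The numbers are illustrative guesses keyed to the token averages quoted above, not decided values.

```python
# Illustrative per-model token budgets for bench 3. Numbers are guesses
# keyed to the averages above (~75 for granite, ~170 for nemotron, 914+
# and still truncating for qwen3.5), not final values.
TOKEN_BUDGETS = {
    "granite4:3b": 1024,         # finishes in ~75 tokens anyway
    "phi4-mini:3.8b": 1024,
    "gemma4:e4b": 1024,
    "nemotron-3-nano:4b": 2048,  # thinking model: headroom beyond ~170 visible tokens
    "qwen3.5:4b": 4096,          # the "needs 4096+" case
}

def options_for(model: str) -> dict:
    # same decoding settings everywhere, model-specific cap only
    return {"temperature": 0, "seed": 42,
            "num_predict": TOKEN_BUDGETS.get(model, 1024)}
```

The obvious catch is comparability: if budgets differ per model, tokens-used probably needs to be reported next to accuracy so a 4096-token pass isn't read as equivalent to a 75-token one.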
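Finally, for reference, the present-value calculation the truncated Qwen response was heading toward. The finished arithmetic is mine, not from the gist:

$$PV = \frac{100{,}000}{(1 + 0.08)^5}, \qquad (1.08)^5 \approx 1.46933, \qquad PV \approx \$68{,}058$$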
Key Points
- A benchmark comparing 5 “4B-class” LLMs (≈3–4B parameters) across 39 tasks (finance, reasoning, and code) found NVIDIA’s Nemotron-3-Nano (4B) to be the clear overall winner.
- Nemotron-3-Nano achieved standout results in finance, going 15-for-15 and showing coherent step-by-step calculations within the 1024-token budget using explicit `<think>`…`</think>` reasoning.
- The test indicates that even at 3–4GB disk sizes, models exhibit distinct “specialist vs generalist” behavior: Granite-4 (IBM) favored code while Nemotron-3-Nano strongly favored reasoning.
- Phi-4-mini (Microsoft) was the most balanced across categories and delivered the best efficiency, measured as accuracy percentage per GB on disk.
- Qwen 3.5 4B underperformed sharply (around 15% accuracy), often producing empty or very short responses, suggesting it struggles to complete its reasoning under the same evaluation setup.