The 4B class of 2026 (benchmark)

Reddit r/LocalLLaMA / 4/28/2026

💬 Opinion · Signals & Early Trends · Tools & Practical Usage · Models & Research

Key Points

  • A benchmark comparing 5 “4B-class” LLMs (≈3–4B parameters) across 39 tasks (finance, reasoning, and code) found NVIDIA’s Nemotron-3-Nano (4B) to be the clear overall winner.
  • Nemotron-3-Nano achieved standout results in finance, going 15-for-15 and showing coherent step-by-step calculations within the 1024-token budget, using explicit </think>-delimited reasoning.
  • The test indicates that even within the 3–4B parameter class, models exhibit distinct “specialist vs generalist” behavior: Granite-4 (IBM) favored code while Nemotron-3-Nano strongly favored reasoning.
  • Phi-4-mini (Microsoft) was the most balanced across categories and delivered the best efficiency, measured as accuracy percentage per GB on disk.
  • Qwen 3.5 4B underperformed sharply (around 15% accuracy), often producing empty or very short responses, suggesting it struggles to complete its reasoning within the shared 1024-token budget.
The 4B class of 2026 (benchmark)

Bench 2 from my 18GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology). This week: the 4B class of 2026, every model released or actively-current at the 3-4B size, head-to-head on the same task suite.

Lineup (sizes on disk):

  • gemma4:e4b: 9.6 GB (Google, Apr 2 2026)
  • qwen3.5:4b: 3.4 GB (Alibaba, Mar 1 2026)
  • granite4:3b: 2.1 GB (IBM, Oct 2025)
  • nemotron-3-nano:4b: 2.8 GB (NVIDIA, Mar 2026)
  • phi4-mini:3.8b: 2.5 GB (Microsoft, late 2024)

39 tasks: 15 finance (P/E, NPV, CAGR, Sharpe), 15 reasoning (word problems, syllogisms, probability), 9 code (FizzBuzz-tier). 3 trials per (model × task), median aggregation. temp=0, seed=42, max_tokens=1024.
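For reference, a minimal sketch of the run loop as described (deterministic Ollama generations, 3 trials, median aggregation). The `tasks` structure and `grade` function are placeholders, not the repo's actual code:

```python
import statistics
import requests

MODELS = ["gemma4:e4b", "qwen3.5:4b", "granite4:3b",
          "nemotron-3-nano:4b", "phi4-mini:3.8b"]

def generate(model: str, prompt: str) -> str:
    """One deterministic completion via the local Ollama HTTP API."""
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "seed": 42, "num_predict": 1024},
    })
    r.raise_for_status()
    return r.json()["response"]

def run(tasks, grade):
    """3 trials per (model x task); per-task score is the median trial, overall is the mean."""
    results = {}
    for model in MODELS:
        task_scores = []
        for task in tasks:
            trials = [grade(task, generate(model, task["prompt"])) for _ in range(3)]
            task_scores.append(statistics.median(trials))  # grade() returns 0 or 1
        results[model] = sum(task_scores) / len(task_scores)
    return results
```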

Headline: Nemotron 3 Nano won and it's not close

| model | overall | finance | reasoning | code |
|---|---|---|---|---|
| nemotron-3-nano:4b | 85% | 100% | 80% | 67% |
| phi4-mini:3.8b | 77% | 80% | 60% | 100% |
| gemma4:e4b | 62% | 60% | 60% | 67% |
| granite4:3b | 54% | 60% | 20% | 100% |
| qwen3.5:4b | 15% | 20% | 20% | 0% |
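The overall column is consistent with a straight task-weighted average of the category scores (15 finance, 15 reasoning, 9 code); a quick check, assuming that's how it's computed:

```python
# Assumes overall = task-weighted average over 15 finance, 15 reasoning, 9 code tasks.
def overall(finance, reasoning, code):
    return (15 * finance + 15 * reasoning + 9 * code) / 39

print(round(overall(100, 80, 67)))   # 85 -> nemotron-3-nano:4b
print(round(overall(80, 60, 100)))   # 77 -> phi4-mini:3.8b
```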

NVIDIA's nano is barely a month old and went 15-for-15 on finance. Looking at the responses (visible in the gist), it's a thinking model (</think> tags before final answers), and it actually finishes its thinking inside the 1024-token budget. The reasoning is clean: "compute (1.08)^5. 1.08^2 = 1.1664, ^3 = 1.259712, ^4 = 1.36048896, ^5 = 1.4693280768. So PV = 100,000 / 1.4693280768 = approx 68,058."

That's a 2.8 GB model on disk producing the right answer with the right intermediate work. On finance specifically, it beat every larger model.
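The quoted answer checks out; a one-liner to verify the discounting:

```python
# Present value of 100,000 discounted 5 years at 8%: PV = 100,000 / 1.08^5
print(round(100_000 / 1.08 ** 5))  # 68058, matching the model's "approx 68,058"
```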

Lab personalities are real at this size

Look at the per-category lines for granite4:3b vs nemotron-3-nano:4b:

granite: code 100%, reasoning 20%
nemotron: code 67%, reasoning 80%

Two models in the same 3-4B class, almost mirror-image profiles. Granite is a dedicated coder with weak reasoning. Nemotron is a dedicated reasoner with mediocre code. Both come from labs (IBM, NVIDIA) that don't position these as specialist models; they're marketed as general-purpose at this size. The marketing is wrong; the data shows clear specialization.

phi4-mini sits in between: 100% on code, 80% on finance, 60% on reasoning. The most balanced of the bunch and the bang-for-GB winner at 30.8 accuracy-pct per GB on disk.
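The bang-for-GB numbers fall straight out of the table above and the lineup's disk sizes (my recomputation, not something from the repo):

```python
# Bang-for-GB = overall accuracy (%) divided by size on disk (GB).
sizes_gb = {"nemotron-3-nano:4b": 2.8, "phi4-mini:3.8b": 2.5, "gemma4:e4b": 9.6,
            "granite4:3b": 2.1, "qwen3.5:4b": 3.4}
overall  = {"nemotron-3-nano:4b": 85,  "phi4-mini:3.8b": 77,  "gemma4:e4b": 62,
            "granite4:3b": 54,  "qwen3.5:4b": 15}
for m in sizes_gb:
    print(m, round(overall[m] / sizes_gb[m], 1))
# nemotron-3-nano:4b 30.4, phi4-mini:3.8b 30.8, gemma4:e4b 6.5,
# granite4:3b 25.7, qwen3.5:4b 4.4
```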

The Qwen 3.5 4b problem

15% accuracy. 30 of 39 responses were empty (avg response length: 21 chars out of a 1024-token budget). Same failure mode as Qwen3:4b in bench 1 four months ago: a thinking model that can't finish thinking inside a fixed budget that's reasonable for non-thinking models in the same weight class.

Looking at one of the truncated responses: it gets to "$$PV = \frac{100,000}{(1 + 0.08)^5}$$" and runs out of budget mid-formula. The model isn't broken; my budget gave thinking models 1024 tokens when they need 4096+ to finish. Granite finishes in ~75 tokens average, Nemotron in ~170, Qwen 3.5 4b is using its full 914 tokens on visible-plus-hidden output and still not finishing.
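If you want to reproduce the empty/short-response numbers from the gist, something like this works (the field names "model" and "response" are my guess at the JSONL schema; adjust to the actual keys):

```python
import json

def response_stats(path: str, model: str = "qwen3.5:4b"):
    """Count empty responses and average response length for one model."""
    with open(path) as f:
        lengths = [len(rec["response"].strip())
                   for rec in map(json.loads, f)
                   if rec["model"] == model]
    empty = sum(1 for n in lengths if n == 0)
    return empty, len(lengths), sum(lengths) / max(len(lengths), 1)

# Expected shape of the answer for qwen3.5:4b per the numbers above: (30, 39, ~21)
```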

This is now a pattern across two bench posts. The eval ecosystem has a thinking-model-in-fixed-budget problem, and I don't think the answer is "make the budget bigger": that punishes the non-thinkers with bloated runs and obscures what's actually being measured.

I'm going to try per-model token budgets in bench 3. Open to better ideas, comment if you have them.
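To make the idea concrete, one possible shape for per-model budgets (the specific numbers here are illustrative, not decided):

```python
# Illustrative per-model budgets: thinking models that can't finish get more
# room; everything else keeps the original 1024-token cap.
TOKEN_BUDGETS = {
    "qwen3.5:4b":         4096,  # thinking model, couldn't finish at 1024
    "nemotron-3-nano:4b": 1024,  # thinking model, but finishes well inside 1024
    "phi4-mini:3.8b":     1024,
    "granite4:3b":        1024,
    "gemma4:e4b":         1024,
}

def budget_for(model: str, default: int = 1024) -> int:
    return TOKEN_BUDGETS.get(model, default)
```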

Methodology + repo

Apple M3 Pro, 18 GB, macOS 25.5, Ollama 0.21. temp=0, seed=42, max_tokens=1024 across all models (this is the design flaw above). 3 trials per task, median aggregation. All graders are deterministic regex/numeric/exec, no LLM-as-judge.
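For flavor, a grader in the same spirit as the numeric ones described above (regex extraction plus a tolerance check); this is a sketch, not the repo's actual grading code:

```python
import re

def grade_numeric(response: str, expected: float, rel_tol: float = 1e-3) -> bool:
    """Deterministic numeric grader: take the last number in the response and
    compare it to the expected value within a relative tolerance."""
    nums = re.findall(r"-?\d[\d,]*\.?\d*", response.replace("$", ""))
    if not nums:
        return False
    value = float(nums[-1].replace(",", ""))
    return abs(value - expected) <= rel_tol * max(abs(expected), 1.0)

# grade_numeric("So PV = 100,000 / 1.4693280768 = approx 68,058.", 68058.3) -> True
```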

Repo: https://github.com/joshuahickscorp/bench2
Raw JSONL with full responses + per-token timings: https://gist.github.com/joshuahickscorp/1e8947e2f14dea0930f6f33d987c335e

Up next

Bench 3: lab personalities deep-dive. Should land in 3 days.

submitted by /u/FederalAnalysis420