So I have been running some pretty demanding benchmarks on local models lately, and last week I posted results showing that Qwen 3.5 4B Q8 passed every single one of my custom tests. I was genuinely impressed. Then Nemotron 3 4B dropped today, and I immediately grabbed the Q8 quant to put it through the same gauntlet. Spoiler: it did not go well.

The thing that had me excited about Nemotron was its different architecture, which supposedly allows for much larger context windows. That sounded promising on paper. Unfortunately, raw context capacity means nothing if the model cannot reason correctly inside that context. Here is every test I ran, the exact prompts, and what each model actually produced.

TEST 1: Dense multi-part math and structured output

The prompt asked the model to:
Qwen 3.5 4B Q8 (correct):
Every sub-task correct. Clean JSON. Math checks out.

Nemotron 3 nano 4B Q8 (wrong):
The pseudocode for part C was padded with 6 lines of just "#" to reach 14 lines. The proof in part A had wrong substitution steps. Part E had no digits, but the comma placement was wrong and the word count was off. It got lucky on a few numerical answers but failed the reasoning and format behind almost everything.

TEST 2: Full algorithmic design with C++17

The prompt asked for:
Qwen 3.5 4B Q8 (correct):

Described 3D Mo clearly with incremental add/remove using divisor lists and Möbius weights. Produced all 24 pseudocode lines within the character and variable-name limits. The C++17 code was logically correct and compilable. Example outputs: [5, 2, 0, 2].

Nemotron 3 nano 4B Q8 (wrong):

The JSON had malformed arrays. The C++ code had syntax errors and undefined variable references and would not compile. The pseudocode had 16 real lines and 8 "#" padding lines. The example outputs were wrong.

TEST 3: Pattern compression inference

The prompt was simply:
Qwen 3.5 4B Q8 (correct):

Correctly identified the rule as floor(count / 2) for each character, preserving input order. Showed the working:

- A appears 3 times → floor(3/2) = 1
- B appears 3 times → floor(3/2) = 1
- Y appears 1 time → floor(1/2) = 0 (removed)
- U appears 1 time → floor(1/2) = 0 (removed)
- D appears 2 times → floor(2/2) = 1

Answer: ABD

Nemotron 3 nano 4B Q8 (wrong):

Answered AABBBY, showing it had no real understanding of the rule and was pattern-matching superficially without reasoning through the character counts.

TEST 4: UI and frontend generation

I asked both to generate a business dashboard and a SaaS landing page with pricing. The screenshot comparison says everything.

Qwen produced a fully structured dashboard with labeled KPI cards (Revenue, Orders, Refunds, Conversion Rate), a smooth area chart, a donut chart for traffic sources, and a complete landing page with three pricing tiers at R$29, R$79, and R$199 with feature lists and styled buttons.

Nemotron produced an almost empty layout with two placeholder numbers and no charts, and a landing page that was a purple gradient with a single button and the same testimonial card duplicated twice. It looks like a template that forgot to load its content.

Overall verdict

Nemotron 3 nano 4B Q8 failed all four tests. Qwen 3.5 4B Q8 passed all four last week. The architecture novelty that enables larger contexts did not translate into better reasoning, instruction following, structured output, or code generation. If you are picking between these two for local use right now, it is not even a close call.

Full Qwen results from last week are in the comments.
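For reference, the floor(count / 2) rule from Test 3 is easy to pin down in a few lines of Python. The input string below is reconstructed from the character counts stated in the post (A×3, B×3, Y×1, U×1, D×2), not the author's exact prompt:

```python
from collections import Counter

def compress(s: str) -> str:
    # Rule inferred in Test 3: keep floor(count / 2) copies of each
    # character, in order of first appearance in the input.
    counts = Counter(s)
    order = dict.fromkeys(s)  # preserves first-appearance order
    return "".join(ch * (counts[ch] // 2) for ch in order)

# Hypothetical input matching the counts described in the post.
print(compress("ABABYBUDAD"))  # → ABD
```

Any string with those character counts yields the same "ABD" answer, which is what makes Nemotron's "AABBBY" output a clear miss rather than a borderline case.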
I was hyped for Nemotron 3 4B and it completely disappointed me compared to Qwen 3.5 4B
Reddit r/LocalLLaMA / 3/17/2026
Key Points
- Nemotron 3 4B was released recently but did not outperform Qwen 3.5 4B Q8 in the author's benchmarks.
- The author argues that a larger context window is useless if the model cannot reason correctly inside that context.
- In a series of targeted tests (dense multi-part math, modular arithmetic, a Möbius/inclusion-exclusion algorithm, a Lucas theorem calculation, and a constrained Portuguese paragraph), Qwen 3.5 4B Q8 consistently produced correct results while Nemotron 3 4B fell short.
- The takeaway is that architecture promises like larger context windows do not guarantee better practical reasoning or tool usage, at least in this evaluation.
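One of the sub-tasks listed above is a Lucas theorem calculation. The post does not show the exact numbers, but as background, here is a minimal sketch of computing C(n, k) mod p (p prime) via Lucas's theorem; the specific values in the sanity check are illustrative, not taken from the benchmark:

```python
from math import comb

def lucas_binom(n: int, k: int, p: int) -> int:
    # Lucas's theorem: C(n, k) mod p equals the product of
    # C(n_i, k_i) mod p over the base-p digits of n and k.
    result = 1
    while n or k:
        ni, ki = n % p, k % p
        if ki > ni:
            return 0  # any digit with k_i > n_i zeroes the product
        result = result * comb(ni, ki) % p
        n //= p
        k //= p
    return result

# Sanity check against direct computation:
print(lucas_binom(1000, 300, 13) == comb(1000, 300) % 13)  # True
```

This is the kind of multi-step digit bookkeeping where a small model that is merely pattern-matching tends to drop a step, which matches the failure mode the author describes.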
Related Articles

Reduce the burden on veterans training junior staff: generating "ladder diagrams" for PLC control with AI
日経XTECH

Hey dev.to community – sharing my journey with Prompt Builder, Insta Posts, and practical SEO
Dev.to

Why Regex is Not Enough: Building a Deterministic "Sudo" Layer for AI Agents
Dev.to

Perplexity Hub
Dev.to

How to Build Passive Income with AI in 2026: A Developer's Practical Guide
Dev.to