I haven't seen many people talking about NVIDIA's new Nemotron-3-Nano model, which was released just a couple of days ago... so, I decided to build a WebGPU demo for it! Everything runs locally in your browser (using Transformers.js). On my M4 Max, I get ~75 tokens per second - not bad! It's a 4B hybrid Mamba + Attention model, designed to be capable of both reasoning and non-reasoning tasks. Link to demo (+ source code): https://huggingface.co/spaces/webml-community/Nemotron-3-Nano-WebGPU
Nemotron-3-Nano (4B), new hybrid Mamba + Attention model from NVIDIA, running locally in your browser on WebGPU.
Reddit r/LocalLLaMA / 3/20/2026
📰 News · Developer Stack & Infrastructure · Tools & Practical Usage · Models & Research
Key Points
- NVIDIA released Nemotron-3-Nano, a 4B hybrid Mamba + Attention model designed to handle both reasoning and non-reasoning tasks.
- A WebGPU-based demo runs entirely locally in the browser (via Transformers.js), showcasing client-side inference without a server.
- The demo reports around 75 tokens per second on an M4 Max, illustrating practical on-device performance for a small model.
- The project provides links to a HuggingFace Spaces demo and source code for experimentation.
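The client-side setup described above can be sketched with the Transformers.js `pipeline` API. This is a minimal, hedged sketch, not the demo's actual source: the model id, `dtype`, and generation parameters below are placeholders/assumptions, and the generation part only runs in a WebGPU-capable browser.

```javascript
// Pure helper: throughput in tokens/second from a token count and elapsed milliseconds.
function tokensPerSecond(numTokens, elapsedMs) {
  return numTokens / (elapsedMs / 1000);
}

// Browser-only sketch: load a text-generation pipeline on the WebGPU backend.
// Assumptions: the model id is hypothetical; 'q4' quantization may or may not
// match what the real demo ships.
async function runDemo(prompt) {
  // Dynamic import so this file still parses outside the browser.
  const { pipeline } = await import('@huggingface/transformers');

  const generator = await pipeline(
    'text-generation',
    'nvidia/Nemotron-3-Nano', // placeholder id, not confirmed by the post
    { device: 'webgpu', dtype: 'q4' }
  );

  const start = performance.now();
  const output = await generator(prompt, { max_new_tokens: 128 });
  const elapsed = performance.now() - start;

  console.log(output[0].generated_text);
  console.log(`${tokensPerSecond(128, elapsed).toFixed(1)} tok/s`);
}
```

At the post's reported ~75 tok/s, `tokensPerSecond` would return 75 for 150 tokens generated in 2000 ms; the demo's real numbers will vary with hardware and quantization.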
Related Articles

Reducing the training burden on veteran engineers: generating PLC-control "ladder diagrams" with AI
日経XTECH

Your AI generated code is "almost right", and that is actually WORSE than it being "wrong".
Dev.to

Lessons from Academic Plagiarism Tools for SaaS Product Development
Dev.to

Windsurf’s New Pricing Explained: Simpler AI Coding or Hidden Trade-Offs?
Dev.to

Building Production RAG Systems with PostgreSQL: Complete Implementation Guide
Dev.to