Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

arXiv cs.LG / 4/15/2026

Key Points

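Nemotron 3 Super, a 120B-class (12B active) model that combines a hybrid Mamba-Transformer architecture with Mixture-of-Experts (LatentMoE), has been described on arXiv, covering pre-training, post-training, and quantization.
The design emphasizes efficiency and inference performance, including pre-training in NVFP4 and inference acceleration through native speculative decoding with MTP layers.
After pre-training on 25 trillion tokens, the model was post-trained with SFT and RL; it supports up to 1M context while aiming for comparable accuracy on common benchmarks.
Inference throughput is reported to be up to 2.2x and 7.5x higher than GPT-OSS-120B and Qwen3.5-122B, respectively.
A major point is that the training datasets and the base, post-trained, and quantized checkpoints are released as open source on Hugging Face.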

Abstract

We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine-tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on Hugging Face.
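
The abstract credits part of the throughput gain to speculative decoding, where a cheap drafter proposes several tokens and the full model verifies them. The following is a minimal toy sketch of the greedy variant of that general idea, not Nemotron 3 Super's implementation: `target_next` and `draft_next` are hypothetical callables standing in for the full model and for an MTP-style drafter, and the draft length of 4 is an arbitrary assumption.

```python
from typing import Callable, List

Token = int
# Hypothetical interface: given the tokens so far, return the next token (greedy).
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target_next: NextToken, prompt: List[Token], num_new_tokens: int) -> List[Token]:
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(num_new_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[Token],
    num_new_tokens: int,
    draft_len: int = 4,  # assumed value, chosen only for illustration
) -> List[Token]:
    """Greedy speculative decoding: draft a few tokens, keep the verified prefix."""
    out = list(prompt)
    goal = len(prompt) + num_new_tokens
    while len(out) < goal:
        # 1) The drafter proposes up to draft_len tokens autoregressively.
        k = min(draft_len, goal - len(out))
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # 2) The target checks each proposed position. A real system would do
        #    this in one batched forward pass; the loop here is for clarity.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched; the target supplies one extra token
            # if there is still room.
            if len(out) + len(accepted) < goal:
                accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):]
```

With greedy verification, the output matches what the target model would produce on its own; any speedup comes from accepting several tokens per verification step when the drafter agrees with the target.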