Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

arXiv cs.CL, March 30, 2026


Key Points

  • Nemotron-Cascade introduces “cascaded domain-wise reinforcement learning” (Cascade RL) to handle cross-domain heterogeneity in general reasoning tasks, such as varying response lengths and verification latency.
  • The method trains sequentially by domain rather than mixing heterogeneous prompts, reducing engineering complexity while supporting both instruct and deep-thinking modes with no performance gap relative to a thinking-only counterpart.
  • The authors report that using RLHF as a pre-step improves reasoning ability beyond what preference optimization alone achieves, and that later domain-wise RLVR stages typically do not degrade earlier benchmark gains.
  • A 14B model trained with this RL pipeline is claimed to outperform its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and to reach silver-medal performance at the 2025 International Olympiad in Informatics (IOI).
  • The paper states that it transparently shares training and data recipes, supporting reproducibility and adoption by others building reasoning models.

Abstract

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop Nemotron-Cascade, capable of operating in both instruct and deep thinking modes, without any performance gap relative to a thinking-only counterpart. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, RLHF for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and subsequent domain-wise RLVR stages rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
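The core structural idea, sequential domain-wise RL stages instead of one mixed-domain run, can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the stage names, domain ordering, and the `cascade_rl`/`train_fn` interface are all assumptions for exposition; the paper only specifies that RLHF alignment comes first and domain-wise RLVR stages follow.

```python
# Hypothetical sketch of cascaded domain-wise RL (Cascade RL).
# Each stage trains on a single domain's prompts, so per-domain
# curricula (e.g., response-length extension) and verifier latency
# can be handled in isolation rather than in one heterogeneous mix.

def cascade_rl(model, stages):
    """Apply RL stages sequentially; each stage's output model
    initializes the next stage."""
    for name, train_fn in stages:
        model = train_fn(model)  # one domain at a time
    return model

# Toy usage: each "stage" just tags the model state with its domain,
# to show the ordering (RLHF pre-step first, then domain-wise RLVR).
stages = [
    ("rlhf_alignment", lambda m: m + ["rlhf"]),
    ("math_rlvr",      lambda m: m + ["math"]),
    ("code_rlvr",      lambda m: m + ["code"]),
]
result = cascade_rl([], stages)
print(result)  # stages applied strictly in sequence
```

The contrast with conventional mixed-domain RL is that here each stage can use its own hyperparameters and verification infrastructure, and later stages build on (rather than interleave with) earlier ones.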