PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost

arXiv cs.AI / March 24, 2026


Key Points

  • PivotRL is a new post-training framework for long-horizon agentic tasks that aims to combine the compute efficiency of supervised fine-tuning (SFT) with the out-of-domain (OOD) generalization of end-to-end reinforcement learning (E2E RL).
  • It improves training signals by running local on-policy rollouts and selecting “pivot” intermediate turns with high outcome variance, then learning from rewards for functionally equivalent actions rather than exact string matches; minimal sketches of both mechanisms follow this list and the abstract.
  • The authors provide theoretical justification that PivotRL’s pivot selection and functional-equivalence reward design yield strong learning signals (high natural gradient norm) while preserving the relative probability ordering of actions unrelated to the training tasks.
  • Experiments show PivotRL outperforms standard SFT on identical data by +4.17% average in-domain accuracy across four agentic domains and by +10.04% OOD accuracy on non-agentic tasks.
  • On agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL while requiring 4× fewer rollout turns, and it is reported to be adopted in production-scale post-training for NVIDIA’s Nemotron-3-Super-120B-A12B.
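
To make the pivot-selection mechanism concrete, here is a minimal Python sketch of filtering intermediate turns by outcome variance. Everything here is an assumption for illustration: `rollout_from`, the rollout count, and the variance threshold are hypothetical names and values, not the paper's actual interfaces.

```python
import statistics
from typing import Callable, List, Sequence

def select_pivots(
    trajectory: Sequence[object],
    rollout_from: Callable[[object, int], List[float]],
    num_rollouts: int = 8,
    variance_threshold: float = 0.15,
) -> List[int]:
    """Return indices of 'pivot' turns: intermediate turns where local
    on-policy rollouts exhibit high variance in final outcomes, i.e.,
    the turns carrying the strongest learning signal."""
    pivots = []
    for i, turn in enumerate(trajectory):
        # Sample a few on-policy continuations from this turn and score
        # each by its terminal outcome (e.g., task success in {0, 1}).
        outcomes = rollout_from(turn, num_rollouts)
        # Keep the turn only if outcomes disagree enough: near-certain
        # success or failure carries little signal; high variance, much.
        if statistics.pvariance(outcomes) >= variance_threshold:
            pivots.append(i)
    return pivots
```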

Abstract

Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with the OOD accuracy of E2E RL. PivotRL relies on two key mechanisms: first, it executes local, on-policy rollouts and filters for pivots, informative intermediate turns where sampled actions exhibit high variance in outcomes; second, it rewards functionally equivalent actions rather than demanding strict string matching with the SFT demonstrations. We theoretically show that these mechanisms incentivize strong learning signals with high natural gradient norm while maximally preserving the policy's probability ordering on actions unrelated to the training tasks. Compared to standard SFT on identical data, PivotRL achieves +4.17% higher in-domain accuracy on average across four agentic domains and +10.04% higher OOD accuracy on non-agentic tasks. Notably, on agentic coding tasks, PivotRL achieves accuracy competitive with E2E RL using 4× fewer rollout turns. PivotRL is adopted by NVIDIA's Nemotron-3-Super-120B-A12B, serving as the workhorse in production-scale agentic post-training.
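
The second mechanism, the functional-equivalence reward, can be sketched as comparing the observable effect of a sampled action against the effect of the demonstrated action instead of comparing their strings. The `execute` interface and the binary reward below are assumptions for illustration, not the paper's exact design.

```python
from typing import Callable

def functional_equivalence_reward(
    sampled_action: str,
    demo_action: str,
    execute: Callable[[str], object],
) -> float:
    """Reward 1.0 if the sampled action is functionally equivalent to the
    SFT demonstration (same observable outcome), else 0.0."""
    if sampled_action == demo_action:
        return 1.0  # an exact string match is trivially equivalent
    # Otherwise compare effects: e.g., two shell commands that produce the
    # same tool output, or two code edits that pass the same tests.
    return 1.0 if execute(sampled_action) == execute(demo_action) else 0.0
```

Under such a reward, a sampled `ls -la ./src` and a demonstrated `ls -al ./src` would score identically, which is exactly the credit an exact-match SFT loss would withhold.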