PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
arXiv cs.AI / 2026-03-24
Key Points
- PivotRL is a new post-training framework for long-horizon agentic tasks that aims to combine the compute efficiency of SFT with the out-of-domain generalization accuracy of end-to-end (E2E) RL.
- It improves training signals by running local on-policy rollouts and selecting "pivot" intermediate turns with high outcome variance, then learning from rewards granted to functionally equivalent actions rather than exact string matches.
- The authors provide theoretical justification that PivotRL's pivot selection and functional-equivalence reward design promote strong learning signals while preserving the relative ordering of action probabilities outside the training tasks.
- Experiments show PivotRL outperforms standard SFT by +4.17% average in-domain accuracy across four agentic domains and by +10.04% OOD accuracy on non-agentic tasks.
- On agentic coding tasks, PivotRL achieves results competitive with E2E RL while requiring 4× fewer rollout turns, and it is reported to be adopted in production-scale post-training for NVIDIA's Nemotron-3-Super-120B-A12B.
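The core recipe described above can be sketched in two parts: select intermediate turns whose local rollout outcomes vary the most (the "pivots"), and reward an action by what it does rather than how it is spelled. The sketch below is illustrative only, the toy rollout table, threshold, and helper names (`select_pivots`, `equivalence_reward`) are assumptions, not the paper's actual interfaces.

```python
from statistics import pvariance

# Toy rollout data: for each intermediate turn, the final 0/1 rewards of
# several on-policy rollouts resumed from that turn. (Illustrative values,
# not from the paper.)
ROLLOUT_OUTCOMES = {
    0: [1, 0, 1, 0, 1, 0, 0, 1],  # outcome still undecided: choice here matters
    1: [1, 1, 0, 0, 1, 0, 1, 1],  # also high-variance
    2: [1, 1, 1, 1, 1, 1, 1, 1],  # already settled: always succeeds
    3: [0, 0, 0, 0, 0, 0, 0, 0],  # already settled: always fails
}

def select_pivots(rollout_outcomes, var_threshold=0.1):
    """Keep only 'pivot' turns whose rollout outcomes have high variance,
    i.e. where the policy's next action strongly influences the result."""
    return [t for t, outcomes in sorted(rollout_outcomes.items())
            if pvariance(outcomes) > var_threshold]

def equivalence_reward(action, reference, apply_fn):
    """Reward an action if its *effect* matches the reference action's,
    instead of requiring an exact string match."""
    return 1.0 if apply_fn(action) == apply_fn(reference) else 0.0

pivots = select_pivots(ROLLOUT_OUTCOMES)
print(pivots)  # → [0, 1]

# Functionally equivalent but textually different actions earn full reward:
print(equivalence_reward("2*3", "3+3", eval))  # → 1.0
print(equivalence_reward("2*3", "3+4", eval))  # → 0.0
```

Turns that always succeed or always fail carry no learning signal, so training effort concentrates on the high-variance pivots; the equivalence check avoids penalizing correct actions that merely differ in surface form.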

