T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning

arXiv cs.AI / 5/5/2026


Key Points

  • The paper argues that persistent instability in multi-turn RL training is largely caused by inefficient exploration, where agents repeatedly take low-information actions that neither reduce uncertainty nor advance the task.
  • It introduces T$^2$PO (Token- and Turn-level Policy Optimization), an uncertainty-aware framework that controls exploration at fine granularity.
  • At the token level, T$^2$PO tracks uncertainty changes and triggers a “thinking” intervention when the marginal reduction in uncertainty falls below a threshold (see the sketch after this list).
  • At the turn level, it detects turns that make little exploration progress and dynamically resamples them to avoid wasting rollouts (a sketch follows the abstract).
  • Experiments on WebShop, ALFWorld, and Search QA show improved training stability and performance, and the authors provide code via the linked GitHub repository.
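
To make the token-level mechanism concrete, here is a minimal sketch. It assumes uncertainty is measured as next-token Shannon entropy; the `window` size and `threshold` are illustrative hyperparameters, not values from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution (an assumed uncertainty measure)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_trigger_thinking(entropy_history, threshold=0.05, window=4):
    """Fire a 'thinking' intervention once the marginal uncertainty reduction
    over the last `window` generated tokens drops below `threshold`.
    Both hyperparameters are illustrative, not taken from the paper."""
    if len(entropy_history) < window + 1:
        return False
    recent = entropy_history[-(window + 1):]
    # Average per-token drop in entropy across the recent window.
    marginal_reduction = (recent[0] - recent[-1]) / window
    return marginal_reduction < threshold
```

In a decoding loop, the generator would append each token's entropy to `entropy_history` and, whenever the check fires, inject a reflection prompt before resuming generation; the exact intervention format is not specified in the summary.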

Abstract

Recent progress in multi-turn reinforcement learning (RL) has significantly improved the performance of reasoning LLMs on complex interactive tasks. Despite advances in stabilization techniques such as fine-grained credit assignment and trajectory filtering, instability remains pervasive and often leads to training collapse. We argue that this instability stems from inefficient exploration in multi-turn settings, where policies continue to generate low-information actions that neither reduce uncertainty nor advance task progress. To address this issue, we propose Token- and Turn-level Policy Optimization (T$^2$PO), an uncertainty-aware framework that explicitly controls exploration at fine-grained levels. At the token level, T$^2$PO monitors uncertainty dynamics and triggers a thinking intervention once the marginal uncertainty change falls below a threshold. At the turn level, T$^2$PO identifies interactions with negligible exploration progress and dynamically resamples such turns to avoid wasted rollouts. We evaluate T$^2$PO in diverse environments, including WebShop, ALFWorld, and Search QA, demonstrating substantial gains in training stability and performance with better exploration efficiency. Code is available at: https://github.com/WillDreamer/T2PO.
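
As a rough illustration of the turn-level control, the sketch below flags a turn for resampling when it makes negligible exploration progress. Progress is approximated here as the drop in mean token entropy between consecutive turns; this proxy, `min_progress`, and the `max_resamples` budget are assumptions for illustration, not the paper's exact formulation.

```python
def should_resample_turn(prev_turn_entropies, curr_turn_entropies,
                         min_progress=0.02, resamples_used=0, max_resamples=2):
    """Flag a turn for resampling when its exploration progress is negligible.

    Progress is approximated as the drop in mean token entropy from the
    previous turn to the current one (an assumed proxy for the paper's
    measure). `min_progress` and `max_resamples` are illustrative."""
    if resamples_used >= max_resamples:
        return False  # respect the resampling budget to bound rollout cost
    prev_mean = sum(prev_turn_entropies) / len(prev_turn_entropies)
    curr_mean = sum(curr_turn_entropies) / len(curr_turn_entropies)
    return (prev_mean - curr_mean) < min_progress
```

Capping resamples per turn keeps the rollout cost bounded while still discarding turns that, by this proxy, contribute little exploration signal.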