DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

arXiv cs.LG / 4/16/2026


Key Points

  • The paper addresses a key challenge in RLVR for LLMs: balancing exploration and exploitation during training, especially for “extremely hard” versus “easy” samples.
  • It proposes DiPO, using a perplexity-space disentangling strategy that splits samples into high-perplexity (exploration) and low-perplexity (exploitation) subspaces for a more fine-grained trade-off.
  • DiPO also introduces a bidirectional reward allocation mechanism to guide perplexity-based exploration and exploitation while minimizing disruptions to verification rewards.
  • Experiments on mainstream tasks including mathematical reasoning and function calling show improved, more stable policy optimization and superior performance versus prior approaches.
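The paper's exact formulation is not reproduced here, but the core disentangling idea can be sketched minimally: compute each sampled response's sequence-level perplexity from its token log-probabilities, then split the batch at a threshold into a high-perplexity (exploration) and a low-perplexity (exploitation) subspace. The `logprobs` field and the threshold choice are illustrative assumptions, not the paper's definitions.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of one sampled response: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def disentangle(samples, threshold):
    """Split a batch of samples into high-perplexity (exploration) and
    low-perplexity (exploitation) subspaces.

    Each sample is assumed to carry a "logprobs" list of per-token
    log-probabilities (a hypothetical schema for this sketch).
    """
    explore = [s for s in samples if sequence_perplexity(s["logprobs"]) >= threshold]
    exploit = [s for s in samples if sequence_perplexity(s["logprobs"]) < threshold]
    return explore, exploit
```

A confident response (tokens near probability 1) lands in the exploitation subspace, while an uncertain, high-perplexity response is routed to exploration, which is the fine-grained split the paper builds its trade-off on.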

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we thoroughly analyze the exploration-exploitation dilemma posed by extremely hard and easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
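The bidirectional reward allocation described above could take many forms; one minimal sketch, under the assumption that the shaping term is a small symmetric bonus/penalty added on top of the verifiable reward, is shown below. The function name, the `eps` magnitude, and the sign convention are all hypothetical illustrations of "bidirectional with minimal impact on verification rewards", not the paper's actual mechanism.

```python
def allocate_reward(verify_reward, perplexity, threshold, eps=0.05):
    """Hypothetical bidirectional shaping: nudge the reward up for
    high-perplexity (exploration) samples and down for low-perplexity
    (exploitation) samples.

    eps is kept much smaller than the verification reward scale so the
    shaping cannot override the verifiable signal itself.
    """
    bonus = eps if perplexity >= threshold else -eps
    return verify_reward + bonus
```

With a verification reward in {0, 1} and `eps=0.05`, a correct high-perplexity response receives 1.05 while a correct low-perplexity one receives 0.95: the ordering imposed by the verifier is preserved, and only ties within the same correctness class are broken in favor of exploration.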