DiPO: Disentangled Perplexity Policy Optimization for Fine-grained Exploration-Exploitation Trade-Off

arXiv cs.LG / 4/16/2026


Key Points

  • The paper addresses a key challenge in RLVR for LLMs: balancing exploration and exploitation during training, especially for “extremely hard” versus “easy” samples.
  • It proposes DiPO, using a perplexity-space disentangling strategy that splits samples into high-perplexity (exploration) and low-perplexity (exploitation) subspaces for a more fine-grained trade-off.
  • DiPO also introduces a bidirectional reward allocation mechanism to guide perplexity-based exploration and exploitation while minimizing disruptions to verification rewards.
  • Experiments on mainstream tasks including mathematical reasoning and function calling show improved, more stable policy optimization and superior performance versus prior approaches.
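The paper's exact formulation is not reproduced here, but the core disentangling idea can be sketched minimally: compute each sampled response's sequence-level perplexity from its token log-probabilities, then split the batch at a threshold into a high-perplexity (exploration) and a low-perplexity (exploitation) subspace. The `logprobs` field and the threshold choice are illustrative assumptions, not the paper's definitions.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of one sampled response: exp of the mean negative log-prob."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def disentangle(samples, threshold):
    """Split a batch of samples into high-perplexity (exploration) and
    low-perplexity (exploitation) subspaces.

    Each sample is assumed to carry a "logprobs" list of per-token
    log-probabilities (a hypothetical schema for this sketch).
    """
    explore = [s for s in samples if sequence_perplexity(s["logprobs"]) >= threshold]
    exploit = [s for s in samples if sequence_perplexity(s["logprobs"]) < threshold]
    return explore, exploit
```

A confident response (tokens near probability 1) lands in the exploitation subspace, while an uncertain, high-perplexity response is routed to exploration, which is the fine-grained split the paper builds its trade-off on.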

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed significant advances in the reasoning capabilities of Large Language Models (LLMs). However, effectively managing the exploration-exploitation trade-off remains a critical challenge. In this paper, we thoroughly analyze the exploration-exploitation dilemma posed by extremely hard and easy samples during training and propose a new fine-grained trade-off mechanism. Concretely, we introduce a perplexity-space disentangling strategy that divides the sample space into distinct exploration (high-perplexity) and exploitation (low-perplexity) subspaces, thereby mining fine-grained samples that require an exploration-exploitation trade-off. We then propose a bidirectional reward allocation mechanism that implements perplexity-guided exploration and exploitation with minimal impact on verification rewards, enabling more stable policy optimization. Finally, we evaluate our method on two mainstream tasks, mathematical reasoning and function calling; experimental results demonstrate the superiority of the proposed method, confirming its effectiveness in enhancing LLM performance through a fine-grained exploration-exploitation trade-off.
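The bidirectional reward allocation described above could take many forms; one minimal sketch, under the assumption that the shaping term is a small symmetric bonus/penalty added on top of the verifiable reward, is shown below. The function name, the `eps` magnitude, and the sign convention are all hypothetical illustrations of "bidirectional with minimal impact on verification rewards", not the paper's actual mechanism.

```python
def allocate_reward(verify_reward, perplexity, threshold, eps=0.05):
    """Hypothetical bidirectional shaping: nudge the reward up for
    high-perplexity (exploration) samples and down for low-perplexity
    (exploitation) samples.

    eps is kept much smaller than the verification reward scale so the
    shaping cannot override the verifiable signal itself.
    """
    bonus = eps if perplexity >= threshold else -eps
    return verify_reward + bonus
```

With a verification reward in {0, 1} and `eps=0.05`, a correct high-perplexity response receives 1.05 while a correct low-perplexity one receives 0.95: the ordering imposed by the verifier is preserved, and only ties within the same correctness class are broken in favor of exploration.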