Targeted Exploration via Unified Entropy Control for Reinforcement Learning

arXiv cs.AI / April 17, 2026


Key Points

  • The paper identifies a key issue with Group Relative Policy Optimization (GRPO) in reinforcement learning: entropy collapse leads the policy to converge too early and reduces diversity.
  • It introduces Unified Entropy Control for Reinforcement Learning (UEC-RL), which increases exploration selectively on difficult prompts while preventing entropy from growing uncontrollably to stabilize training.
  • The framework is designed to expand the search space when needed without sacrificing optimization stability during learning.
  • Experiments on both LLM and VLM reasoning tasks show consistent improvements over RL baselines, measured by Pass@1 and Pass@k.
  • On the Geometry3K benchmark, UEC-RL delivers a 37.9% relative improvement over GRPO, and the authors release the code on GitHub.
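The paper's exact gating and stabilization rules are not detailed in this summary, but the two mechanisms described above — more exploration on difficult prompts, capped so entropy cannot grow unboundedly — can be illustrated with a minimal, hypothetical sketch. The function names, the use of group pass rate as a difficulty signal, and the hard entropy cap are all assumptions for illustration, not the authors' implementation:

```python
def entropy_bonus(token_entropy: float, pass_rate: float,
                  target_entropy: float, max_coef: float = 0.01) -> float:
    """Hypothetical difficulty-gated entropy bonus with a stabilizing cap.

    Illustrates the general idea only:
      (a) exploration is scaled by prompt difficulty, and
      (b) the bonus switches off once entropy reaches a target,
          so entropy cannot grow without bound.
    """
    # Harder prompts (low pass rate within the sampled group)
    # receive a larger entropy coefficient, encouraging exploration.
    difficulty = 1.0 - pass_rate          # in [0, 1]
    coef = max_coef * difficulty
    # Stabilizer: no bonus once entropy already exceeds the target.
    gate = 1.0 if token_entropy < target_entropy else 0.0
    return coef * gate * token_entropy
```

In a GRPO-style loop, such a bonus would simply be added to the policy objective per token; the key property is that it vanishes both on easy prompts and once entropy is already high.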

Abstract

Recent advances in reinforcement learning (RL) have improved the reasoning capabilities of large language models (LLMs) and vision-language models (VLMs). However, the widely used Group Relative Policy Optimization (GRPO) consistently suffers from entropy collapse, causing the policy to converge prematurely and lose diversity. Existing exploration methods introduce additional bias or variance during exploration, making it difficult to maintain optimization stability. We propose Unified Entropy Control for Reinforcement Learning (UEC-RL), a framework that provides targeted mechanisms for exploration and stabilization. UEC-RL activates more exploration on difficult prompts to search for potentially valuable reasoning trajectories. In parallel, a stabilizer prevents entropy from growing uncontrollably, thereby keeping training stable as the model consolidates reliable behaviors. Together, these components expand the search space when needed while maintaining robust optimization throughout training. Experiments on both LLM and VLM reasoning tasks show consistent gains over RL baselines on both Pass@1 and Pass@k. On Geometry3K, UEC-RL achieves a 37.9% relative improvement over GRPO, indicating that it sustains effective exploration without compromising convergence and underscoring UEC-RL as a key technique for scaling RL-based reasoning in large models. Our code is available at https://github.com/597358816/UEC-RL.
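The gains are reported on Pass@1 and Pass@k. For reference, these metrics are conventionally computed with the unbiased estimator of Chen et al. (2021): given n sampled solutions per problem, c of which are correct, the probability that at least one of k drawn samples passes is 1 − C(n−c, k)/C(n, k). A minimal implementation (standard formula, not code from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generated solutions,
    c of which are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so every draw of k
        # must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```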