AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

arXiv cs.AI / 5/4/2026

Key Points

  • The paper introduces AEM (Adaptive Entropy Modulation), a supervision-free credit assignment method for multi-turn LLM agent reinforcement learning under sparse, outcome-only rewards.
  • Instead of adding dense intermediate supervision (e.g., process reward models or auxiliary signals), AEM adaptively modulates entropy dynamics to improve the exploration–exploitation trade-off during training.
  • Theoretically, the authors lift entropy analysis from the token level to the response level, reducing token-sampling variance, and show that entropy drift under natural gradients is governed by the product of the advantage and the relative response surprisal.
  • From this analysis they derive a practical proxy that reshapes training dynamics, enabling an automatic transition from exploration to exploitation (a minimal sketch follows this list).
  • Experiments across benchmarks and models from 1.5B to 32B parameters show AEM’s effectiveness, including a 1.4% improvement when applied to a state-of-the-art approach on SWE-bench-Verified.
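
The paper's exact proxy is not reproduced in this summary, so the following is only a minimal sketch of the idea under stated assumptions: responses are scored at the response level (token log-probs summed per response), the drift proxy is the advantage times the relative response surprisal, and the function name `aem_reweight`, the tanh modulation form, and the `alpha` hyperparameter are all hypothetical, not the authors' implementation.

```python
import torch

def aem_reweight(advantages: torch.Tensor,
                 response_logprobs: torch.Tensor,
                 alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical response-level reweighting inspired by AEM's analysis.

    advantages:        (B,) outcome-level advantage per sampled response
    response_logprobs: (B,) total log-probability of each response under
                       the current policy (summed over its tokens)
    alpha:             modulation strength (hypothetical hyperparameter)
    """
    # Response surprisal -log pi(y|x), summed over the whole response:
    # entropy considerations lifted from token level to response level.
    surprisal = -response_logprobs
    # Relative surprisal: center on the batch mean so the proxy is signed;
    # responses more surprising than average get positive values.
    rel_surprisal = surprisal - surprisal.mean()
    # Entropy-drift proxy from the paper's claim: drift is governed by
    # advantage x relative response surprisal.
    drift_proxy = advantages * rel_surprisal
    # One plausible modulation (an assumption, not the paper's form):
    # amplify entropy-raising updates (reinforcing surprising successes,
    # penalizing confident failures) and damp entropy-collapsing ones.
    # As the policy concentrates, relative surprisal shrinks, the
    # modulation fades, and training drifts toward exploitation.
    return advantages * (1.0 + alpha * torch.tanh(drift_proxy))


# Toy usage: 4 sampled responses with outcome-only advantages.
adv = torch.tensor([1.0, 1.0, -1.0, -1.0])
logp = torch.tensor([-20.0, -80.0, -25.0, -90.0])  # summed token log-probs
print(aem_reweight(adv, logp))
# Surprising success (logp=-80) and confident failure (logp=-25) are
# amplified; confident success and surprising failure are damped.
```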

Abstract

Reinforcement learning (RL) has significantly advanced the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. Yet effective training remains challenging, as sparse, outcome-only rewards make it difficult to assign credit to individual steps in an agent's action trajectory. A common remedy is to introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, but this increases supervision and tuning complexity and often generalizes poorly across tasks and domains. This paper presents AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to achieve a more effective exploration–exploitation trade-off. Theoretically, we elevate entropy analysis from the token level to the response level to reduce token sampling variance and show that entropy drift under natural gradients is intrinsically governed by the product of the advantage and the relative response surprisal. From this analysis, we derive a practical proxy that reshapes training dynamics, enabling a natural transition from exploration to exploitation. Extensive experiments across various benchmarks and models ranging from 1.5B to 32B parameters demonstrate the effectiveness of AEM, including a notable 1.4% gain when integrated into a state-of-the-art baseline on the highly challenging SWE-bench-Verified benchmark.
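
The abstract's theoretical claim, that entropy drift is governed by the product of the advantage and the relative response surprisal, admits a compact formalization. The following is a hedged reconstruction under assumed notation (the symbols S, A, and η are not taken from the paper, and the exact statement, constants, and conditions may differ):

```latex
% Plausible formalization of the response-level entropy-drift claim.
% S is response surprisal, A the outcome-level advantage, eta the
% natural-gradient step size; all notation is assumed.
\[
  \Delta \mathcal{H}(\pi_\theta)
    \;\approx\; \eta \,\operatorname{Cov}_{y \sim \pi_\theta(\cdot \mid x)}
      \bigl( A(x, y),\, S(x, y) \bigr),
  \qquad
  S(x, y) \;=\; -\log \pi_\theta(y \mid x).
\]
```

Under this reading, entropy collapses when confident (low-surprisal) responses carry high advantage and is sustained when high-advantage responses are surprising, so a proxy built from A(x, y) · (S(x, y) − E[S]) can steer training from exploration toward exploitation as surprising successes become routine. This mirrors the known token-level relation ΔH ≈ −Cov(log π, A), lifted to the response level as the abstract describes.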