Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

arXiv cs.CL / 4/30/2026


Key Points

  • The paper proposes Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware reinforcement learning algorithm for LLMs that uses entropy as a continuous core driver rather than a discrete filter or post-hoc regulator.
  • HAPO dynamically adapts optimization using four components: adaptive temperature sampling, token-level group average advantage estimation, differential advantage redistribution, and asymmetric entropy-based adaptive clipping.
  • The method continuously tailors optimization dynamics to each token’s entropy throughout training, aiming for fine-grained regulation and better handling of sequence-length effects.
  • Experiments across mathematical reasoning, code, and logic tasks on multiple models show HAPO consistently outperforms DAPO, and the authors provide an implementation repository.
  • The work advances RL-for-LLMs research by embedding token-level heterogeneity treatment into every stage of the optimization pipeline.
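To make the first component concrete, here is a minimal sketch of what entropy-driven adaptive temperature sampling could look like. The paper does not specify its exact formula; the linear mapping, the `alpha` coefficient, and the `ref_entropy` anchor below are illustrative assumptions, not the authors' implementation.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_temperature(entropy, base_temp=1.0, alpha=0.3, ref_entropy=1.0):
    """Hypothetical mapping from token entropy to sampling temperature.

    High-entropy tokens get a temperature above base_temp (more exploration);
    low-entropy tokens get a temperature below it (more exploitation).
    The linear form and constants are assumptions for illustration.
    """
    temp = base_temp * (1.0 + alpha * (entropy - ref_entropy))
    return max(temp, 1e-3)  # keep the temperature strictly positive
```

Under this sketch, a uniform distribution over four tokens (entropy ≈ 1.386 nats) would be sampled at a higher temperature than a near-deterministic one, matching the paper's stated goal of promoting exploration at high-entropy tokens.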

Abstract

Using entropy as a measure of heterogeneity to guide optimization has emerged as a crucial research direction in Reinforcement Learning for LLMs. However, existing methods typically treat it as a discrete filter or post-hoc regulator rather than a core optimization driver. To fully leverage the potential of entropy and achieve fine-grained regulation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a token-aware algorithm that continuously adapts optimization dynamics based on token-level entropy throughout the entire training process. Our algorithm includes four key components: (1) Adaptive Temperature Sampling, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens. (2) Token-Level Group Average Advantage Estimation, which estimates advantages at the token level, accounting for sequence-length effects while preserving unbiased treatment. (3) Differential Advantage Redistribution, which leverages entropy and importance ratios to adjust advantages for tokens with clear signals. (4) Asymmetric Adaptive Clipping, which dynamically adjusts clipping boundaries based on token-level entropy. Through systematic investigation of entropy, we embed token-level treatment into every stage. Extensive experiments on mathematical reasoning, code, and logic tasks across multiple models demonstrate HAPO's consistent superiority over DAPO. Our code can be found at https://github.com/starriver030515/HAPO.
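The fourth component, asymmetric entropy-based clipping, can be sketched as a small modification to the standard PPO-style clipped objective. The abstract only states that clip boundaries are adjusted dynamically from token entropy; the direction of the asymmetry (widening the upper bound for high-entropy tokens), the `beta` modulation coefficient, and the base `eps` values below are assumptions for illustration, not HAPO's published formula.

```python
def asymmetric_clip_bounds(entropy, eps_low=0.2, eps_high=0.2, beta=0.1):
    """Hypothetical entropy-modulated clip range.

    The lower bound stays fixed, while the upper bound widens with token
    entropy, giving high-entropy (exploratory) tokens more room to increase
    their probability. The linear widening is an illustrative assumption.
    """
    return 1.0 - eps_low, 1.0 + eps_high + beta * entropy

def clipped_token_objective(ratio, advantage, entropy):
    """PPO-style clipped surrogate for one token, using the bounds above.

    ratio is the importance ratio pi_new(token) / pi_old(token);
    advantage is the token's estimated advantage.
    """
    lo, hi = asymmetric_clip_bounds(entropy)
    clipped_ratio = min(max(ratio, lo), hi)
    # Standard pessimistic (min) PPO surrogate, with asymmetric bounds.
    return min(ratio * advantage, clipped_ratio * advantage)
```

For example, with zero entropy the bounds reduce to the symmetric (0.8, 1.2) of vanilla PPO clipping, while a token with entropy 2.0 gets an upper bound of 1.4, so a positive-advantage update on that token is clipped less aggressively.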