Entropy Centroids as Intrinsic Rewards for Test-Time Scaling

arXiv cs.AI / 4/30/2026


Key Points

  • The paper proposes using an intrinsic, uncertainty-based reward signal to improve test-time scaling for large language models, avoiding the need for external reward models.
  • It introduces High Entropy Phases (HEPs): variable-length segments that start at a high-entropy token and end once a run of consecutive low-entropy tokens appears, capturing the temporal structure of model uncertainty during inference (see the sketch after this list).
  • Building on HEPs, it defines the Entropy Centroid (a weighted average position of HEPs along the generation trajectory) to quantify how uncertainty is distributed temporally.
  • It then presents the “Lowest Centroid” selection method, choosing the candidate response with the lowest entropy centroid, which the authors report consistently improves response quality.
  • Experiments across math, code generation, logical reasoning, and agentic tasks—using models from 14B to 480B parameters—show stable improvements over prior selection baselines, with code provided publicly.
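To make the HEP idea from the key points concrete, here is a minimal sketch of the segmentation step. It is not the authors' implementation: the entropy thresholds `high_thresh` and `low_thresh` and the closing run length `close_run` are hypothetical hyperparameters chosen for illustration, and the paper's repository should be consulted for the actual procedure.

```python
from typing import List, Tuple

def detect_heps(
    entropies: List[float],
    high_thresh: float = 2.0,   # hypothetical: entropy that opens a phase
    low_thresh: float = 0.5,    # hypothetical: entropy counted as "low"
    close_run: int = 3,         # hypothetical: consecutive low-entropy tokens that close a phase
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of High Entropy Phases.

    A phase begins at a token whose entropy exceeds `high_thresh` and ends
    just before `close_run` consecutive tokens fall below `low_thresh`.
    """
    phases: List[Tuple[int, int]] = []
    start = None      # index where the current phase began, or None
    low_streak = 0    # consecutive low-entropy tokens seen inside the phase
    for i, h in enumerate(entropies):
        if start is None:
            if h > high_thresh:
                start, low_streak = i, 0
        else:
            low_streak = low_streak + 1 if h < low_thresh else 0
            if low_streak >= close_run:
                phases.append((start, i - close_run))  # end before the low run
                start, low_streak = None, 0
    if start is not None:                              # phase still open at the end
        phases.append((start, len(entropies) - 1))
    return phases
```

Under this reading, consecutive high-entropy tokens merge into a single phase rather than being scored one by one, which is the clustering behavior the abstract below describes.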

Abstract

An effective way to scale up test-time compute of large language models is to sample multiple responses and then select the best one, as in Grok Heavy and Gemini Deep Think. Existing selection methods often rely on external reward models, which requires training a strong reward model and introduces additional computational overhead. As an alternative, previous approaches have explored intrinsic signals, such as confidence and entropy, but these signals are noisy under naive aggregation. In this work, we observe that high-entropy tokens tend to cluster into consecutive groups during inference, providing a more stable notion of model uncertainty than individual tokens. Together, these clusters reveal temporal patterns of model uncertainty throughout the inference process. Motivated by this observation, we propose to use the temporal structure of uncertainty as an intrinsic reward. To this end, we first formalize the basic unit of segment-level uncertainty as the High Entropy Phase (HEP), a variable-length segment that begins at a high-entropy token and ends when consecutive low-entropy tokens appear. We then define the Entropy Centroid, inspired by the concept of the center of mass in physics, as the weighted average position of all HEPs along the trajectory. Intuitively, a lower centroid indicates early exploration followed by confident generation, which we find often corresponds to higher response quality. Based on this insight, we propose the Lowest Centroid method, which selects the response with the lowest entropy centroid among multiple candidates. Experiments on mathematics, code generation, logical reasoning, and agentic tasks, across model scales ranging from 14B to 480B, show that Lowest Centroid consistently outperforms existing baselines and delivers stable gains as model size increases. Code is available at https://github.com/hkust-nlp/entropy-centroid.
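The abstract does not spell out how the center-of-mass analogy is instantiated, so the sketch below fixes one plausible reading as an assumption: each HEP's "mass" is its summed token entropy and its "position" is the phase midpoint normalized by response length, and selection simply takes the candidate with the smallest centroid. The exact weighting in the paper may differ; the candidate dictionary keys used here are likewise hypothetical.

```python
from typing import Dict, List, Tuple

def entropy_centroid(entropies: List[float], phases: List[Tuple[int, int]]) -> float:
    """Center of mass of High Entropy Phases along the trajectory, in [0, 1].

    Assumption: a phase's mass is its summed entropy, its position is the
    phase midpoint normalized by response length; a response with no HEP
    gets centroid 0.0 (confident from the start).
    """
    n = max(len(entropies), 1)
    total_mass, weighted_pos = 0.0, 0.0
    for start, end in phases:
        mass = sum(entropies[start:end + 1])   # phase "mass"
        pos = ((start + end) / 2) / n          # normalized midpoint position
        total_mass += mass
        weighted_pos += mass * pos
    return weighted_pos / total_mass if total_mass > 0 else 0.0


def lowest_centroid(candidates: List[Dict]) -> Dict:
    """Select the candidate whose uncertainty mass sits earliest in generation.

    Each candidate dict is assumed to carry per-token "entropies" and the
    HEP segments "heps" computed from them (hypothetical keys).
    """
    return min(
        candidates,
        key=lambda c: entropy_centroid(c["entropies"], c["heps"]),
    )
```

Under these assumptions, a response that explores early (HEPs near the start) and then generates confidently has a centroid near 0 and is preferred over one whose uncertainty appears late, matching the intuition stated in the abstract.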