Entropy Centroids as Intrinsic Rewards for Test-Time Scaling
arXiv cs.AI · April 30, 2026
Key Points
- The paper proposes using an intrinsic, uncertainty-based reward signal to improve test-time scaling for large language models, avoiding the need for external reward models.
- It introduces High Entropy Phases (HEPs) as variable-length segments starting at high-entropy tokens and ending when consecutive low-entropy tokens appear, capturing uncertainty structure over time during inference.
- Building on HEPs, it defines the Entropy Centroid (a weighted average position of HEPs along the generation trajectory) to quantify how uncertainty is distributed temporally.
- It then presents the “Lowest Centroid” selection method, choosing the candidate response with the lowest entropy centroid, which the authors report consistently improves response quality.
- Experiments across math, code generation, logical reasoning, and agentic tasks—using models from 14B to 480B parameters—show stable improvements over prior selection baselines, with code provided publicly.
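The selection pipeline described in the key points can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the entropy thresholds, the length of the low-entropy run that closes a phase, and the entropy-mass weighting of phase positions are all illustrative assumptions, since the summary above does not specify them.

```python
def high_entropy_phases(entropies, high_thresh=2.0, low_thresh=0.5, low_run=3):
    """Segment a per-token entropy sequence into High Entropy Phases (HEPs).

    A phase opens at a token whose entropy reaches high_thresh and closes
    once low_run consecutive tokens fall below low_thresh. All three
    parameters are illustrative guesses, not the paper's values.
    """
    phases, start, low_count = [], None, 0
    for i, h in enumerate(entropies):
        if start is None:
            if h >= high_thresh:
                start, low_count = i, 0
        elif h < low_thresh:
            low_count += 1
            if low_count == low_run:
                phases.append((start, i - low_run))  # end just before the low run
                start = None
        else:
            low_count = 0
    if start is not None:  # phase still open at the end of generation
        phases.append((start, len(entropies) - 1))
    return phases

def entropy_centroid(entropies, phases):
    """Weighted average position of the HEPs along the trajectory, in [0, 1].

    Here each phase is weighted by its total entropy mass -- one plausible
    weighting; the paper's exact definition may differ.
    """
    if not phases:
        return 0.0
    n = max(len(entropies) - 1, 1)
    total = weighted = 0.0
    for s, e in phases:
        mass = sum(entropies[s:e + 1])
        weighted += mass * ((s + e) / 2 / n)  # normalized phase midpoint
        total += mass
    return weighted / total

def lowest_centroid(candidate_entropies):
    """'Lowest Centroid' selection: return the index of the candidate whose
    uncertainty mass sits earliest in its generation trajectory."""
    return min(range(len(candidate_entropies)),
               key=lambda i: entropy_centroid(
                   candidate_entropies[i],
                   high_entropy_phases(candidate_entropies[i])))
```

In a real test-time-scaling setup, each candidate's per-token entropies would come from the model's output distributions during sampling; among N sampled responses, the one with the lowest centroid (uncertainty resolved early, confident continuation afterward) would be returned.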