AI Navigate

Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

arXiv cs.CL / 3/20/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The study defines entropy-trajectory monotonicity (a chain is monotone if its per-step answer-distribution entropy decreases at every reasoning step) and shows that, by sampling a few answer completions per step, monotone chains achieve 68.8% accuracy versus 46.8% for non-monotone chains on GSM8K with Qwen2.5-7B-Instruct.
  • It finds that total entropy reduction is not predictive of correctness (rho = -0.06, p = 0.31), indicating a shape-over-magnitude effect: the presence of a monotonic entropy decline matters more than how much entropy decreases overall.
  • Across monotonicity violation counts (0/1/2), accuracy falls from 68.8% to 50.8% to 28.6%, showing that performance degrades with each additional violation of strict monotonicity.
  • Monotonicity provides a +5.8 percentage-point gain at 73.7% coverage and remains cost-effective, requiring about 1,500 tokens per question—roughly one-eighth the cost of 40-chain self-consistency.
  • The results replicate on Mistral-7B (n=300): monotone chains reach 72.3% versus 37.6% (+34.7 pp; OR=4.33), suggesting the phenomenon generalizes across models.
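The monotonicity check described above can be sketched in a few lines: estimate the answer-distribution entropy at each step from a handful of sampled completions, then count how often entropy fails to decrease. This is a minimal illustration, not the paper's implementation; the sample size per step and the handling of entropy ties are assumptions.

```python
import math
from collections import Counter

def step_entropy(sampled_answers):
    """Shannon entropy (nats) of the empirical answer distribution at one
    reasoning step, estimated from a few sampled completions."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def violation_count(entropies):
    """Number of consecutive steps where entropy fails to strictly decrease.
    Treating ties as violations is an assumption of this sketch."""
    return sum(1 for prev, cur in zip(entropies, entropies[1:]) if cur >= prev)

def is_monotone(entropies):
    """A chain is monotone if per-step entropy decreases at every step."""
    return violation_count(entropies) == 0
```

For example, a chain whose sampled answers converge step by step (e.g. `["12","12","7","9"]`, then `["12","12","12","9"]`, then `["12","12","12","12"]`) yields strictly decreasing entropies and is classified as monotone; any step where the answer distribution broadens again registers as a violation.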

Abstract

Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps, captured by sampling a few answer completions per step, predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher's p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ = -0.06, p = 0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation counts of 0/1/2 give 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186 → 0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at roughly 1,500 tokens per question, one-eighth the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.