Temperature-Dependent Performance of Prompting Strategies in Extended Reasoning Large Language Models

arXiv cs.AI / 4/13/2026


Key Points

  • The paper evaluates how sampling temperature and prompting strategy interact in extended reasoning LLMs, focusing on chain-of-thought versus zero-shot prompting.
  • Using Grok-4.1 with extended reasoning on 39 AMO-Bench (IMO-level) math problems, zero-shot prompting peaks at moderate temperatures (59% accuracy at T=0.4 and T=0.7).
  • In contrast, chain-of-thought prompting yields its best results at the temperature extremes (T=0.0 and T=1.0).
  • The study finds that the benefit of extended reasoning grows substantially with temperature, rising from 6x at T=0.0 to 14.3x at T=1.0.
  • Overall, the results argue that temperature should be tuned jointly with prompting strategy rather than defaulting to T=0 for reasoning tasks.
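For readers unfamiliar with the sampling-temperature knob the key points refer to, it rescales the model's output logits before the softmax. This is the standard definition (not specific to this paper):

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

where $z_i$ are the logits and $T$ is the temperature. As $T \to 0$ the distribution concentrates on the highest-logit token (greedy decoding); $T = 1$ leaves the distribution unscaled; higher $T$ flattens it and increases output diversity. The paper's finding is that the best choice of $T$ depends on the prompting strategy, not that one extreme is universally best.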

Abstract

Extended reasoning models represent a transformative shift in Large Language Model (LLM) capabilities by enabling explicit test-time computation for complex problem solving. However, the optimal configuration of sampling temperature and prompting strategy for these systems remains largely underexplored. We systematically evaluate chain-of-thought and zero-shot prompting across four temperature settings (0.0, 0.4, 0.7, and 1.0) using Grok-4.1 with extended reasoning on 39 mathematical problems from AMO-Bench, a challenging International Mathematical Olympiad-level benchmark. We find that zero-shot prompting achieves peak performance at moderate temperatures, reaching 59% accuracy at T=0.4 and T=0.7, while chain-of-thought prompting performs best at the temperature extremes. Most notably, the benefit of extended reasoning increases from 6x at T=0.0 to 14.3x at T=1.0. These results suggest that temperature should be optimized jointly with prompting strategy, challenging the common practice of using T=0 for reasoning tasks.
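The evaluation described in the abstract is a simple grid sweep: two prompting strategies crossed with four temperature settings, scored by accuracy on a fixed problem set. A minimal sketch of that protocol is below; `query_model` is a hypothetical placeholder for a real LLM API call (the paper uses Grok-4.1), and the prompt wordings are illustrative, not taken from the paper.

```python
from itertools import product

# Hypothetical stand-in for a model call; a real experiment would send the
# prompt to an LLM API (e.g. Grok-4.1) at the given temperature instead.
def query_model(prompt: str, temperature: float) -> str:
    return "42"  # placeholder answer for demonstration only

TEMPERATURES = [0.0, 0.4, 0.7, 1.0]  # the four settings studied in the paper
PROMPTS = {  # illustrative wordings, not the paper's exact prompts
    "zero_shot": "Solve the problem. Give only the final answer.\n\n{problem}",
    "chain_of_thought": (
        "Solve the problem step by step, showing your reasoning, "
        "then state the final answer.\n\n{problem}"
    ),
}

def evaluate(problems: list[tuple[str, str]]) -> dict[tuple[str, float], float]:
    """Accuracy for every (prompting strategy, temperature) cell of the grid."""
    results = {}
    for style, temp in product(PROMPTS, TEMPERATURES):
        correct = 0
        for problem, answer in problems:
            prompt = PROMPTS[style].format(problem=problem)
            if query_model(prompt, temp).strip() == answer:
                correct += 1
        results[(style, temp)] = correct / len(problems)
    return results

# Smoke test on dummy problems; AMO-Bench answers would be checked the same way.
grid = evaluate([("1 + 1 = ?", "2"), ("6 * 7 = ?", "42")])
```

Joint tuning then amounts to picking the argmax cell of `grid` rather than fixing T=0 and tuning only the prompt.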