The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

arXiv cs.AI / 5/1/2026


Key Points

  • The study finds that strategically choosing LLMs and reasoning settings can outperform ensembling approaches for automated scoring accuracy.
  • Temperature-based sampling (stochastic calls) improved scoring accuracy over deterministic generation, while increasing the self-consistency ensemble size from j = 1 to j = 7 yielded no significant gains (see the sketch after this list).
  • Raising “reasoning effort” produced a significant positive linear trend in scoring accuracy, though the magnitude of the benefit depended on the model family.
  • An efficiency frontier analysis compares configurations by accuracy vs. cost, identifying Gemini 3.1 Pro Preview with low reasoning as the most accurate yet expensive option, and GPT-5.4 Nano/Mini without reasoning as the best cost-performance trade-off.
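A minimal sketch of the self-consistency procedure the study evaluates: j independent, temperature-sampled calls to the same model, combined by majority vote. Here `score_once` is a hypothetical stand-in for the actual API call and scoring rubric, which this summary does not specify, and the toy scorer in the usage example is illustrative only.

```python
from collections import Counter
from typing import Callable

def self_consistency_score(
    score_once: Callable[[str, float], int],
    conversation: str,
    j: int = 7,
    temperature: float = 1.0,
) -> int:
    """Majority vote over j stochastic scoring calls (self-consistency).

    score_once is a hypothetical wrapper around one temperature-sampled
    LLM call that returns an integer score for a student conversation.
    """
    votes = [score_once(conversation, temperature) for _ in range(j)]
    # Counter.most_common returns the modal score; ties break by
    # first-encountered order, which is arbitrary here.
    return Counter(votes).most_common(1)[0][0]

if __name__ == "__main__":
    import random
    random.seed(0)

    def toy_scorer(conversation: str, temperature: float) -> int:
        # Stand-in for a real model call: noisy scores centered on 2.
        return random.choice([2, 2, 2, 3])

    print(self_consistency_score(toy_scorer, "student transcript ...", j=7))
```

Because votes are combined by plain majority, the ensemble only helps when individual stochastic calls disagree in ways that cancel out; the finding that j = 7 adds no significant accuracy over j = 1 suggests single temperature-sampled calls were already near the ensemble ceiling for these items.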

Abstract

Strategic model selection and reasoning settings are more effective than ensembling for optimizing automated scoring with large language models (LLMs). We examined self-consistency (intra-model majority voting) and reasoning effort for scoring conversation-based assessment items in high school mathematics, evaluating 900 student conversations against human-assigned ground-truth scores using frontier and low-cost models from OpenAI and Google. Temperature sampling significantly improved accuracy over deterministic calls, but increasing ensemble size (j = 1 to 7) produced no significant gains. Higher reasoning effort showed a significant positive linear trend with scoring accuracy, though the benefit varied by model family. An efficiency frontier analysis identified Gemini 3.1 Pro Preview at low reasoning as the most accurate but costly configuration; GPT-5.4 Nano and Mini with no reasoning offered the best cost-performance balance.
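The efficiency frontier mentioned in the abstract is the set of configurations not dominated on both axes, i.e., configurations for which no alternative is simultaneously cheaper and at least as accurate. A minimal sketch of that computation follows; the cost and accuracy numbers are illustrative placeholders, not the paper's measurements.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    name: str
    cost_per_1k: float  # illustrative USD cost per 1,000 scored conversations
    accuracy: float     # agreement with human-assigned scores

def efficiency_frontier(configs: list[Config]) -> list[Config]:
    """Return Pareto-efficient configurations: those not dominated by any
    cheaper-and-at-least-as-accurate alternative."""
    # Sort by cost ascending; break cost ties by higher accuracy first.
    by_cost = sorted(configs, key=lambda c: (c.cost_per_1k, -c.accuracy))
    frontier, best_acc = [], float("-inf")
    for cfg in by_cost:
        if cfg.accuracy > best_acc:  # strictly beats everything cheaper
            frontier.append(cfg)
            best_acc = cfg.accuracy
    return frontier

# Illustrative numbers only -- not results from the paper.
configs = [
    Config("GPT-5.4 Nano, no reasoning", 0.5, 0.78),
    Config("GPT-5.4 Mini, no reasoning", 1.2, 0.80),
    Config("Gemini 3.1 Pro Preview, low reasoning", 9.0, 0.86),
    Config("Hypothetical model, high reasoning", 15.0, 0.85),  # dominated
]
for cfg in efficiency_frontier(configs):
    print(f"{cfg.name}: acc={cfg.accuracy:.2f}, cost=${cfg.cost_per_1k}/1k")
```

Sorting by cost and keeping each configuration that strictly raises the best accuracy seen so far yields the frontier in a single pass; dominated configurations (more expensive yet no more accurate) drop out automatically.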