The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost
arXiv cs.AI / 5/1/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study finds that strategically choosing LLMs and reasoning settings can outperform ensembling approaches for automated scoring accuracy.
- Temperature-based sampling (stochastic calls) improved scoring accuracy compared with deterministic generation, while increasing self-consistency ensemble size from j=1 to j=7 did not yield significant gains.
- Raising “reasoning effort” led to a significant positive linear improvement in scoring accuracy, but the magnitude of benefit depended on the model family.
- An efficiency frontier analysis compares configurations on accuracy versus cost, identifying Gemini 3.1 Pro Preview with low reasoning as the most accurate but also the most expensive option, and GPT-5.4 Nano/Mini without reasoning as the best cost-performance trade-off.
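The efficiency frontier in the last point is the set of configurations not dominated by any cheaper, at-least-as-accurate alternative. A minimal sketch of how such a frontier can be computed is below; the configuration names and the cost/accuracy numbers are illustrative placeholders, not figures from the paper.

```python
# Hypothetical configurations: (name, cost in USD per 1k items, scoring accuracy).
# All values are made up for illustration.
configs = [
    ("gemini-pro-low-reasoning", 12.0, 0.91),
    ("gpt-nano-no-reasoning", 0.4, 0.84),
    ("gpt-mini-no-reasoning", 1.5, 0.86),
    ("gpt-full-high-reasoning", 9.0, 0.89),
    ("gpt-full-low-reasoning", 10.0, 0.88),  # dominated: costlier and less accurate
]

def efficiency_frontier(configs):
    """Return configurations that are Pareto-optimal on (cost, accuracy)."""
    # Sort by ascending cost; break ties by preferring higher accuracy.
    by_cost = sorted(configs, key=lambda c: (c[1], -c[2]))
    frontier, best_acc = [], float("-inf")
    for name, cost, acc in by_cost:
        # Keep a config only if it beats every cheaper config on accuracy.
        if acc > best_acc:
            frontier.append((name, cost, acc))
            best_acc = acc
    return frontier

frontier = efficiency_frontier(configs)
```

With these toy numbers, the dominated `gpt-full-low-reasoning` configuration drops off the frontier, while the cheapest and the most accurate options both remain on it, mirroring the trade-off the study describes.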