Confidence Estimation in Automatic Short Answer Grading with LLMs
arXiv cs.CL / 5/4/2026
Key Points
- The paper studies how to estimate confidence reliably for Automatic Short Answer Grading (ASAG) using generative LLMs to support safe human-AI educational decisions.
- It compares three model-based confidence estimation approaches—verbalized, latent, and consistency-based—and finds that model-based signals alone do not capture ASAG uncertainty reliably.
- The authors propose a hybrid framework that combines model-based confidence with an explicit estimate of dataset-derived (aleatoric) uncertainty.
- Aleatoric uncertainty is estimated by clustering semantically embedded student responses and measuring heterogeneity within each cluster.
- Experiments show that the hybrid confidence metric improves both the reliability of confidence estimates and selective grading performance versus single-source methods.
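The summary does not give implementation details, but the pipeline it describes can be sketched in a few lines: estimate per-response aleatoric uncertainty as the normalized grade heterogeneity (here, Shannon entropy) of the cluster a response falls into, blend it with the model's own confidence, and defer low-confidence items to a human grader. A minimal NumPy sketch under stated assumptions: hard cluster assignments, discrete grade labels, a linear combination weight `alpha`, and all function names are illustrative, not the paper's.

```python
import numpy as np

def cluster_entropy(labels):
    """Shannon entropy (bits) of the grade labels within one cluster."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def aleatoric_uncertainty(cluster_ids, grades, n_grade_levels):
    """Per-response aleatoric uncertainty: normalized grade entropy of the
    cluster each response belongs to (0 = homogeneous, 1 = maximally mixed).
    Assumes responses were already clustered on their semantic embeddings."""
    max_h = np.log2(n_grade_levels)  # entropy upper bound for normalization
    u = np.empty(len(grades), dtype=float)
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        u[mask] = cluster_entropy(grades[mask]) / max_h
    return u

def hybrid_confidence(model_conf, aleatoric, alpha=0.5):
    """Illustrative hybrid score: weighted mix of model-based confidence
    and (1 - aleatoric uncertainty). The paper's exact combination rule
    may differ; this linear blend is an assumption."""
    return alpha * np.asarray(model_conf) + (1 - alpha) * (1 - aleatoric)

def selective_grading_mask(confidence, threshold=0.8):
    """Selective grading: auto-grade only responses above the confidence
    threshold; the rest are deferred to a human grader."""
    return np.asarray(confidence) >= threshold
```

For example, a cluster whose responses all received the same grade contributes zero aleatoric uncertainty, so its responses keep their model confidence largely intact, while a cluster with an even grade split pulls the hybrid score down and pushes those responses toward human review.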