Confidence Estimation in Automatic Short Answer Grading with LLMs

arXiv cs.CL / 5/4/2026

Key Points

  • The paper studies how to estimate confidence reliably for Automatic Short Answer Grading (ASAG) with generative LLMs, with the goal of supporting safe human-AI collaboration in educational decision-making.
  • It compares three model-based confidence estimation approaches (verbalizing, latent, and consistency-based) and finds that model-based signals alone do not capture ASAG uncertainty reliably; a sketch of the consistency-based strategy follows this list.
  • The authors propose a hybrid framework that combines model-based confidence with an explicit estimate of dataset-derived (aleatoric) uncertainty.
  • Aleatoric uncertainty is estimated by clustering semantically embedded student responses and measuring heterogeneity within each cluster.
  • Experiments show that the hybrid confidence measure improves both the reliability of confidence estimates and selective grading performance compared with single-source methods.
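
As a concrete illustration, here is a minimal sketch of the consistency-based strategy: the same answer is graded several times at a sampling temperature above zero, and the fraction of samples agreeing with the majority grade serves as the confidence score. The `grade_once` callable is a hypothetical stand-in for an LLM grading call, not an API from the paper.

```python
from collections import Counter
from typing import Callable

def consistency_confidence(
    grade_once: Callable[[str, str], str],  # hypothetical LLM call: (question, answer) -> grade label
    question: str,
    answer: str,
    n_samples: int = 10,
) -> tuple[str, float]:
    """Sample the grader repeatedly and use label agreement as confidence."""
    labels: list[str] = [grade_once(question, answer) for _ in range(n_samples)]
    majority_label, count = Counter(labels).most_common(1)[0]
    # Confidence = fraction of samples that agree with the majority grade.
    return majority_label, count / n_samples
```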

Abstract

Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted assessment systems.
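
The clustering-based estimate of aleatoric uncertainty described above can be pictured with a short sketch. Assuming semantic embeddings and reference grades are available for a pool of student responses, the code below clusters the embedded responses and assigns each response the grade-label entropy of its cluster: mixed grades among semantically similar answers signal dataset-derived ambiguity. The use of KMeans and entropy here is an illustrative choice, not necessarily the paper's exact operationalization.

```python
import numpy as np
from sklearn.cluster import KMeans

def aleatoric_uncertainty(
    embeddings: np.ndarray,  # (n_answers, dim) semantic embeddings of student responses
    labels: np.ndarray,      # (n_answers,) reference grades for the same responses
    n_clusters: int = 20,    # illustrative choice of cluster count
) -> np.ndarray:
    """Assign each response the grade-label entropy of its semantic cluster."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    uncertainty = np.zeros(len(labels))
    for c in range(n_clusters):
        members = clusters == c
        _, counts = np.unique(labels[members], return_counts=True)
        probs = counts / counts.sum()
        entropy = -(probs * np.log(probs)).sum()  # 0 when the cluster is label-pure
        uncertainty[members] = entropy
    return uncertainty
```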
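Finally, a sketch of how the hybrid score might feed selective grading: model-based confidence is discounted by the normalized aleatoric estimate, and items falling below a threshold are deferred to a human grader. Both the product combination rule and the 0.8 threshold are assumptions made for illustration; the paper's actual combination may differ.

```python
import numpy as np

def hybrid_confidence(model_conf: np.ndarray, aleatoric: np.ndarray) -> np.ndarray:
    """Illustrative combination: discount model confidence where the data are ambiguous."""
    # Normalize aleatoric uncertainty to [0, 1]; the product rule is an assumption.
    scaled = aleatoric / max(aleatoric.max(), 1e-12)
    return model_conf * (1.0 - scaled)

def selective_grading(confidence: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Boolean mask: True = accept the LLM grade, False = defer to a human grader."""
    return confidence >= threshold
```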