Confidence Estimation in Automatic Short Answer Grading with LLMs

arXiv cs.CL / 5/4/2026

Key Points

  • The paper studies how to estimate confidence reliably for Automatic Short Answer Grading (ASAG) with generative LLMs, with the goal of supporting safe human-AI collaboration in educational decision-making.
  • It compares three model-based confidence estimation approaches (verbalizing, latent, and consistency-based) and finds that model-based signals alone do not capture ASAG uncertainty reliably; a sketch of the consistency-based strategy follows this list.
  • The authors propose a hybrid framework that combines model-based confidence with an explicit estimate of dataset-derived (aleatoric) uncertainty.
  • Aleatoric uncertainty is estimated by clustering semantically embedded student responses and measuring heterogeneity within each cluster.
  • Experiments show that the hybrid confidence measure improves both the reliability of confidence estimates and selective grading performance compared with single-source methods.
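
As a concrete illustration, here is a minimal sketch of the consistency-based strategy: the same answer is graded several times at a sampling temperature above zero, and the fraction of samples agreeing with the majority grade serves as the confidence score. The `grade_once` callable is a hypothetical stand-in for an LLM grading call, not an API from the paper.

```python
from collections import Counter
from typing import Callable

def consistency_confidence(
    grade_once: Callable[[str, str], str],  # hypothetical LLM call: (question, answer) -> grade label
    question: str,
    answer: str,
    n_samples: int = 10,
) -> tuple[str, float]:
    """Sample the grader repeatedly and use label agreement as confidence."""
    labels: list[str] = [grade_once(question, answer) for _ in range(n_samples)]
    majority_label, count = Counter(labels).most_common(1)[0]
    # Confidence = fraction of samples that agree with the majority grade.
    return majority_label, count / n_samples
```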

Abstract

Automatic Short Answer Grading (ASAG) with generative large language models (LLMs) has recently demonstrated strong performance without task-specific fine-tuning, while also enabling the generation of synthetic feedback for educational assessment. Despite these advances, LLM-based grading remains imperfect, making reliable confidence estimates essential for safe and effective human-AI collaboration in educational decision-making. In this work, we investigate confidence estimation for ASAG with LLMs by jointly considering model-based confidence signals and dataset-derived uncertainty. We systematically compare three model-based confidence estimation strategies, namely verbalizing, latent, and consistency-based confidence estimation, and show that model-based confidence alone is insufficient to reliably capture uncertainty in ASAG. To address this limitation, we propose a hybrid confidence framework that integrates model-based confidence signals with an explicit estimate of dataset-derived aleatoric uncertainty. Aleatoric uncertainty is operationalized by clustering semantically embedded student responses and quantifying within-cluster heterogeneity. Our results demonstrate that the proposed hybrid confidence measure yields more reliable confidence estimates and improves selective grading performance compared to single-source approaches. Overall, this work advances confidence-aware LLM-based grading for human-in-the-loop assessment, supporting more trustworthy AI-assisted assessment systems.
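
The clustering-based estimate of aleatoric uncertainty described above can be pictured with a short sketch. Assuming semantic embeddings and reference grades are available for a pool of student responses, the code below clusters the embedded responses and assigns each response the grade-label entropy of its cluster: mixed grades among semantically similar answers signal dataset-derived ambiguity. The use of KMeans and entropy here is an illustrative choice, not necessarily the paper's exact operationalization.

```python
import numpy as np
from sklearn.cluster import KMeans

def aleatoric_uncertainty(
    embeddings: np.ndarray,  # (n_answers, dim) semantic embeddings of student responses
    labels: np.ndarray,      # (n_answers,) reference grades for the same responses
    n_clusters: int = 20,    # illustrative choice of cluster count
) -> np.ndarray:
    """Assign each response the grade-label entropy of its semantic cluster."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    uncertainty = np.zeros(len(labels))
    for c in range(n_clusters):
        members = clusters == c
        _, counts = np.unique(labels[members], return_counts=True)
        probs = counts / counts.sum()
        entropy = -(probs * np.log(probs)).sum()  # 0 when the cluster is label-pure
        uncertainty[members] = entropy
    return uncertainty
```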
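Finally, a sketch of how the hybrid score might feed selective grading: model-based confidence is discounted by the normalized aleatoric estimate, and items falling below a threshold are deferred to a human grader. Both the product combination rule and the 0.8 threshold are assumptions made for illustration; the paper's actual combination may differ.

```python
import numpy as np

def hybrid_confidence(model_conf: np.ndarray, aleatoric: np.ndarray) -> np.ndarray:
    """Illustrative combination: discount model confidence where the data are ambiguous."""
    # Normalize aleatoric uncertainty to [0, 1]; the product rule is an assumption.
    scaled = aleatoric / max(aleatoric.max(), 1e-12)
    return model_conf * (1.0 - scaled)

def selective_grading(confidence: np.ndarray, threshold: float = 0.8) -> np.ndarray:
    """Boolean mask: True = accept the LLM grade, False = defer to a human grader."""
    return confidence >= threshold
```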