Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory

arXiv cs.CL / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that evaluating LLM-based automatic short answer grading (ASAG) only with aggregate metrics (e.g., macro-F1, Cohen’s kappa) misses how performance changes across responses with different grading difficulty.
  • It proposes an evaluation framework using item response theory (IRT) to model grading correctness as a function of latent grader ability and each response’s grading difficulty (sketched after this list), enabling response-level diagnostics of where graders succeed and fail.
  • Experiments on SciEntsBank and Beetle using 17 open-weight LLMs show that models with similar overall scores can differ greatly in how quickly their accuracy drops as response difficulty increases.
  • The study finds that, for difficult responses, errors disproportionately map to the `partially_correct_incomplete` label, suggesting “intermediate-label collapse” under ambiguity.
  • It further characterizes difficult responses by linking higher estimated difficulty to weaker semantic alignment with the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space.
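
For context, a standard item response theory formulation of this setup treats each grading decision as a Bernoulli outcome whose probability depends on the gap between grader ability and response difficulty. The Rasch-style (one-parameter logistic) form below is an illustrative sketch only; the paper's exact parameterization (for example, whether it adds discrimination parameters) is not specified in this summary.

```latex
% Illustrative Rasch-style (1PL) model of grading correctness.
% y_{ij} = 1 if LLM grader j assigns the correct label to student response i.
% \theta_j : latent grading ability of grader j
% b_i      : latent grading difficulty of response i
P(y_{ij} = 1 \mid \theta_j, b_i)
  = \sigma(\theta_j - b_i)
  = \frac{1}{1 + \exp\bigl(-(\theta_j - b_i)\bigr)}
```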

Abstract

Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen's kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the `partially_correct_incomplete` label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
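
As a concrete (and hypothetical) illustration of how such latent parameters can be estimated, the sketch below fits grader abilities and response difficulties to a binary correctness matrix by gradient-based maximum likelihood on synthetic data. It assumes the Rasch-style form sketched above; the authors may use a different IRT variant or estimation procedure, and all names, sizes, and data here are invented for demonstration.

```python
# Illustrative sketch (not the paper's code): jointly estimate grader ability
# (theta) and response difficulty (b) from a binary correctness matrix using a
# Rasch-style model fit by gradient ascent on the log-likelihood.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "ground truth": 17 graders, 300 student responses.
n_graders, n_responses = 17, 300
true_theta = rng.normal(0.0, 1.0, n_graders)    # grader abilities
true_b = rng.normal(0.0, 1.0, n_responses)      # response difficulties
logits = true_theta[:, None] - true_b[None, :]
y = (rng.random((n_graders, n_responses)) < 1 / (1 + np.exp(-logits))).astype(float)

# Maximum-likelihood fit (weak L2 penalty helps identifiability).
theta = np.zeros(n_graders)
b = np.zeros(n_responses)
lr, l2 = 0.5, 1e-3
for _ in range(2000):
    p = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))  # predicted P(correct)
    resid = y - p                                          # log-likelihood gradient signal
    theta += lr * (resid.mean(axis=1) - l2 * theta)
    b += lr * (-resid.mean(axis=0) - l2 * b)
    b -= b.mean()                                          # fix location (model is shift-invariant)

print("ability correlation:   ", np.corrcoef(theta, true_theta)[0, 1].round(3))
print("difficulty correlation:", np.corrcoef(b, true_b)[0, 1].round(3))
```

With estimates like these in hand, the response-level analyses described above become straightforward: bin responses by estimated difficulty to plot per-model accuracy curves, or correlate the difficulty estimates with semantic features (e.g., similarity to the reference answer) computed separately.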