Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
arXiv cs.CL / 5/4/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that evaluating LLM-based automatic short answer grading (ASAG) only with aggregate metrics (e.g., macro-F1, Cohen’s kappa) misses how performance changes across responses with different grading difficulty.
- It proposes an evaluation framework using item response theory (IRT) to model grading correctness as a function of latent grader ability and each response’s difficulty, enabling response-level diagnostics of success and failure (a minimal sketch of this setup follows the key points).
- Experiments on SciEntsBank and Beetle using 17 open-weight LLMs show that models with similar overall scores can differ greatly in how quickly their accuracy drops as response difficulty increases.
- The study finds that, for difficult responses, errors disproportionately map to the `partially_correct_incomplete` label, suggesting “intermediate-label collapse” under ambiguity.
- It further characterizes difficult responses by linking higher estimated difficulty to weaker semantic alignment with the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space (the second sketch below illustrates two of these descriptors).
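
To make the IRT framing concrete, here is a minimal sketch of a Rasch-style (1PL) fit over a grader-by-response correctness matrix. The paper does not publish its estimation procedure, so the parameterization (logistic link, gradient ascent, zero-mean difficulty anchor), the function name `fit_rasch`, and the toy data are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def fit_rasch(correct, n_iters=2000, lr=0.05):
    """Fit a Rasch (1PL) model to a grader-by-response correctness matrix.

    correct : (n_graders, n_responses) array of 0/1 outcomes
              (1 = the LLM grader reproduced the gold label).
    Returns (ability, difficulty): latent grader abilities theta_i and
    response difficulties b_j, with P(correct) = sigmoid(theta_i - b_j).
    """
    n_graders, n_responses = correct.shape
    theta = np.zeros(n_graders)      # latent grader ability
    b = np.zeros(n_responses)        # latent response difficulty

    for _ in range(n_iters):
        # Predicted probability of a correct grade for each (grader, response) pair.
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = correct - p          # gradient of the Bernoulli log-likelihood
        theta += lr * resid.sum(axis=1) / n_responses
        b -= lr * resid.sum(axis=0) / n_graders
        b -= b.mean()                # anchor the scale: difficulties sum to zero

    return theta, b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 17 graders, 500 responses, simulated from a Rasch model.
    true_theta = rng.normal(0, 1, 17)
    true_b = rng.normal(0, 1, 500)
    p_true = 1 / (1 + np.exp(-(true_theta[:, None] - true_b[None, :])))
    correct = rng.binomial(1, p_true)

    ability, difficulty = fit_rasch(correct)
    print("estimated ability range:", ability.min().round(2), ability.max().round(2))
```

Binning responses by the estimated difficulty b and plotting each grader's empirical accuracy within bins yields the kind of per-model degradation curves the key points describe, which aggregate metrics alone cannot separate.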
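
The difficulty-characterization step can likewise be sketched with off-the-shelf sentence embeddings. The embedding model, helper function, and rank-correlation check below are assumptions for illustration; the paper's exact descriptors may be computed differently, and the NLI-based contradiction signal is omitted here because it needs a separate entailment model.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def difficulty_correlates(responses, references, difficulty, k=5):
    """Relate estimated IRT difficulty b_j to two descriptors from the paper:
    (1) semantic alignment: cosine similarity to the reference answer, and
    (2) semantic isolation: mean cosine distance to the k nearest other responses.
    `difficulty` is the per-response vector returned by fit_rasch above.
    """
    model = SentenceTransformer("all-MiniLM-L6-v2")
    resp_emb = model.encode(responses, normalize_embeddings=True)
    ref_emb = model.encode(references, normalize_embeddings=True)

    # Alignment: cosine similarity of each response to its own reference answer.
    alignment = np.sum(resp_emb * ref_emb, axis=1)

    # Isolation: average cosine distance to the k most similar other responses.
    sims = resp_emb @ resp_emb.T
    np.fill_diagonal(sims, -np.inf)               # exclude self-similarity
    top_k = np.sort(sims, axis=1)[:, -k:]
    isolation = 1.0 - top_k.mean(axis=1)

    def rank_corr(x, y):
        # Spearman-style rank correlation using numpy only.
        rx, ry = np.argsort(np.argsort(x)), np.argsort(np.argsort(y))
        return np.corrcoef(rx, ry)[0, 1]

    return {
        "difficulty_vs_alignment": rank_corr(difficulty, alignment),  # expected negative
        "difficulty_vs_isolation": rank_corr(difficulty, isolation),  # expected positive
    }
```

Under the paper's findings, the first correlation should come out negative (harder responses align less with the reference) and the second positive (harder responses sit more isolated in embedding space).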