Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory

arXiv cs.AI / 4/22/2026


Key Points

  • Automated essay scoring (AES) benchmarks commonly report quadratic weighted kappa (QWK), but the paper argues that label noise from human raters makes it unclear what QWK is theoretically achievable and what is practically “enough” for real deployment.
  • Using classical test theory reliability, the authors derive two QWK ceilings that are dataset-specific and can be estimated from standard two-rater benchmarks without collecting extra annotations.
  • The “theoretical ceiling” represents the best QWK an ideal AES model could reach when it perfectly predicts latent true scores, accounting for rater label noise.
  • The “human-like ceiling” estimates the QWK achievable by a model that matches human-level scoring error, serving as a practical target especially when replacing a single human rater.
  • The study finds that human–human QWK, often used as a reference ceiling, can underestimate the true attainable ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks show how much headroom current AES models have left.
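
For readers unfamiliar with the metric, the following is a minimal sketch of quadratic weighted kappa computed from the confusion matrix of two integer score lists. It is the standard textbook definition, not code from the paper; the function name and the `min_score`/`max_score` parameters are chosen here for illustration.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two equal-length lists of integer scores.

    Illustrative helper (not from the paper). Scores must lie in
    [min_score, max_score].
    """
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(a)

    # Observed confusion matrix: observed[i][j] counts essays scored
    # i by the first rater and j by the second (offsets from min_score).
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement weight
            weighted_observed += weight * observed[i][j]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / n

    if weighted_expected == 0:
        raise ValueError("QWK is undefined when both raters give one constant score")
    return 1.0 - weighted_observed / weighted_expected
```

Identical score lists give a QWK of 1, and agreement no better than chance gives a value near 0. In benchmark practice `a` would typically hold model predictions and `b` the human labels.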

Abstract

Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human–human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.
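
The claim that human–human QWK can sit below the true ceiling can be illustrated with a small simulation under the classical test theory model, in which each observed score is a latent true score plus independent rater error. The sketch below is not the paper's estimator or experimental setup: the score range, the noise levels, and all function names are assumptions made for illustration. It uses the closed form QWK = 2·cov / (var_a + var_b + (mean_a − mean_b)²), which is algebraically equivalent to the confusion-matrix definition for integer scores.

```python
import random


def qwk(a, b):
    """QWK via the closed form 2*cov / (var_a + var_b + (mean_a - mean_b)**2)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return 2 * cov / (var_a + var_b + (mean_a - mean_b) ** 2)


def to_score(value, lo=1, hi=6):
    """Round a continuous value to an integer score on an assumed 1-6 scale."""
    return max(lo, min(hi, round(value)))


def simulate(n_essays=20000, true_sd=1.0, error_sd=0.7, seed=0):
    """Compare human-human QWK with the QWK of a hypothetical ideal model.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(3.5, true_sd) for _ in range(n_essays)]

    # Two human raters: same latent true score, independent scoring errors.
    rater1 = [to_score(t + rng.gauss(0, error_sd)) for t in true_scores]
    rater2 = [to_score(t + rng.gauss(0, error_sd)) for t in true_scores]

    # Hypothetical ideal model: outputs the (rounded) true score, no error.
    ideal = [to_score(t) for t in true_scores]

    return {
        "human_human": qwk(rater1, rater2),
        "ideal_vs_rater": qwk(ideal, rater1),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this setup the error-free model agrees with a single noisy rater more closely than two noisy raters agree with each other, because the human–human comparison contains two sources of scoring error rather than one. This mirrors the standard classical test theory result that, for continuous scores, the correlation between two parallel raters equals the reliability while the correlation between the true score and one rater equals its square root.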