Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory
arXiv cs.AI / 4/22/2026
Key Points
- Automated essay scoring (AES) benchmarks commonly report quadratic weighted kappa (QWK), but the paper argues that label noise from human raters makes it unclear what QWK is theoretically achievable and what is practically “enough” for real deployment.
- Using classical test theory reliability, the authors derive two QWK ceilings that are dataset-specific and can be estimated from standard two-rater benchmarks without collecting extra annotations.
- The “theoretical ceiling” is the best QWK an ideal AES model could reach: even a model that perfectly predicts the latent true scores falls short of a QWK of 1, because it is evaluated against noisy human ratings.
- The “human-like ceiling” estimates the QWK achievable by a model that matches human-level scoring error, serving as a practical target especially when replacing a single human rater.
- The study finds that human–human QWK, often used as a reference ceiling, can underestimate the true attainable ceiling, and simulation plus real-benchmark experiments validate the proposed limits.
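
The last point can be illustrated with a small simulation. The sketch below is not the paper's derivation: it assumes a 1 to 6 score scale, Gaussian latent true scores, and independent Gaussian rater noise (all hypothetical choices), then compares human–human QWK with the QWK of an "oracle" that reports the rounded true score and is evaluated against a single noisy rater. The names `quadratic_weighted_kappa` and `simulate` are introduced here for illustration only.

```python
import random


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two lists of integer ratings on [min_score, max_score]."""
    k = max_score - min_score + 1
    n = len(a)
    # Observed confusion matrix between the two sets of ratings.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    return 1.0 - num / den


def simulate(n=5000, noise_sd=0.7, seed=0):
    """Return (human-human QWK, oracle-vs-human QWK) under the assumed noise model."""
    rng = random.Random(seed)
    lo, hi = 1, 6  # assumed score scale

    def to_score(v):
        # Round to the nearest integer score and clip to the scale.
        return max(lo, min(hi, round(v)))

    true = [rng.gauss(3.5, 1.0) for _ in range(n)]  # latent true scores
    rater1 = [to_score(t + rng.gauss(0, noise_sd)) for t in true]
    rater2 = [to_score(t + rng.gauss(0, noise_sd)) for t in true]
    oracle = [to_score(t) for t in true]  # predicts the true score exactly

    human_qwk = quadratic_weighted_kappa(rater1, rater2, lo, hi)
    oracle_qwk = quadratic_weighted_kappa(oracle, rater1, lo, hi)
    return human_qwk, oracle_qwk


if __name__ == "__main__":
    human_qwk, oracle_qwk = simulate()
    print(f"human-human QWK: {human_qwk:.3f}")
    print(f"oracle-human QWK: {oracle_qwk:.3f}")
```

Under these assumptions the oracle's QWK comes out above the human–human QWK, because human–human agreement is degraded by noise from both raters while the oracle's agreement is degraded by noise from only one. The exact gap depends on the assumed noise level and scale, so it should not be read as an estimate of the paper's ceilings.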