CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
arXiv cs.CL / 3/13/2026
Opinion · Tools & Practical Usage · Models & Research
Key Points
- CHiL(L)Grader is a calibrated human-in-the-loop grading framework that combines uncertainty estimation with human review to improve trustworthiness in automated short-answer scoring.
- It employs post-hoc temperature scaling, confidence-based selective prediction, and continual learning to automatically grade only high-confidence responses and route uncertain cases to human graders.
- On three short-answer datasets, it auto-scores 35-65% of responses at expert-level quality (quadratic weighted kappa, QWK >= 0.80), demonstrating effective use of uncertainty quantification in educational AI.
- Each correction cycle uses teacher feedback to strengthen the model's grading ability and adapt to evolving rubrics and unseen questions.
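The core routing idea above — temperature-scale the model's logits, then auto-grade only when the top-class confidence clears a threshold — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the temperature value, threshold, and `route` helper are assumptions for the example.

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T before softmax; T > 1 softens
    # overconfident predictions (post-hoc temperature scaling).
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, T, threshold):
    # Confidence-based selective prediction: auto-grade only
    # when calibrated confidence clears the threshold,
    # otherwise defer to a human grader.
    probs = softmax_with_temperature(logits, T)
    conf = max(probs)
    label = probs.index(conf)
    return ("auto" if conf >= threshold else "human", label, conf)

# Raw logits look confident; with T=2.0 the calibrated
# confidence drops below 0.9, so the response is deferred.
decision, label, conf = route([4.0, 1.0, 0.5], T=2.0, threshold=0.9)
```

In practice, T would be fit on a held-out validation set by minimizing negative log-likelihood, and the threshold chosen to hit a target auto-grading quality (e.g., QWK >= 0.80 on the auto-scored subset).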