CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading
arXiv cs.CL / 3/13/2026
Opinion · Tools & Practical Usage · Models & Research
Key Points
- CHiL(L)Grader is a calibrated human-in-the-loop grading framework that combines uncertainty estimation with human review to improve trustworthiness in automated short-answer scoring.
- It employs post-hoc temperature scaling, confidence-based selective prediction, and continual learning to automatically grade only high-confidence responses and route uncertain cases to human graders.
- On three short-answer datasets, it auto-scores 35-65% of responses at expert-level quality (quadratic weighted kappa, QWK >= 0.80), demonstrating effective use of uncertainty quantification in educational AI.
- Each correction cycle uses teacher feedback to strengthen the model's grading ability and adapt to evolving rubrics and unseen questions.
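The core routing idea above — temperature-scale the model's logits, then auto-grade only when the top-class confidence clears a threshold — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the temperature value, threshold, and `route` helper are assumptions for the example.

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T before softmax; T > 1 softens
    # overconfident predictions (post-hoc temperature scaling).
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, T, threshold):
    # Confidence-based selective prediction: auto-grade only
    # when calibrated confidence clears the threshold,
    # otherwise defer to a human grader.
    probs = softmax_with_temperature(logits, T)
    conf = max(probs)
    label = probs.index(conf)
    return ("auto" if conf >= threshold else "human", label, conf)

# Raw logits look confident; with T=2.0 the calibrated
# confidence drops below 0.9, so the response is deferred.
decision, label, conf = route([4.0, 1.0, 0.5], T=2.0, threshold=0.9)
```

In practice, T would be fit on a held-out validation set by minimizing negative log-likelihood, and the threshold chosen to hit a target auto-grading quality (e.g., QWK >= 0.80 on the auto-scored subset).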