Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams
arXiv cs.AI / 4/15/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study evaluates the reliability of AI-assisted rubric scoring for handwritten undergraduate physics responses using GPT-4o, comparing results with instructor ratings across two scoring rounds.
- Human–AI agreement on total scores was comparable to human inter-rater reliability overall, but agreement dropped for mid-level performances, where reasoning is partial or ambiguous (see the agreement sketch after this list).
- Criterion-level results showed stronger alignment for clearly defined conceptual skills than for longer, more subjective procedural judgments.
- A finer-grained, checklist-style skill rubric improved scoring consistency over holistic rubrics, indicating that rubric structure is the primary driver of reliability (see the prompt sketch after this list).
- Systematic tests found that prompting format had a secondary effect and that model temperature had relatively limited impact, yielding practical recommendations for implementing reliable LLM-assisted STEM scoring.
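
The paper quantifies reliability by comparing human–AI agreement with human–human agreement. A minimal sketch of how such a comparison is commonly computed, using quadratically weighted Cohen's kappa from scikit-learn; the metric choice and the score values below are illustrative assumptions, not figures from the paper:

```python
# Illustrative comparison of rater agreement via quadratically
# weighted Cohen's kappa (scikit-learn). Score values are made up.
from sklearn.metrics import cohen_kappa_score

# Total rubric scores (0-10) from two human raters and the model
# for the same set of responses -- hypothetical data.
human_a = [8, 5, 9, 3, 6, 7, 4, 10, 5, 6]
human_b = [8, 4, 9, 3, 5, 7, 5, 10, 6, 6]
model   = [8, 6, 9, 2, 4, 7, 4, 10, 7, 6]

# Quadratic weighting penalizes large disagreements more heavily
# than near-misses, which suits ordinal rubric scales.
kappa_hh = cohen_kappa_score(human_a, human_b, weights="quadratic")
kappa_ha = cohen_kappa_score(human_a, model, weights="quadratic")

print(f"human-human kappa: {kappa_hh:.2f}")
print(f"human-AI kappa:    {kappa_ha:.2f}")
```

If the human–AI kappa is close to the human–human kappa, the model is agreeing with instructors about as often as instructors agree with each other, which is the comparison the study reports.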
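The checklist-rubric finding suggests prompting the model with discrete binary criteria rather than a holistic scale. Below is a hedged sketch of what such a grading call could look like with the OpenAI Python client; the rubric items, prompt wording, and JSON output format are hypothetical, and while the paper grades handwritten work (presumably supplied as images), this sketch uses a transcribed text answer for simplicity:

```python
# Hypothetical checklist-style rubric scoring with the OpenAI
# Python client. The rubric items and prompt are illustrative,
# not the paper's actual protocol.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Fine-grained, binary rubric items (assumed example for a
# projectile-motion problem).
RUBRIC = [
    "Identifies the relevant physical principle (projectile motion).",
    "Decomposes the initial velocity into components.",
    "Applies the correct kinematic equation for vertical motion.",
    "Arrives at the correct numeric answer with units.",
]

def score_response(transcribed_answer: str, temperature: float = 0.0) -> list[int]:
    """Ask the model to mark each rubric item 0/1 and return the checklist."""
    checklist = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(RUBRIC))
    prompt = (
        "You are grading a physics exam answer against a checklist rubric.\n"
        f"Rubric items:\n{checklist}\n\n"
        f"Student answer:\n{transcribed_answer}\n\n"
        "Return a JSON array of 0/1 values, one per rubric item, and nothing else."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,  # the study reports temperature had limited impact
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```

A deterministic setting such as `temperature=0.0` is a reasonable default here, since the study found temperature had relatively little effect on scoring reliability.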