S-GRADES -- Studying Generalization of Student Response Assessments in Diverse Evaluative Settings
arXiv cs.CL / 3/12/2026
Key Points
- S-GRADES is a new web-based benchmark that unifies 14 grading datasets for automated essay scoring and automatic short answer grading under a single interface with standardized access and reproducible evaluation protocols.
- The benchmark is open-source and extensible, enabling ongoing addition of datasets and evaluation settings.
- The authors evaluate three state-of-the-art large language models on S-GRADES under multiple prompting strategies, and additionally study exemplar selection and cross-dataset exemplar transfer for few-shot grading.
- Analyses reveal reliability and generalization gaps between essay and short-answer grading tasks, underscoring the need for standardized, cross-paradigm assessment in educational NLP.
- By providing a cross-paradigm, reproducible evaluation platform, S-GRADES aims to facilitate more robust model development and fair comparison across educational assessment tasks.
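The summary mentions reproducible evaluation protocols for automated essay scoring. The paper's exact metrics and API are not given here, but quadratic weighted kappa (QWK) is the standard agreement metric in essay-scoring evaluation, so a minimal sketch of it may help ground the key points. This is an illustrative implementation, not S-GRADES code; the function name and score-range parameters are assumptions.

```python
from collections import Counter

def quadratic_weighted_kappa(gold, pred, min_score, max_score):
    """Quadratic weighted kappa: agreement between gold and predicted
    integer scores, penalizing disagreements by squared distance."""
    n = max_score - min_score + 1
    # Observed agreement matrix O[i][j]: count of (gold=i, pred=j) pairs
    O = [[0.0] * n for _ in range(n)]
    for g, p in zip(gold, pred):
        O[g - min_score][p - min_score] += 1
    # Expected counts come from the outer product of the two score histograms
    gold_hist = Counter(g - min_score for g in gold)
    pred_hist = Counter(p - min_score for p in pred)
    total = len(gold)
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic disagreement weight
            e = gold_hist[i] * pred_hist[j] / total  # chance-expected count
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den

# Perfect agreement yields kappa = 1.0; disagreements pull it toward 0.
print(quadratic_weighted_kappa([1, 2, 3], [1, 2, 3], min_score=1, max_score=3))
```

QWK is preferred over plain accuracy for ordinal grading scales because awarding a 2 when the gold score is 3 is a smaller error than awarding a 0, and the quadratic weights encode exactly that.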