SciEval: A Benchmark for Automatic Evaluation of K-12 Science Instructional Materials
arXiv cs.AI / 4/29/2026
Key Points
- The paper introduces SciEval, a benchmark dataset for automatically evaluating K-12 science instructional materials, targeting the scalability problem of expert-led reviews.
- It formulates Automatic Instructional Materials Evaluation (AIME) as a generative AI task in which the model predicts rubric-based scores and supporting evidence aligned with the educator-designed EQuIP criteria (see the sketch after this list), built on an expert-annotated dataset of 273 lesson-level materials.
- Testing mainstream LLMs (GPT, Gemini, Llama, Qwen) on SciEval shows that none delivers strong performance out of the box, revealing reliability gaps in this education-domain evaluation task.
- Fine-tuning Qwen3 on SciEval yields up to 11% performance gains on a held-out test set, suggesting that domain-specific training is key to making LLM-based automated evaluation viable for instructional materials.
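To make the AIME task formulation concrete, here is a minimal sketch of how a generative model could be asked to produce a rubric-based score and evidence for one material per criterion. The criterion names, the 0-3 score scale, the prompt wording, and the `call_llm` helper are illustrative assumptions, not the paper's exact rubric or prompting setup.

```python
# Sketch of the AIME task: for each rubric criterion, a generative model
# returns a score plus an evidence quote from the lesson material.
# Scale, prompt, and call_llm are assumptions for illustration only.
import json
from dataclasses import dataclass


@dataclass
class CriterionJudgment:
    criterion: str  # an EQuIP-style criterion (name assumed)
    score: int      # rubric score (scale assumed: 0-3)
    evidence: str   # quote from the material supporting the score


PROMPT_TEMPLATE = """You are evaluating a K-12 science lesson against the
criterion "{criterion}". Return JSON: {{"score": <0-3>, "evidence": "<quote>"}}.

Lesson material:
{material}
"""


def evaluate_material(material: str, criteria: list[str], call_llm) -> list[CriterionJudgment]:
    """Score one material on each rubric criterion via a generative model.

    `call_llm` is a hypothetical callable (prompt -> JSON string) standing
    in for whichever model (GPT, Gemini, Llama, Qwen) is under test.
    """
    judgments = []
    for criterion in criteria:
        raw = call_llm(PROMPT_TEMPLATE.format(criterion=criterion, material=material))
        parsed = json.loads(raw)
        judgments.append(CriterionJudgment(criterion, int(parsed["score"]), parsed["evidence"]))
    return judgments
```

Under this framing, benchmarking reduces to comparing the predicted judgments against the expert annotations on the 273 materials, and the same expert labels are the training signal that the Qwen3 fine-tuning exploits.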