LLMs Do Not Grade Essays Like Humans
arXiv cs.CL / 3/26/2026
Key Points
- The paper evaluates whether LLMs can match human essay grades, finding that overall agreement between model-generated scores and human ratings is relatively weak and varies by essay characteristics.
- LLMs show systematic bias versus human raters, tending to give higher scores to short or underdeveloped essays and lower scores to longer essays that have minor grammar or spelling errors.
- The study finds that LLM scores are internally consistent with the feedback they produce: essays receiving more praise are scored higher, while essays receiving more criticism are scored lower.
- The authors conclude that even with coherent scoring/feedback patterns, LLMs rely on different signals than humans, limiting alignment with human grading practices.
- Despite this limited alignment with human scores, the paper suggests LLM-generated feedback can still serve reliably as a supporting signal in automated essay scoring workflows.
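The agreement between model and human scores mentioned above is typically measured with quadratic weighted kappa (QWK), the standard metric in automated essay scoring. Below is a minimal, self-contained sketch of QWK in pure Python; the 1–6 score scale and the example score lists are hypothetical illustrations, not data from the paper.

```python
from collections import Counter

def quadratic_weighted_kappa(human, llm, min_score=1, max_score=6):
    """Quadratic weighted kappa between two lists of integer scores.

    1.0 means perfect agreement; 0.0 means agreement at chance level.
    """
    cats = list(range(min_score, max_score + 1))
    k = len(cats)
    idx = {c: i for i, c in enumerate(cats)}
    n = len(human)
    # Observed confusion matrix between the two raters
    O = [[0.0] * k for _ in range(k)]
    for h, m in zip(human, llm):
        O[idx[h]][idx[m]] += 1
    # Expected matrix from the marginal score distributions
    hc, mc = Counter(human), Counter(llm)
    E = [[hc[cats[i]] * mc[cats[j]] / n for j in range(k)] for i in range(k)]
    # Quadratic disagreement weights: penalty grows with squared distance
    W = [[(i - j) ** 2 / (k - 1) ** 2 for j in range(k)] for i in range(k)]
    num = sum(W[i][j] * O[i][j] for i in range(k) for j in range(k))
    den = sum(W[i][j] * E[i][j] for i in range(k) for j in range(k))
    return 1.0 - num / den

# Hypothetical scores on a 1-6 scale (illustrative only)
human_scores = [3, 4, 5, 2, 4, 3, 5, 2]
llm_scores   = [4, 4, 4, 3, 4, 4, 4, 3]
print(round(quadratic_weighted_kappa(human_scores, llm_scores), 3))
```

A "relatively weak" agreement in this metric would show up as a QWK well below the ~0.7–0.8 range human raters typically achieve with each other on standardized essay rubrics.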
Related Articles
Regulating Prompt Markets: Securities Law, Intellectual Property, and the Trading of Prompt Assets
Dev.to
Mercor competitor Deccan AI raises $25M, sources experts from India
Dev.to
How We Got Local MCP Servers Working in Claude Cowork (The Missing Guide)
Dev.to
How Should Students Document AI Usage in Academic Work?
Dev.to
I asked my AI agent to design a product launch image. Here's what came back.
Dev.to