LLMs Do Not Grade Essays Like Humans

arXiv cs.CL / 3/26/2026


Key Points

  • The paper evaluates whether LLMs can match human essay grades, finding that overall agreement between model-generated scores and human ratings is relatively weak and varies by essay characteristics.
  • LLMs show systematic bias versus human raters, tending to give higher scores to short or underdeveloped essays and lower scores to longer essays that have minor grammar or spelling errors.
  • The study finds that LLM scores are internally consistent with the feedback they produce: essays receiving more praise are scored higher, while essays receiving more criticism are scored lower.
  • The authors conclude that even with coherent scoring/feedback patterns, LLMs rely on different signals than humans, limiting alignment with human grading practices.
  • Despite limited alignment with human scores, the paper suggests that LLM-generated feedback can still reliably support automated essay scoring workflows.

Abstract

Large language models have recently been proposed as tools for automated essay scoring, but their agreement with human grading remains unclear. In this work, we evaluate how LLM-generated scores compare with human grades and analyze the grading behavior of several models from the GPT and Llama families in an out-of-the-box setting, without task-specific training. Our results show that agreement between LLM and human scores remains relatively weak and varies with essay characteristics. In particular, compared to human raters, LLMs tend to assign higher scores to short or underdeveloped essays, while assigning lower scores to longer essays that contain minor grammatical or spelling errors. We also find that the scores generated by LLMs are generally consistent with the feedback they generate: essays receiving more praise tend to receive higher scores, while essays receiving more criticism tend to receive lower scores. These results suggest that LLM-generated scores and feedback follow coherent patterns but rely on signals that differ from those used by human raters, resulting in limited alignment with human grading practices. Nevertheless, our work shows that LLMs produce feedback that is consistent with their grading and that they can be reliably used in supporting essay scoring.
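Agreement between model scores and human ratings in automated essay scoring is commonly reported with quadratic weighted kappa (QWK), which penalizes large score disagreements more than small ones. The paper does not specify its metric, so the following is a minimal, self-contained sketch of QWK for integer essay scores; the 1-6 rubric range is a hypothetical example, not taken from the paper.

```python
from collections import Counter

def quadratic_weighted_kappa(human, llm, min_score=1, max_score=6):
    """Quadratic weighted kappa between two lists of integer scores.

    The score range (1-6 here) is a hypothetical rubric, chosen
    only for illustration. Returns 1.0 for perfect agreement,
    ~0.0 for chance-level agreement, negative for systematic
    disagreement.
    """
    n = max_score - min_score + 1
    total = len(human)

    # Observed joint distribution of (human, llm) score pairs.
    observed = [[0] * n for _ in range(n)]
    for h, l in zip(human, llm):
        observed[h - min_score][l - min_score] += 1

    # Marginal histograms, used to build the chance-expected matrix.
    hist_h = Counter(h - min_score for h in human)
    hist_l = Counter(l - min_score for l in llm)

    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic penalty
            expected = hist_h[i] * hist_l[j] / total
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den
```

A weak LLM-human agreement of the kind the paper reports would show up as a QWK well below the ~0.7-0.8 range typically achieved by trained human raters; identical score lists yield exactly 1.0.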