From Test-taking to Cognitive Scaffolding: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests

arXiv cs.CL / 5/1/2026

💬 OpinionModels & Research

共有:

Key Points

The paper argues that current LLM evaluations on standardized tests mainly measure only binary correctness, which is insufficient for educational tutoring that needs faithful reasoning and misconception diagnosis.
It introduces a pedagogical diagnostic framework that represents English Standardized Test (EST) problem-solving as navigation through a cognitive framework, enabling more fine-grained diagnostics.
Based on this framework, the authors release ESTBook, a multimodal benchmark with 10,576 questions across 29 task types spanning five major exams.
ESTBook is designed to capture not just answers but also formalized reasoning trajectories and distractor rationales reflecting specific cognitive traps, supporting guided elicitation and performance-gap mitigation.
Experiments reported in the study indicate that identifying cognitive trajectories improves pedagogical reasoning and helps reduce performance gaps in LLMs.

Abstract

As large language models (LLMs) are increasingly integrated into educational tools, current evaluations on standardized tests predominantly focus on binary outcome accuracy. Instead, an effective AI tutor must exhibit faithful reasoning, elucidate solution strategies, and diagnose specific human misconceptions. To bridge this gap, we introduce a pedagogical diagnostic framework that models English Standardized Test (EST) problem-solving as a traversal through a cognitive framework. Based on this framework, we present ESTBook, a multimodal benchmark encompassing 10,576 questions and 29 task types across five major exams. Unlike traditional datasets, ESTBook goes beyond data aggregation by enriching questions with formalized reasoning trajectories and distractor rationales that capture specific cognitive traps. Through extensive evaluations, we empirically demonstrate the practical utility of our diagnostic framework, showing that identifying cognitive trajectories facilitates the mitigation of performance gap and improves pedagogical reasoning through guided elicitation.

Why Autonomous Coding Agents Keep Failing — And What Actually Works

Dev.to

Mistral's new flagship Medium 3.5 folds chat, reasoning, and code into one model

THE DECODER

Qualcomm teases ‘dedicated CPU for agentic experiences’ and ‘agentic smartphones’

The Register

Finetuning Dataset: Claude Opus 4.6/4.7 - 8.7k Chats

Reddit r/LocalLLaMA

Phosphene local video and audio generation for Apple Silicon open source (LTX 2.3) [P]

Reddit r/MachineLearning

From Test-taking to Cognitive Scaffolding: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests

Key Points

Abstract

Related Articles

Why Autonomous Coding Agents Keep Failing — And What Actually Works

Mistral's new flagship Medium 3.5 folds chat, reasoning, and code into one model

Qualcomm teases ‘dedicated CPU for agentic experiences’ and ‘agentic smartphones’

Finetuning Dataset: Claude Opus 4.6/4.7 - 8.7k Chats

Phosphene local video and audio generation for Apple Silicon open source (LTX 2.3) [P]

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer