From Test-taking to Cognitive Scaffolding: A Pedagogical Diagnostic Benchmark for LLMs on English Standardized Tests
arXiv cs.CL / 5/1/2026
💬 Opinion · Models & Research
Key Points
- The paper argues that current LLM evaluations on standardized tests measure little beyond binary correctness, which is insufficient for educational tutoring, where faithful reasoning and misconception diagnosis are required.
- It introduces a pedagogical diagnostic framework that models English Standardized Test (EST) problem-solving as navigation through a cognitive framework, enabling finer-grained diagnostics.
- Based on this framework, the authors release ESTBook, a multimodal benchmark with 10,576 questions across 29 task types spanning five major exams.
- ESTBook is designed to capture not just answers but also formalized reasoning trajectories and distractor rationales reflecting specific cognitive traps, supporting guided elicitation and performance-gap mitigation.
- Experiments reported in the study indicate that identifying cognitive trajectories improves pedagogical reasoning and helps reduce performance gaps in LLMs.
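
To make the difference between binary scoring and diagnostic scoring concrete, here is a minimal sketch in Python. It is not ESTBook's actual schema or evaluation code: the `DiagnosticItem` class, its field names, the `diagnose` function, and the toy vocabulary question are all hypothetical, invented here only to illustrate the idea of pairing each distractor with the cognitive trap it represents.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DiagnosticItem:
    """Hypothetical record; field names are illustrative, not ESTBook's schema."""
    question: str
    options: dict[str, str]
    answer: str
    # Ordered reasoning steps that lead to the correct answer.
    trajectory: list[str] = field(default_factory=list)
    # Maps each wrong option to the cognitive trap it is meant to exploit.
    distractor_rationales: dict[str, str] = field(default_factory=dict)


def diagnose(item: DiagnosticItem, chosen: str) -> dict[str, object]:
    """Return correctness plus the trap (if any) behind a wrong choice."""
    correct = chosen == item.answer
    trap: Optional[str] = None if correct else item.distractor_rationales.get(chosen)
    return {"correct": correct, "trap": trap}


# Toy item, invented for illustration only.
item = DiagnosticItem(
    question="Choose the word closest in meaning to 'scarce'.",
    options={"A": "rare", "B": "frightened", "C": "plentiful"},
    answer="A",
    trajectory=["recall the meaning of 'scarce'", "match it to a synonym"],
    distractor_rationales={
        "B": "confuses 'scarce' with the look-alike word 'scared'",
        "C": "selects the antonym",
    },
)

print(diagnose(item, "C"))
```

A binary scorer would report only that option C is wrong; the sketch also returns the labeled trap, which is the kind of signal a tutoring system could act on.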