Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities
arXiv cs.AI / April 15, 2026
Key Points
- The paper argues that conventional LLM evaluation, which collapses performance into a single aggregate score, hides important task-specific variation in model abilities.
- It proposes a cognitive diagnostic framework using multidimensional Item Response Theory (MIRT) to estimate fine-grained ability levels via an item–ability association matrix (a minimal model sketch follows this list).
- For mathematics, the authors build a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge, enabling prediction of performance on unseen benchmark questions.
- Experiments across 41 models show strong criterion validity and robust predictive performance (AUC roughly 0.80–0.89 within benchmarks and 0.77–0.86 across benchmarks), outperforming trivial baselines; see the evaluation sketch below.
- The framework generalizes across domains—physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions)—and is positioned for applications like targeted training, ability-guided model selection, and benchmark design.
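The summary does not give the paper's exact parameterization, but a standard way to couple an item–ability association matrix (a Q-matrix) with multidimensional IRT is a compensatory multidimensional 2PL model, where an item's success probability depends only on the abilities the Q-matrix links to it. Below is a minimal NumPy sketch under that assumption; all names (`theta`, `Q`, `a`, `b`) and the toy dimensions are illustrative (the paper's math taxonomy would use 35 ability columns):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_correct(theta, Q, a, b):
    """Compensatory multidimensional 2PL with a Q-matrix (illustrative sketch).

    theta : (n_models, n_abilities)  latent ability levels per model
    Q     : (n_items, n_abilities)   binary item-ability association matrix
    a     : (n_items, n_abilities)   discrimination (loading) parameters
    b     : (n_items,)               item difficulty
    Returns an (n_models, n_items) matrix of success probabilities.
    """
    # Q * a zeroes out loadings on abilities the Q-matrix says the item
    # does not test, so only linked abilities enter the logit.
    logits = theta @ (Q * a).T - b
    return sigmoid(logits)

# Toy example: 3 models, 4 abilities, 2 items.
rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))
Q = np.array([[1, 1, 0, 0],    # item 1 tests abilities 1 and 2
              [0, 0, 1, 1]])   # item 2 tests abilities 3 and 4
a = np.abs(rng.normal(size=(2, 4)))
b = np.array([0.2, -0.5])
print(p_correct(theta, Q, a, b))  # (3, 2) predicted probabilities
```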
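Likewise, the within- and cross-benchmark AUC figures imply a held-out validation loop: estimate abilities on calibration items, predict correctness on unseen items, and score the predictions with ROC-AUC. A toy sketch of that scoring step, with made-up outcomes and predicted probabilities standing in for real (model, item) pairs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical held-out evaluation: y_true are observed 0/1 outcomes for
# (model, item) pairs on an unseen benchmark; y_pred are the corresponding
# success probabilities from p_correct() above, flattened over those pairs.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([0.9, 0.3, 0.7, 0.6, 0.4, 0.2, 0.8, 0.5])
print(f"AUC = {roc_auc_score(y_true, y_pred):.3f}")  # AUC = 1.000 on this toy data
```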