Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities

arXiv cs.AI / 4/15/2026


Key Points

  • The paper argues that typical LLM evaluation that collapses performance into single aggregate scores hides important, task-specific variation in model abilities.
  • It proposes a cognitive diagnostic framework using multidimensional Item Response Theory (IRT) to estimate fine-grained ability levels via an item–ability association matrix (an illustrative formulation appears after this list).
  • For mathematics, the authors build a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge, enabling prediction of performance on unseen benchmark questions.
  • Experiments across 41 models show strong criterion validity and robust predictive performance (AUC ~0.80–0.89 within benchmarks, and ~0.77–0.86 across benchmarks), outperforming trivial baselines.
  • The framework generalizes across domains—physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions)—and is positioned for applications like targeted training, ability-guided model selection, and benchmark design.
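
For context, a standard compensatory multidimensional IRT model constrained by a Q-matrix (item–ability association matrix) can be written as follows; this is an illustrative form, and the paper's exact parameterization may differ:

$$ P(y_{ij}=1 \mid \boldsymbol{\theta}_i) = \sigma\!\Big(\textstyle\sum_{k=1}^{K} q_{jk}\, a_{jk}\, \theta_{ik} + d_j\Big), $$

where $\theta_{ik}$ is model $i$'s level on ability $k$ (e.g., $K = 35$ for the mathematics taxonomy), $q_{jk}\in\{0,1\}$ indicates whether item $j$ draws on ability $k$, $a_{jk}$ is a discrimination parameter, $d_j$ is an item intercept, and $\sigma$ is the logistic function.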

Abstract

Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting targeted model improvement and ability-guided selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item–ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (i.e., benchmark questions). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.
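
To make the unseen-item prediction step concrete, the sketch below scores items with a compensatory MIRT-style rule from ability vectors and a Q-matrix, then evaluates with AUC. All dimensions, parameters, and data are hypothetical and simulated; this is not the authors' code or data.

```python
# Minimal, hypothetical sketch: predict an LLM's success on unseen benchmark
# items from estimated fine-grained ability vectors (theta) under a
# compensatory MIRT-style model with a Q-matrix mask.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_models, n_items, n_abilities = 41, 200, 35   # e.g., the 35-D math taxonomy

theta = rng.normal(size=(n_models, n_abilities))          # estimated abilities
Q = rng.integers(0, 2, size=(n_items, n_abilities))       # item-ability associations
a = np.abs(rng.normal(size=(n_items, n_abilities))) * Q   # discriminations, masked by Q
d = rng.normal(size=n_items)                               # item intercepts

def predict_prob(theta, a, d):
    """P(correct) for every (model, item) pair under a compensatory MIRT model."""
    logits = theta @ a.T + d            # shape: (n_models, n_items)
    return 1.0 / (1.0 + np.exp(-logits))

probs = predict_prob(theta, a, d)

# Simulated outcomes on "unseen" items, scored with AUC as in the paper's metric.
outcomes = rng.binomial(1, probs)
print("AUC:", roc_auc_score(outcomes.ravel(), probs.ravel()))
```

In practice the ability vectors and item parameters would be estimated from responses on held-in benchmark items before scoring held-out ones; the simulation above only illustrates the scoring and evaluation mechanics.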