ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

arXiv cs.CL · March 26, 2026


Key Points

  • The paper argues that conventional knowledge tracing (KT) systems mainly predict whether students answer correctly and lack diagnostic insight into the specific conceptual gaps causing mistakes.
  • It introduces a new task, concept-level deficiency prediction, to forecast which concepts a student is likely to struggle with in future problems.
  • The authors present ConceptKT, a knowledge tracing dataset with annotations covering both the required concepts for each question and the missing concepts implied by incorrect responses.
  • Experiments study in-context learning for KT, evaluating multiple Large Language Models (LLMs) and Large Reasoning Models (LRMs) on both correctness prediction and concept-level diagnosis, and comparing strategies for selecting informative historical records.
  • Results indicate that history selection using conceptual alignment and semantic similarity improves performance for both correctness prediction and deficiency identification.

Abstract

Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.
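The history-selection idea described above can be illustrated with a small sketch. The paper does not specify its scoring function, so the blend below of concept overlap (Jaccard) with bag-of-words cosine similarity, the record fields (`question`, `concepts`), and the weight `alpha` are all illustrative assumptions, not the authors' method:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over bag-of-words token counts.
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_history(target: dict, history: list[dict], k: int = 3,
                   alpha: float = 0.5) -> list[dict]:
    """Rank past response records for inclusion in an in-context prompt.

    Each record is scored by a blend of concept overlap with the target
    question (Jaccard over concept labels) and textual similarity
    (cosine over token counts); the top-k records are returned.
    `alpha` weights conceptual alignment against semantic similarity.
    """
    t_concepts = set(target["concepts"])
    t_tokens = Counter(target["question"].lower().split())

    def score(rec: dict) -> float:
        r_concepts = set(rec["concepts"])
        union = t_concepts | r_concepts
        jaccard = len(t_concepts & r_concepts) / len(union) if union else 0.0
        sim = cosine(t_tokens, Counter(rec["question"].lower().split()))
        return alpha * jaccard + (1 - alpha) * sim

    return sorted(history, key=score, reverse=True)[:k]

# Hypothetical usage: pick the two most relevant past responses.
history = [
    {"question": "solve the linear equation 2x+3=7",
     "concepts": ["linear_equations"], "correct": True},
    {"question": "area of a circle with radius 3",
     "concepts": ["geometry"], "correct": False},
    {"question": "solve 5x-2=8 for x",
     "concepts": ["linear_equations"], "correct": False},
]
target = {"question": "solve the linear equation 4x+1=9",
          "concepts": ["linear_equations"]}
selected = select_history(target, history, k=2)
```

In a real system the bag-of-words cosine would typically be replaced by dense sentence embeddings, but the selection logic is the same: prefer records that share concepts and wording with the upcoming question.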