Leakage and Interpretability in Concept-Based Models

arXiv stat.ML / 3/25/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • Concept-based models aim to improve interpretability by predicting intermediate, human-understandable concepts, but information leakage embedded in the learned concept representations can undermine that interpretability.
  • The paper introduces an information-theoretic framework that defines two quantitative metrics—concepts-task leakage (CTL) and interconcept leakage (ICL)—to rigorously characterize and measure leakage.
  • The CTL and ICL scores are shown to strongly predict how models will behave under interventions and to outperform existing leakage-related measures.
  • The authors identify the primary causes of leakage and, in a case study of Concept Embedding Models, find additional leakage modes—interconcept and alignment leakage—beyond the concepts-task leakage present by design.
  • The paper concludes with practical design guidelines intended to reduce leakage and maintain interpretability in concept-based model architectures.
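The paper's exact CTL and ICL definitions are not reproduced here, but the underlying information-theoretic idea can be sketched: a learned soft concept "leaks" when it carries more mutual information about the task label than the ground-truth concept it is supposed to represent. The toy example below (all variable names, the data-generating process, and the use of scikit-learn's kNN-based mutual information estimator are illustrative assumptions, not the paper's method) contrasts a clean soft concept with a leaky one whose magnitude secretly encodes label noise:

```python
# Hypothetical illustration of concept->task leakage as excess mutual
# information. This is NOT the paper's CTL/ICL definition, only a sketch
# of the information-theoretic intuition behind such a score.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 5000

# Ground-truth binary concept, and a task label equal to the concept
# except for 10% random flips.
concept = rng.integers(0, 2, size=n)
task = concept ^ (rng.random(n) < 0.1)

# "Leaky" soft concept: its magnitude encodes whether the label was
# flipped, so it carries task information beyond the concept itself.
noise_flag = (task != concept).astype(float)
leaky_soft = concept + 0.3 * noise_flag + 0.01 * rng.normal(size=n)

# "Clean" soft concept: the concept plus independent noise only.
clean_soft = concept + 0.01 * rng.normal(size=n)

def mi_with_task(soft: np.ndarray) -> float:
    """Estimate I(soft concept; task) in nats via kNN estimation."""
    return mutual_info_classif(soft.reshape(-1, 1), task, random_state=0)[0]

mi_leaky = mi_with_task(leaky_soft)
mi_clean = mi_with_task(clean_soft)
mi_binary = mutual_info_classif(
    concept.reshape(-1, 1), task, discrete_features=True, random_state=0
)[0]

print(f"I(clean soft; task)  ~ {mi_clean:.3f} nats")
print(f"I(leaky soft; task)  ~ {mi_leaky:.3f} nats")
print(f"I(binary concept; task) ~ {mi_binary:.3f} nats")
# The leaky soft concept's MI with the task exceeds that of the binary
# concept, flagging leakage; the clean one matches it.
```

Under this sketch, a leakage score would compare the estimated mutual information of the learned representation against that of the intended concept; the paper formalizes such comparisons as the CTL and ICL scores.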

Abstract

Concept-based Models aim to improve interpretability by predicting high-level intermediate concepts, representing a promising approach for deployment in high-risk scenarios. However, they are known to suffer from information leakage, whereby models exploit unintended information encoded within the learned concepts. We introduce an information-theoretic framework to rigorously characterise and quantify leakage, and define two complementary measures: the concepts-task leakage (CTL) and interconcept leakage (ICL) scores. We show that these measures are strongly predictive of model behaviour under interventions and outperform existing alternatives. Using this framework, we identify the primary causes of leakage and, as a case study, analyse how it manifests in Concept Embedding Models, revealing interconcept and alignment leakage in addition to the concepts-task leakage present by design. Finally, we present a set of practical guidelines for designing concept-based models to reduce leakage and ensure interpretability.