Concept frustration: Aligning human concepts and machine representations

arXiv stat.ML / 4/1/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a geometric framework for comparing human-interpretable supervised concepts with unsupervised intermediate representations extracted from foundation model embeddings.
  • It formalises "concept frustration" as the situation in which an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology.
  • It shows that task-aligned similarity measures can detect concept frustration that conventional comparisons such as Euclidean distance fail to capture (see the sketch after this list).
  • Under a linear-Gaussian generative model, the Bayes-optimal accuracy of a concept-based classifier is decomposed into known-known, known-unknown and unknown-unknown contributions, showing analytically where frustration affects performance.
  • Experiments on synthetic data and real language and vision data show that frustration is detectable in foundation model representations, and that incorporating a frustrating concept into an interpretable model reorganises the learned concept geometry, which can improve the alignment of human and machine reasoning.
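The third bullet contrasts plain Euclidean/cosine comparisons with a task-aligned similarity. The paper's exact measure is not reproduced here; the sketch below is a hedged illustration of the kind of comparison involved: supervised concept directions (stand-ins for linear probes) are compared with unsupervised principal directions of foundation-model embeddings, once with plain cosine similarity and once after re-weighting coordinates by a task direction. All names, the random stand-in data, and the specific weighting scheme are illustrative assumptions, not the paper's definitions.

```python
# Hypothetical sketch: comparing supervised concept directions with unsupervised
# directions from embeddings, under plain cosine vs. a "task-aligned" similarity.
import numpy as np

rng = np.random.default_rng(0)
d = 64                                   # embedding dimension
X = rng.normal(size=(500, d))            # stand-in for foundation-model embeddings
w_task = rng.normal(size=d)              # stand-in for a downstream task direction

# Supervised "concept" directions: in practice linear probes fit on concept labels;
# random unit vectors are used here as stand-ins.
concept_dirs = rng.normal(size=(3, d))
concept_dirs /= np.linalg.norm(concept_dirs, axis=1, keepdims=True)

# Unsupervised directions: top principal components of the (centred) embeddings.
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
unsup_dirs = Vt[:3]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def task_aligned_similarity(a, b, w):
    """Compare directions after re-weighting coordinates by task relevance |w|
    (one simple choice of task alignment; not the paper's measure)."""
    m = np.abs(w) / np.linalg.norm(w)
    return cosine(a * m, b * m)

for i, c in enumerate(concept_dirs):
    for j, u in enumerate(unsup_dirs):
        print(f"concept {i} vs component {j}: "
              f"cosine={cosine(c, u):+.3f}, "
              f"task-aligned={task_aligned_similarity(c, u, w_task):+.3f}")
```

The point of the contrast is that two directions can look unrelated (or related) in raw embedding geometry yet behave differently once only task-relevant coordinates are emphasised, which is where frustration is claimed to become visible.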

Abstract

Aligning human-interpretable concepts with the internal representations learned by modern machine learning systems remains a central challenge for interpretable AI. We introduce a geometric framework for comparing supervised human concepts with unsupervised intermediate representations extracted from foundation model embeddings. Motivated by the role of conceptual leaps in scientific discovery, we formalise the notion of concept frustration: a contradiction that arises when an unobserved concept induces relationships between known concepts that cannot be made consistent within an existing ontology. We develop task-aligned similarity measures that detect concept frustration between supervised concept-based models and unsupervised representations derived from foundation models, and show that the phenomenon is detectable in task-aligned geometry while conventional Euclidean comparisons fail. Under a linear-Gaussian generative model we derive a closed-form expression for Bayes-optimal concept-based classifier accuracy, decomposing predictive signal into known-known, known-unknown and unknown-unknown contributions and identifying analytically where frustration affects performance. Experiments on synthetic data and real language and vision tasks demonstrate that frustration can be detected in foundation model representations and that incorporating a frustrating concept into an interpretable model reorganises the geometry of learned concept representations to better align human and machine reasoning. These results suggest a principled framework for diagnosing incomplete concept ontologies and aligning human and machine conceptual reasoning, with implications for the development and validation of safe interpretable AI for high-risk applications.
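The abstract's closed-form expression is not reproduced in this summary. As a hedged illustration of the kind of decomposition described, consider the standard two-class linear-Gaussian setting with equal priors and shared covariance, where Bayes-optimal accuracy depends on the Mahalanobis separation and that quadratic form splits into block contributions from known and unknown concept coordinates. The notation below is illustrative, not the paper's:

```latex
\mathrm{Acc}^{\star} = \Phi\!\left(\tfrac{\Delta}{2}\right), \qquad
\Delta^{2} = \delta^{\top} \Sigma^{-1} \delta
  = \underbrace{\delta_{K}^{\top} A_{KK}\, \delta_{K}}_{\text{known-known}}
  + \underbrace{2\, \delta_{K}^{\top} A_{KU}\, \delta_{U}}_{\text{known-unknown}}
  + \underbrace{\delta_{U}^{\top} A_{UU}\, \delta_{U}}_{\text{unknown-unknown}}
```

Here $\delta = \mu_{1} - \mu_{0}$ is the class-mean difference, $A = \Sigma^{-1}$ is written in blocks over known ($K$) and unknown ($U$) concept coordinates, and $\Phi$ is the standard normal CDF. Under such a reading, frustration would show up through the cross term: an unknown concept whose coupling $A_{KU}$ forces inconsistent relationships among the known coordinates.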