Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory

arXiv cs.CL / 3/27/2026


Key Points

  • The paper argues that common LLM confidence calibration metrics (e.g., ECE, Brier score) conflate two abilities: Type-1 sensitivity (how much the model knows) and Type-2 metacognitive sensitivity (how well it knows what it knows).
  • It proposes an evaluation framework using Type-2 Signal Detection Theory, introducing meta-d' and an M-ratio to separately measure metacognitive capacity and metacognitive efficiency.
  • Experiments on four LLMs across 224,000 factual QA trials show large differences in metacognitive efficiency even when Type-1 sensitivity is similar: Mistral achieves the highest d' but the lowest M-ratio.
  • The study finds metacognitive efficiency is domain-specific: different models have different weakest domains, which aggregate metrics hide. Temperature changes shift the Type-2 criterion while meta-d' stays stable for two of the four models, so confidence policy can move independently of underlying metacognitive capacity.
  • It reports that AUROC_2 and M-ratio can produce fully inverted model rankings, suggesting these metrics answer fundamentally different evaluation questions, with implications for model selection and deployment.
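
To make the last point concrete, here is a minimal Python sketch of the two quantities being compared. It is an illustration under stated assumptions, not the paper's code: the function names `auroc2` and `m_ratio` and the toy data are hypothetical, AUROC_2 is computed with the standard pairwise (Mann-Whitney) estimator, and meta-d' is taken as an already-fitted input because estimating it requires a model fit that is not reproduced here.

```python
from typing import Sequence


def auroc2(correct: Sequence[bool], confidence: Sequence[float]) -> float:
    """Type-2 AUROC: probability that a correct answer receives higher
    confidence than an incorrect one (ties count as one half)."""
    conf_correct = [c for ok, c in zip(correct, confidence) if ok]
    conf_wrong = [c for ok, c in zip(correct, confidence) if not ok]
    if not conf_correct or not conf_wrong:
        raise ValueError("need at least one correct and one incorrect trial")

    wins = 0.0
    for c_ok in conf_correct:
        for c_bad in conf_wrong:
            if c_ok > c_bad:
                wins += 1.0
            elif c_ok == c_bad:
                wins += 0.5
    return wins / (len(conf_correct) * len(conf_wrong))


def m_ratio(meta_d_prime: float, d_prime: float) -> float:
    """Metacognitive efficiency: meta-d' expressed as a fraction of d'.
    Both inputs are assumed to have been estimated elsewhere."""
    if d_prime == 0:
        raise ValueError("M-ratio is undefined when d' is zero")
    return meta_d_prime / d_prime


if __name__ == "__main__":
    # Toy trials (hypothetical): whether each answer was right, and the
    # confidence the model reported for it.
    correct = [True, True, False, True, False, False]
    confidence = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(f"AUROC_2 = {auroc2(correct, confidence):.3f}")
    # Hypothetical fitted values, for illustration only.
    print(f"M-ratio = {m_ratio(meta_d_prime=0.5, d_prime=1.0):.2f}")
```

AUROC_2 measures how well confidence separates correct from incorrect answers in absolute terms, so it tends to rise with Type-1 sensitivity. M-ratio divides that sensitivity out, which is one way the two metrics can rank the same models in opposite orders.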

Abstract

Standard evaluation of LLM confidence relies on calibration metrics (ECE, Brier score) that conflate two distinct capacities: how much a model knows (Type-1 sensitivity) and how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce an evaluation framework based on Type-2 Signal Detection Theory that decomposes these capacities using meta-d' and the metacognitive efficiency ratio M-ratio. Applied to four LLMs (Llama-3-8B-Instruct, Mistral-7B-Instruct-v0.3, Llama-3-8B-Base, Gemma-2-9B-Instruct) across 224,000 factual QA trials, we find: (1) metacognitive efficiency varies substantially across models even when Type-1 sensitivity is similar -- Mistral achieves the highest d' but the lowest M-ratio; (2) metacognitive efficiency is domain-specific, with different models showing different weakest domains, invisible to aggregate metrics; (3) temperature manipulation shifts Type-2 criterion while meta-d' remains stable for two of four models, dissociating confidence policy from metacognitive capacity; (4) AUROC_2 and M-ratio produce fully inverted model rankings, demonstrating these metrics answer fundamentally different evaluation questions. The meta-d' framework reveals which models "know what they don't know" versus which merely appear well-calibrated due to criterion placement -- a distinction with direct implications for model selection, deployment, and human-AI collaboration. Pre-registered analysis; code and data publicly available.
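
For readers unfamiliar with the notation, the quantities named in the abstract have the following textbook signal-detection definitions. This is a sketch of the standard forms only; how the paper maps factual QA trials onto hit and false-alarm rates, and how it fits meta-d', is not stated in this summary, so the symbols HR and FAR below are assumptions.

```latex
% z(.) is the inverse of the standard normal CDF.
% HR and FAR are the Type-1 hit rate and false-alarm rate (assumed notation).
\begin{align}
  d' &= z(\mathrm{HR}) - z(\mathrm{FAR}) \\
  \text{M-ratio} &= \frac{\text{meta-}d'}{d'}
\end{align}
```

Here meta-d' is the Type-1 sensitivity that an ideal observer would need in order to produce the observed confidence (Type-2) data. An M-ratio of 1 means confidence reflects all the information available to the Type-1 decision; values below 1 indicate that some of that information is lost when the model reports confidence.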