MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models

arXiv cs.LG / 4/23/2026


Key Points

  • The paper introduces MIRROR, a benchmark with eight experiments across four metacognitive levels to test whether large language models (LLMs) can use self-knowledge to improve decision-making.
  • Across roughly 250,000 evaluation instances covering 16 models from 8 labs, the authors find a consistent failure of compositional self-prediction on multi-domain tasks, with Compositional Calibration Error ranging from 0.500 to 0.943 (Exp3-v1) and 0.434 to 0.758 (Exp3-v2).
  • While models show above-chance but imperfect domain-specific self-knowledge, they still systematically fail to convert that partial awareness into correct agentic action selection.
  • External metacognitive control markedly reduces confident failures (from 0.600 to 0.143), whereas providing models with their own calibration scores yields no statistically significant improvement (p > 0.05), suggesting that external architectural scaffolding, rather than improved self-knowledge, is the key lever.
  • The authors plan to publicly release the code, data, and Croissant metadata for the benchmark.

Abstract

We introduce MIRROR, a benchmark comprising eight experiments across four metacognitive levels that evaluates whether large language models can use self-knowledge to make better decisions. We evaluate 16 models from 8 labs across approximately 250,000 evaluation instances using five independent behavioral measurement channels. Core experiments are run across the full model roster; experiments with specialized infrastructure requirements report explicitly marked model subsets. We find two phenomena with direct implications for agentic deployment: (1) compositional self-prediction fails universally -- the Compositional Calibration Error ranges from 0.500 to 0.943 on the original 15-model Exp3-v1 set (and 0.434 to 0.758 on the balanced 16-model Exp3-v2 expansion), indicating that models cannot predict their own performance on multi-domain tasks, and (2) models exhibit above-chance but imperfect domain-specific self-knowledge yet systematically fail to translate even this partial awareness into appropriate agentic action-selection -- external metacognitive control reduces the Confident Failure Rate from 0.600 to 0.143 (76% reduction at temperature 0; mean 70% at temperature 0.7 across 5 models from 4 labs). Providing models with their own calibration scores produces no significant improvement (p > 0.05); only architectural constraint is effective. This suggests that external metacognitive scaffolding -- not improved self-knowledge -- is the path to safer autonomous AI systems. Code, data, and Croissant metadata will be released publicly with the benchmark.
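The headline Confident Failure Rate drop (0.600 to 0.143) is reported as a 76% reduction; that figure follows directly as a relative reduction. The sketch below checks the arithmetic and includes a hypothetical illustration of how such a rate might be computed — the paper's exact metric definition and confidence threshold are not given here, so `confident_failure_rate` and its `threshold` parameter are assumptions, not the authors' implementation.

```python
def confident_failure_rate(confidences, correct, threshold=0.8):
    """Hypothetical CFR: fraction of high-confidence answers that are wrong.

    Assumes CFR = (# instances with confidence >= threshold AND incorrect)
                  / (# instances with confidence >= threshold).
    The threshold of 0.8 is illustrative, not from the paper.
    """
    is_confident = [c >= threshold for c in confidences]
    failures = sum(1 for conf, ok in zip(is_confident, correct) if conf and not ok)
    n_confident = sum(is_confident)
    return failures / n_confident if n_confident else 0.0

# Relative reduction for the reported figures: 0.600 -> 0.143.
baseline, with_control = 0.600, 0.143
reduction = (baseline - with_control) / baseline
print(f"{reduction:.0%}")  # → 76%
```

The 76% headline number is thus the temperature-0 relative reduction; at temperature 0.7 the paper reports a mean 70% reduction across 5 models from 4 labs.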