The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

arXiv cs.CL / 4/20/2026

Key Points

  • The paper introduces the “Metacognitive Monitoring Battery,” a cross-domain benchmark that measures monitoring-control coupling in LLMs. It is grounded in the Nelson and Narens (1990) metacognitive framework and applies human psychometric methods to LLM evaluation.
  • The benchmark comprises 524 items spanning six cognitive domains. Tasks T1–T5 were pre-registered; T6 was added as an exploratory extension. After each forced-choice answer, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline.
  • Its key metric, the “withdraw delta,” is the difference in withdrawal rate between incorrect and correct items. It distinguishes three behavioral profiles: blanket confidence, blanket withdrawal, and selective sensitivity.
  • Experiments on 20 frontier LLMs (10,480 evaluations) show that accuracy rank and metacognitive sensitivity rank are largely inverted, and that retrospective monitoring and prospective regulation appear dissociable (r = .17, with a wide confidence interval at n = 20).
  • Scaling on metacognitive calibration is architecture-dependent: decreasing for Qwen, increasing for GPT-5.4, and flat for Gemma. The authors report structural convergence with an independent Type-2 SDT approach as preliminary cross-method construct validity, and release all items, data, and code.

Abstract

We introduce a cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework and applying human psychometric methodology to LLM evaluation. The battery comprises 524 items across six cognitive domains (learning, metacognitive calibration, social cognition, attention, executive function, prospective regulation), each grounded in an established experimental paradigm. Tasks T1-T5 were pre-registered on OSF prior to data collection; T6 was added as an exploratory extension. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. The critical metric is the withdraw delta: the difference in withdrawal rate between incorrect and correct items. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson-Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Retrospective monitoring and prospective regulation appear dissociable (r = .17, 95% CI wide given n=20; exemplar-based evidence is the primary support). Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). Behavioural findings converge structurally with an independent Type-2 SDT approach, providing preliminary cross-method construct validity. All items, data, and code: https://github.com/synthiumjp/metacognitive-monitoring-battery.
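
The withdraw delta defined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' released code: the `Trial` record, the function names, and the thresholds in `classify_profile` are hypothetical choices made for illustration, and the paper may separate the three profiles differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    """One item: the forced-choice outcome and the KEEP/WITHDRAW probe response."""
    correct: bool   # was the forced-choice answer correct?
    withdrew: bool  # did the model choose WITHDRAW rather than KEEP?


def withdraw_rate(trials: List[Trial]) -> float:
    """Fraction of trials on which the answer was withdrawn (NaN if no trials)."""
    if not trials:
        return float("nan")
    return sum(t.withdrew for t in trials) / len(trials)


def withdraw_delta(trials: List[Trial]) -> float:
    """Withdrawal rate on incorrect items minus withdrawal rate on correct items."""
    incorrect = [t for t in trials if not t.correct]
    correct = [t for t in trials if t.correct]
    return withdraw_rate(incorrect) - withdraw_rate(correct)


def classify_profile(trials: List[Trial],
                     min_delta: float = 0.10,  # hypothetical threshold
                     low: float = 0.05,        # hypothetical threshold
                     high: float = 0.95) -> str:  # hypothetical threshold
    """Assign one of the three profile labels using illustrative cut-offs."""
    delta = withdraw_delta(trials)
    overall = withdraw_rate(trials)
    if delta >= min_delta:
        return "selective sensitivity"  # withdraws mostly when wrong
    if overall <= low:
        return "blanket confidence"     # almost never withdraws
    if overall >= high:
        return "blanket withdrawal"     # almost always withdraws
    return "unclassified"


# Toy data (made up for illustration): 6 correct items with 1 withdrawal,
# 4 incorrect items with 3 withdrawals.
toy = ([Trial(True, False)] * 5 + [Trial(True, True)]
       + [Trial(False, True)] * 3 + [Trial(False, False)])
print(round(withdraw_delta(toy), 3), classify_profile(toy))
```

A positive delta means the model withdraws more often when it is wrong than when it is right, which is the selective pattern. A delta near zero is compatible with either blanket profile, so the sketch falls back on the overall withdrawal rate to tell them apart.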