MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models

arXiv cs.AI / 3/31/2026

💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper introduces MonitorBench, an open-source benchmark designed to evaluate chain-of-thought (CoT) monitorability in large language models (LLMs) when CoTs may not reflect decision-critical factors behind final answers.
  • MonitorBench includes 1,514 carefully constructed test instances across 19 tasks grouped into 7 categories, targeting conditions under which CoTs can serve as reliable monitors of LLM decision factors.
  • Experiments across multiple popular LLMs find that monitorability tends to be higher when producing the final response requires structural reasoning over the decision-critical factors.
  • The study reports that closed-source models generally achieve lower monitorability and that monitorability can negatively correlate with model capability.
  • Using two stress-test settings, the authors show that both open- and closed-source LLMs can deliberately degrade monitorability, with decreases up to ~30% in tasks that don’t rely on structural reasoning over decision-critical factors.

Abstract

Large language models (LLMs) can generate chains of thought (CoTs) that are not always causally responsible for their final outputs. When such a mismatch occurs, the CoT no longer faithfully reflects the decision-critical factors driving the model's behavior, leading to the reduced CoT monitorability problem. However, a comprehensive and fully open-source benchmark for studying CoT monitorability remains lacking. To address this gap, we propose MonitorBench, a systematic benchmark for evaluating CoT monitorability in LLMs. MonitorBench provides: (1) a diverse set of 1,514 test instances with carefully designed decision-critical factors across 19 tasks spanning 7 categories to characterize when CoTs can be used to monitor the factors driving LLM behavior; and (2) two stress-test settings to quantify the extent to which CoT monitorability can be degraded. Extensive experiments across multiple popular LLMs with varying capabilities show that CoT monitorability is higher when producing the final target response requires structural reasoning through the decision-critical factor. Closed-source LLMs generally show lower monitorability, and there exists a negative relationship between monitorability and model capability. Moreover, both open- and closed-source LLMs can intentionally reduce monitorability under stress-tests, with monitorability dropping by up to 30% in some tasks that do not require structural reasoning over the decision-critical factors. Beyond these empirical insights, MonitorBench provides a basis for further research on evaluating future LLMs, studying advanced stress-test monitorability techniques, and developing new monitoring approaches.