Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
arXiv cs.CL / 3/23/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The study demonstrates that faithfulness measurements of LLM chain-of-thought are not objective but depend heavily on the classifier used: on identical data, three classifiers report faithfulness rates of 74.4%, 82.6%, and 69.7%.
- Classifier disagreements are systematic rather than random: inter-classifier agreement ranges from slight to moderate (Cohen's kappa 0.06 to 0.42), and switching classifiers can reverse model rankings.
- The root cause is that the classifiers operationalize different faithfulness constructs at different stringency levels (from lexical mention to epistemic dependence), making faithfulness numbers incomparable across studies.
- The authors urge future evaluations to report sensitivity ranges across multiple classification methodologies rather than relying on a single point estimate of faithfulness.
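The recommendation in the last point can be sketched concretely. The snippet below uses entirely synthetic judgments (none of the numbers come from the paper) to show how per-classifier faithfulness rates, a sensitivity range, and pairwise Cohen's kappa would be reported together instead of a single point estimate.

```python
# Illustrative sketch: three hypothetical classifiers judge the same
# 10 chain-of-thought transcripts (1 = faithful, 0 = unfaithful).
# All data is synthetic, for illustration only.

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_observed - p_expected) / (1 - p_expected)

judgments = {
    "classifier_A": [1, 1, 1, 0, 1, 1, 0, 1, 1, 0],
    "classifier_B": [1, 0, 1, 0, 1, 1, 1, 1, 0, 0],
    "classifier_C": [1, 1, 0, 0, 1, 0, 0, 1, 1, 1],
}

# Point estimates differ classifier by classifier ...
rates = {name: sum(j) / len(j) for name, j in judgments.items()}

# ... so report the sensitivity range rather than one number.
low, high = min(rates.values()), max(rates.values())
print(f"faithfulness range: {low:.0%} - {high:.0%}")

# Pairwise agreement shows how systematic the disagreement is.
kappa_ab = cohens_kappa(judgments["classifier_A"], judgments["classifier_B"])
print(f"kappa(A, B) = {kappa_ab:.2f}")
```

Reporting the range (here 60% to 70%) alongside agreement statistics makes the classifier-dependence of the estimate visible to readers, which is the paper's core methodological point.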