FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning

arXiv cs.AI / April 14, 2026


Key Points

  • Chain-of-Thought (CoT) can look convincing while using unfaithful intermediate steps, making existing self-evaluation methods unreliable due to coherence-bias effects.
  • FACT-E introduces a causality-inspired evaluation approach using controlled perturbations to more reliably measure intra-chain faithfulness (true step-to-step dependence).
  • The method selects more trustworthy reasoning trajectories by jointly optimizing intra-chain faithfulness and CoT-to-answer consistency.
  • Experiments on GSM8K, MATH, and CommonsenseQA show FACT-E improves the selection of reasoning trajectories and strengthens in-context learning exemplars.
  • FACT-E also demonstrates robustness by detecting flawed reasoning more reliably under noisy conditions, offering a sturdier metric for trustworthy LLM reasoning.
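The joint selection criterion above can be sketched in a few lines. The linear combination, the `alpha` weight, and the per-trajectory score fields are illustrative assumptions; the source only states that intra-chain faithfulness and CoT-to-answer consistency are jointly optimized, not how they are combined.

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    steps: list          # intermediate CoT steps
    faithfulness: float  # intra-chain faithfulness estimate in [0, 1]
    consistency: float   # CoT-to-answer consistency estimate in [0, 1]

def select_trajectory(trajectories, alpha=0.5):
    """Pick the chain with the best combined score.

    A weighted sum is one simple way to 'jointly consider' both
    criteria; the paper's actual objective may differ.
    """
    def score(t):
        return alpha * t.faithfulness + (1 - alpha) * t.consistency
    return max(trajectories, key=score)
```

For example, a chain that is internally faithful but inconsistent with its final answer (`faithfulness=0.9, consistency=0.2`) would lose to a chain that balances both (`0.7, 0.8`) under the default weighting.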

Abstract

Chain-of-Thought (CoT) prompting has improved LLM reasoning, but models often generate explanations that appear coherent while containing unfaithful intermediate steps. Existing self-evaluation approaches are prone to inherent biases: the model may confidently endorse coherence even when the step-to-step implication is not valid, leading to unreliable faithfulness evaluation. We propose FACT-E, a causality-inspired framework for evaluating CoT quality. FACT-E uses controlled perturbations as an instrumental signal to separate genuine step-to-step dependence from bias-driven artifacts, producing more reliable faithfulness estimates (intra-chain faithfulness). To select trustworthy trajectories, FACT-E jointly considers intra-chain faithfulness and CoT-to-answer consistency, ensuring that selected chains are both faithful internally and supportive of the correct final answer. Experiments on GSM8K, MATH, and CommonsenseQA show that FACT-E improves reasoning-trajectory selection and yields stronger in-context learning exemplars. FACT-E also reliably detects flawed reasoning under noisy conditions, providing a robust metric for trustworthy LLM reasoning.
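The perturbation idea can be illustrated with a minimal sketch: if altering step *i* under controlled conditions changes how the chain continues, the continuation genuinely depends on that step; if the continuation is unchanged, the apparent dependence may be a coherence artifact. The callables `regenerate` and `perturb` are hypothetical stand-ins for model calls and the paper's perturbation operator, and the sensitivity ratio is an assumed simplification of FACT-E's actual estimator.

```python
def intra_chain_faithfulness(steps, regenerate, perturb, n_trials=3):
    """Estimate step-to-step dependence via controlled perturbations.

    regenerate(prefix) -> continuation of the chain from a step prefix
    perturb(step)      -> a controlled alteration of one step
    Returns the fraction of perturbations that change the continuation,
    a proxy for genuine (rather than coherence-driven) dependence.
    """
    sensitive, total = 0, 0
    for i in range(len(steps) - 1):
        baseline = regenerate(steps[: i + 1])
        for _ in range(n_trials):
            altered = regenerate(steps[:i] + [perturb(steps[i])])
            total += 1
            if altered != baseline:
                sensitive += 1
    return sensitive / total if total else 0.0
```

A chain whose continuations never react to perturbations would score near 0 (its steps are decorative), while a chain whose every step shapes what follows would score near 1.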