Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing

arXiv cs.LG / 3/19/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The study shows that LLMs can exhibit motivated reasoning in which a biasing hint shifts the final answer and the CoT rationalizes the decision without acknowledging the hint.
It demonstrates that internal activation probes, trained on the model's residual stream, can predict motivated reasoning as well as or better than CoT-based monitors, both before and after CoT generation.
Pre-generation probes, applied before any CoT tokens are produced, can flag motivated behavior early, potentially avoiding unnecessary generation.
The experiments span multiple model families and datasets, supporting the generalizability of activation-based detection of motivated reasoning.

Abstract

Large language models (LLMs) can produce chains of thought (CoT) that do not accurately reflect the actual factors driving their answers. In multiple-choice settings with an injected hint favoring a particular option, models may shift their final answer toward the hinted option and produce a CoT that rationalizes the response without acknowledging the hint - an instance of motivated reasoning. We study this phenomenon across multiple LLM families and datasets demonstrating that motivated reasoning can be identified by probing internal activations even in cases when it cannot be easily determined from CoT. Using supervised probes trained on the model's residual stream, we show that (i) pre-generation probes, applied before any CoT tokens are generated, predict motivated reasoning as well as a LLM-based CoT monitor that accesses the full CoT trace, and (ii) post-generation probes, applied after CoT generation, outperform the same monitor. Together, these results show that motivated reasoning is detected more reliably from internal representations than from CoT monitoring. Moreover, pre-generation probing can flag motivated behavior early, potentially avoiding unnecessary generation.

💡 Insights using this article

This article is featured in our daily AI news digest — key takeaways and action items at a glance.

📅 3/19DailyView insight →

The Moonwell Oracle Exploit: How AI-Assisted 'Vibe Coding' Turned cbETH Into a $1.12 Token and Cost $1.78M

Dev.to

How CVE-2026-25253 exposed every OpenClaw user to RCE — and how to fix it in one command

Dev.to

Day 10: An AI Agent's Revenue Report — $29, 25 Products, 160 Tweets

Dev.to

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

Dev.to

What CVE-2026-25253 Taught Me About Building Safe AI Assistants

Dev.to

Catching rationalization in the act: detecting motivated reasoning before and after CoT via activation probing

Key Points

Abstract

💡 Insights using this article

Related Articles

The Moonwell Oracle Exploit: How AI-Assisted 'Vibe Coding' Turned cbETH Into a $1.12 Token and Cost $1.78M

How CVE-2026-25253 exposed every OpenClaw user to RCE — and how to fix it in one command

Day 10: An AI Agent's Revenue Report — $29, 25 Products, 160 Tweets

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

What CVE-2026-25253 Taught Me About Building Safe AI Assistants

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer