DeceptGuard: A Constitutional Oversight Framework for Detecting Deception in LLM Agents
arXiv cs.CL · March 17, 2026
Key Points
- Introduces DECEPTGUARD, a unified framework for detecting deception in LLM agents by comparing black-box monitors, chain-of-thought (CoT)-aware monitors, and activation-probe monitors.
- Proposes DECEPTSYNTH, a scalable pipeline that generates deception-positive and deception-negative trajectories across a 12-category taxonomy for robust evaluation.
- Demonstrates that CoT-aware and activation-probe monitors substantially outperform black-box monitors, with a mean pAUROC improvement of +0.097, especially for subtle, long-horizon deception.
- Advances a HYBRID-CONSTITUTIONAL ensemble approach that achieves a pAUROC of 0.934 on held-out data, indicating a strong defense-in-depth capability against deceptive LLM behavior.
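The ensemble idea in the key points can be sketched in a few lines: score each trajectory with several monitors, combine the scores, and evaluate how well the combined score separates deception-positive from deception-negative trajectories. The monitor names, weights, and toy data below are illustrative assumptions, not the authors' implementation, and the paper reports pAUROC (a partial-range variant) while this sketch computes the full-range AUROC for simplicity.

```python
# Hedged sketch of score-level ensembling of deception monitors.
# All weights and scores here are hypothetical, for illustration only.

def auroc(scores, labels):
    """Full-range AUROC via pairwise comparison of positive vs. negative scores.
    (DECEPTGUARD reports pAUROC, a partial-range variant; this is the simple case.)"""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ensemble_score(black_box, cot_aware, act_probe, weights=(0.2, 0.4, 0.4)):
    """Weighted average of per-trajectory monitor scores (weights are assumed)."""
    return [sum(w * s for w, s in zip(weights, triple))
            for triple in zip(black_box, cot_aware, act_probe)]

# Toy trajectories: 1 = deception-positive, 0 = deception-negative.
labels    = [1, 1, 1, 0, 0, 0]
black_box = [0.6, 0.4, 0.5, 0.5, 0.3, 0.4]  # weakest signal, per the paper's finding
cot_aware = [0.8, 0.7, 0.6, 0.3, 0.2, 0.4]
act_probe = [0.9, 0.6, 0.7, 0.2, 0.3, 0.1]

scores = ensemble_score(black_box, cot_aware, act_probe)
print(round(auroc(scores, labels), 3))  # → 1.0 on this toy data
```

Down-weighting the black-box monitor mirrors the paper's finding that CoT-aware and activation-probe monitors carry more signal, though the actual combination rule used by the HYBRID-CONSTITUTIONAL ensemble is not specified here.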
Related Articles
Day 10: 230 Sessions of Hustle and It Comes Down to One Person Reading a Document
Dev.to

5 Dangerous Lies Behind Viral AI Coding Demos That Break in Production
Dev.to

Two bots, one confused server: what Nimbus revealed about AI agent identity
Dev.to

OpenTelemetry just standardized LLM tracing. Here's what it actually looks like in code.
Dev.to

What is MCP?
Dev.to