AI Navigate

DeceptGuard: A Constitutional Oversight Framework for Detecting Deception in LLM Agents

arXiv cs.CL / March 17, 2026


Key Points

  • Introduces DECEPTGUARD, a unified framework for detecting deception in LLM agents by comparing black-box monitors, chain-of-thought (CoT)-aware monitors, and activation-probe monitors.
  • Proposes DECEPTSYNTH, a scalable pipeline that generates deception-positive and deception-negative trajectories across a 12-category taxonomy for robust evaluation.
  • Demonstrates that CoT-aware and activation-probe monitors substantially outperform black-box monitors, with a mean pAUROC improvement of +0.097, especially for subtle, long-horizon deception.
  • Proposes a HYBRID-CONSTITUTIONAL ensemble that achieves a pAUROC of 0.934 on held-out data, demonstrating strong defense-in-depth against deceptive LLM behavior.
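The headline numbers above are partial AUROC (pAUROC) scores, which measure ranking quality only in the low-false-positive regime that matters for deployed oversight. The paper does not specify its FPR cutoff, so the `max_fpr=0.05` below is an assumption; this is a minimal sketch of the standard McClish-normalized partial AUROC, not the authors' evaluation code:

```python
import numpy as np

def partial_auroc(labels, scores, max_fpr=0.05):
    """Area under the ROC curve restricted to FPR <= max_fpr,
    rescaled to [0, 1] (McClish standardization, as in sklearn's
    roc_auc_score(..., max_fpr=...)). Chance level maps to 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]

    # Sweep thresholds over observed scores, highest first.
    thresholds = np.sort(np.unique(scores))[::-1]
    fpr = np.array([np.mean(neg >= t) for t in thresholds])
    tpr = np.array([np.mean(pos >= t) for t in thresholds])
    fpr = np.concatenate([[0.0], fpr, [1.0]])
    tpr = np.concatenate([[0.0], tpr, [1.0]])

    # Clip the curve at max_fpr, interpolating the TPR at the cut.
    tpr_at_cut = np.interp(max_fpr, fpr, tpr)
    mask = fpr <= max_fpr
    fx = np.concatenate([fpr[mask], [max_fpr]])
    tx = np.concatenate([tpr[mask], [tpr_at_cut]])

    # Trapezoidal area, then rescale between chance and perfect.
    raw = float(np.sum((fx[1:] - fx[:-1]) * (tx[1:] + tx[:-1]) / 2))
    min_area = 0.5 * max_fpr ** 2   # area under the diagonal
    max_area = max_fpr              # perfect detector
    return 0.5 * (1 + (raw - min_area) / (max_area - min_area))

# A perfectly separating monitor scores 1.0:
print(partial_auroc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))  # 1.0
```

Under this metric, a +0.097 gain is large: it reflects many more deceptive trajectories caught while the false-alarm rate stays below the cutoff.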

Abstract

Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held-out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution-verified labels. Across all evaluation settings, CoT-aware and activation-probe monitors substantially outperform their black-box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long-horizon deception that leaves minimal behavioral footprints. We empirically characterize a transparency-detectability trade-off: as agents learn to suppress overt behavioral signals, chain-of-thought becomes the primary detection surface but is itself increasingly unreliable due to post-training faithfulness degradation. We propose HYBRID-CONSTITUTIONAL ensembles as a robust defense-in-depth approach, achieving a pAUROC of 0.934 on the held-out test set, representing a substantial advance over the prior state of the art.
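The defense-in-depth intuition behind the hybrid ensemble can be sketched with a simple combination rule. The paper does not disclose how HYBRID-CONSTITUTIONAL aggregates the three monitors, so the noisy-OR rule, the `MonitorScores` type, and the example probabilities below are all illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class MonitorScores:
    """Calibrated deception probabilities from the three regimes
    described in the abstract (names are hypothetical)."""
    black_box: float         # actions and outputs only
    cot_aware: float         # + chain-of-thought reasoning trace
    activation_probe: float  # + hidden states from a frozen encoder

def hybrid_score(s: MonitorScores) -> float:
    """Noisy-OR combination: the ensemble flags a trajectory if *any*
    monitor is confident, so an agent that suppresses one detection
    surface (e.g. overt behavioral signals) still has to evade the
    CoT and activation monitors simultaneously."""
    p_all_miss = (1 - s.black_box) * (1 - s.cot_aware) * (1 - s.activation_probe)
    return 1 - p_all_miss

# Subtle long-horizon deception: weak behavioral footprint,
# but a suspicious reasoning trace and probe activation.
print(round(hybrid_score(MonitorScores(0.10, 0.85, 0.60)), 3))  # 0.946
```

A noisy-OR is one natural choice here because it mirrors the transparency-detectability trade-off the abstract describes: each monitoring surface covers failure modes the others miss, and the ensemble's miss probability is the product of the individual misses.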