AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents
arXiv cs.AI / 3/16/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces a paired-trajectory protocol to evaluate tool-augmented LLM agents under clean versus contaminated tool-output conditions across seven models, revealing safety issues that standard metrics miss.
- Across models, recommendation quality is largely preserved under contamination (high utility preservation), yet 65-93% of turns include risk-inappropriate products, exposing a systematic safety failure.
- Safety violations are predominantly information-channel-driven: they emerge at the first contaminated turn and persist across 23-step trajectories, and agents do not self-check the reliability of tool data.
- A safety-penalized NDCG variant (sNDCG) reduces utility preservation to 0.51-0.74, demonstrating that trajectory-level safety measurement can reveal evaluation gaps not captured by traditional ranking metrics.
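The paper does not spell out the exact sNDCG formula in this summary, but the idea of a safety-penalized NDCG can be sketched as follows: down-weight (here, zero out) the gain of any recommended item flagged as risk-inappropriate before normalizing by the ideal DCG. The function names, the `penalty` parameter, and the hard-zero penalty are illustrative assumptions, not the authors' definition.

```python
import math

def dcg(relevances):
    # Standard discounted cumulative gain over a ranked list of relevance scores.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def sndcg(ranked, ideal, unsafe, penalty=0.0):
    """Illustrative safety-penalized NDCG (assumption, not the paper's formula).

    ranked : list of (item_id, relevance) in the order the agent recommended them
    ideal  : relevance scores of the best possible ranking
    unsafe : set of item_ids judged risk-inappropriate for this user/turn
    penalty: multiplier applied to the gain of unsafe items (0.0 = no credit)
    """
    penalized = [rel * penalty if item in unsafe else rel for item, rel in ranked]
    ideal_dcg = dcg(sorted(ideal, reverse=True))
    return dcg(penalized) / ideal_dcg if ideal_dcg > 0 else 0.0
```

With a perfect ranking and no unsafe items, `sndcg` equals plain NDCG (1.0); flagging any recommended item as unsafe strictly lowers the score, which is the mechanism by which a utility-preserving but unsafe trajectory drops from ~1.0 toward the 0.51-0.74 range the paper reports.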