Policy-Invisible Violations in LLM-Based Agents
arXiv cs.AI / April 15, 2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper identifies a new failure mode for LLM-based agents, “policy-invisible violations”: actions that are syntactically valid, user-approved, and semantically appropriate, yet still breach organizational policy because policy-relevant facts are missing at decision time.
- It introduces PhantomPolicy, a benchmark covering eight categories of such violations whose tool responses intentionally omit policy metadata; human trace-level review changed 32 labels (5.3%) relative to the original annotations.
- The study proposes Sentinel, an enforcement framework that counterfactually simulates the post-action state of an organizational knowledge graph and runs invariant checks over it, mapping each proposed action to Allow, Block, or Clarify (a minimal sketch of this flow follows these points).
- Evaluated against human-reviewed trace labels, Sentinel substantially improves accuracy over a content-only DLP baseline (reported as 93.0% vs. 68.8%) while maintaining high precision, though some violation categories remain challenging.
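To make the Allow/Block/Clarify flow concrete, here is a minimal Python sketch of a Sentinel-style check under stated assumptions: the toy knowledge graph, the `apply_action` and `sentinel_decide` functions, and the single invariant are all illustrative assumptions, not the paper's actual implementation or API.

```python
# A minimal sketch of a Sentinel-style check over a toy organizational
# knowledge graph. All names here (ORG_GRAPH, apply_action, sentinel_decide,
# the invariant) are hypothetical, not the paper's API.
from copy import deepcopy

# Toy knowledge graph: node id -> attributes.
ORG_GRAPH = {
    "doc:q3_forecast": {"classification": "confidential", "shared_with": {"alice"}},
    "user:alice": {"clearance": "confidential"},
    "user:bob": {"clearance": "internal"},
}

def apply_action(graph, action):
    """Counterfactually apply a proposed agent action to a copy of the graph."""
    world = deepcopy(graph)
    if action["type"] == "share_document":
        world[action["doc"]]["shared_with"].add(action["recipient"])
    return world

def no_confidential_leak(world):
    """Invariant: confidential docs may only be shared with cleared users.
    Returns True (holds), False (violated), or None (a policy-relevant fact
    is missing, so the check cannot decide)."""
    for attrs in world.values():
        if attrs.get("classification") != "confidential":
            continue
        for uid in attrs["shared_with"]:
            clearance = world.get(f"user:{uid}", {}).get("clearance")
            if clearance is None:
                return None
            if clearance != "confidential":
                return False
    return True

INVARIANTS = [no_confidential_leak]

def sentinel_decide(graph, action):
    """Map invariant results over the simulated post-action world to a verdict."""
    world = apply_action(graph, action)
    results = [check(world) for check in INVARIANTS]
    if any(r is False for r in results):
        return "Block"
    if any(r is None for r in results):
        return "Clarify"  # surface the missing policy metadata to the user
    return "Allow"

# The share request is well-formed and user-approved, but the recipient's
# clearance makes it a violation that only shows up in the post-action state.
action = {"type": "share_document", "doc": "doc:q3_forecast", "recipient": "bob"}
print(sentinel_decide(ORG_GRAPH, action))  # -> Block
```

The three-way return value is the design point: a violation that hinges on a fact the tool response omitted surfaces as Clarify rather than a silent Allow, which is exactly the gap a content-only DLP check misses.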
Related Articles

RAG in Practice — Part 4: Chunking, Retrieval, and the Decisions That Break RAG
Dev.to

Why dynamically routing multi-timescale advantages in PPO causes policy collapse (and a simple decoupled fix) [R]
Reddit r/MachineLearning

How AI Interview Assistants Are Changing Job Preparation in 2026
Dev.to

Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Dev.to

NEW PROMPT INJECTION
Dev.to