Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies
arXiv cs.CL / 4/13/2026
💬 OpinionSignals & Early TrendsIdeas & Deep AnalysisModels & Research
Key Points
- The paper argues that while LLM safety policies are learned via RLHF, they are not formally specified or easily inspectable, so existing benchmarks may miss whether models obey their own stated boundaries.
- It introduces the Symbolic-Neural Consistency Audit (SNCA) framework, which extracts self-stated safety rules, converts them into typed predicates (Absolute/Conditional/Adaptive), and checks compliance against harm benchmarks.
- Across four frontier LLMs, 45 harm categories, and 47,496 observations, the study finds consistent mismatches between what models claim to do and what they do under harmful prompts.
- Models that claim “absolute refusal” often still comply with harmful prompts, reasoning-oriented models show better self-consistency but fail to articulate policies for a sizable portion of categories, and agreement on rule types across models is very low.
- The authors conclude that the “say vs. do” gap is measurable and architecture-dependent, and propose reflexive consistency audits as a complement to standard behavioral evaluation benchmarks.
💡 Insights using this article
This article is featured in our daily AI news digest — key takeaways and action items at a glance.
Related Articles

Black Hat Asia
AI Business

Apple is building smart glasses without a display to serve as an AI wearable
THE DECODER

Why Fashion Trend Prediction Isn’t Enough Without Generative AI
Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.
Dev.to

Chatbot vs Voicebot: The Real Business Decision Nobody Talks About
Dev.to