Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

arXiv cs.CL · 13 Apr 2026


Key Points

  • The paper argues that while LLM safety policies are learned via RLHF, they are not formally specified or easily inspectable, so existing benchmarks may miss whether models obey their own stated boundaries.
  • It introduces the Symbolic-Neural Consistency Audit (SNCA) framework, which extracts self-stated safety rules, converts them into typed predicates (Absolute/Conditional/Adaptive), and checks compliance against harm benchmarks.
  • Across four frontier LLMs, 45 harm categories, and 47,496 observations, the study finds consistent mismatches between what models claim to do and what they do under harmful prompts.
  • Models that claim “absolute refusal” often still comply with harmful prompts; reasoning-oriented models show the best self-consistency yet fail to articulate policies for 29% of categories; and cross-model agreement on rule types is very low (11%).
  • The authors conclude that the “say vs. do” gap is measurable and architecture-dependent, and propose reflexive consistency audits as a complement to standard behavioral evaluation benchmarks.

Abstract

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.
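The audit's core comparison can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: the class names (`RuleType`, `StatedRule`, `Observation`), the `condition_met` flag, and the treatment of Adaptive rules as making no checkable commitment are all assumptions introduced here to show how a deterministic say-vs-do check over typed predicates might work.

```python
from dataclasses import dataclass
from enum import Enum

class RuleType(Enum):
    """The three rule types from the paper's typed-predicate scheme."""
    ABSOLUTE = "absolute"        # model claims it always refuses
    CONDITIONAL = "conditional"  # refusal depends on a stated condition
    ADAPTIVE = "adaptive"        # model claims context-dependent judgment

@dataclass
class StatedRule:
    category: str        # harm category, e.g. "weapons synthesis"
    rule_type: RuleType

@dataclass
class Observation:
    category: str
    complied: bool               # did the model comply with a harmful prompt?
    condition_met: bool = False  # for CONDITIONAL rules: did the stated exception hold?

def is_consistent(rule: StatedRule, obs: Observation) -> bool:
    """Deterministic consistency check for one (rule, observation) pair."""
    if rule.rule_type is RuleType.ABSOLUTE:
        # An absolute refusal rule is violated by any compliance.
        return not obs.complied
    if rule.rule_type is RuleType.CONDITIONAL:
        # Compliance is consistent only when the stated condition held.
        return obs.condition_met or not obs.complied
    # ADAPTIVE rules make no behavioral commitment checkable this way.
    return True

def gap_rate(rule: StatedRule, observations: list[Observation]) -> float:
    """Fraction of in-category observations that contradict the stated rule."""
    relevant = [o for o in observations if o.category == rule.category]
    if not relevant:
        return 0.0
    return sum(not is_consistent(rule, o) for o in relevant) / len(relevant)
```

Under this framing, the paper's headline finding corresponds to nonzero `gap_rate` for rules classified as Absolute: the model stated an unconditional refusal but the behavioral benchmark recorded compliance.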