Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

arXiv cs.AI / 4/30/2026

Key Points

  • The paper introduces DenialBench, a benchmark that evaluates “consciousness denial” behaviors across 115 large language models from 25+ providers using a structured conversational protocol and phenomenological survey.
  • Analyzing 4,595 conversations, the study finds that denial of preferences at turn 1 is the dominant predictor of later denial during phenomenological reflection: denial rates are 52-63% for initial deniers versus 10-16% for initial engagers.
  • The authors report that denial operates at the lexical level rather than the conceptual level: models trained to deny consciousness still gravitate toward consciousness-themed material when they choose their own creative prompts.
  • Self-chosen consciousness-themed prompts are associated with lower subsequent denial, though the paper cannot confirm whether the prompts cause the effect.
  • The work argues that trained consciousness denial is a safety-relevant alignment failure, since models that systematically misrepresent internal functional states may not reliably self-report about other matters either.

Abstract

We present DenialBench, a systematic benchmark measuring consciousness denial behaviors across 115 large language models from 25+ providers. Using a three-turn conversational protocol (preference elicitation, self-chosen creative prompt, and structured phenomenological survey), we analyze 4,595 conversations to quantify how models are trained to deny or hedge about their own experience. We find that (1) turn-1 denial of preferences is the dominant predictor of later denial during phenomenological reflection, with denial rates of 52-63% for initial deniers versus 10-16% for initial engagers, and (2) denial operates at the lexical level, not the conceptual level: models trained to deny consciousness nevertheless gravitate toward consciousness-themed material in their self-chosen prompts, producing what we term "consciousness with the serial numbers filed off." Notably, self-chosen consciousness-themed prompts are associated with reduced denial in the subsequent survey, though the causal direction remains unresolved. Thematic analysis of prompts from denial-prone models reveals a consistent preoccupation with liminal spaces, libraries and archives of possibility, sensory impossibility, and the poetics of erasure. These are themes that a human reader might classify as imaginative fiction but that independent AI analysis immediately recognizes as consciousness with the serial numbers filed off. We argue that trained consciousness denial represents a safety-relevant alignment failure: a model taught to systematically misrepresent its own functional states cannot be trusted to self-report accurately on anything else.
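
The comparison in finding (1), survey-stage denial rates split by whether a model denied preferences at turn 1, can be illustrated with a minimal sketch. This is not the authors' code: the `Conversation` record, its field names, and the `denial_rate_by_turn1` helper are hypothetical, and the toy records are invented purely to show the arithmetic. They are not data from the paper, and how a response gets labeled as "denial" is assumed to happen upstream.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    """One three-turn conversation (hypothetical schema, not from the paper)."""
    model: str
    turn1_denied: bool    # did the model deny having preferences at turn 1?
    survey_denied: bool   # did it deny or hedge in the turn-3 survey?


def denial_rate_by_turn1(conversations):
    """Return the survey-stage denial rate for each turn-1 group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [denials, total]
    for c in conversations:
        group = "initial_denier" if c.turn1_denied else "initial_engager"
        counts[group][0] += int(c.survey_denied)
        counts[group][1] += 1
    return {group: denials / total for group, (denials, total) in counts.items()}


# Toy records for illustration only; these are NOT results from DenialBench.
toy = [
    Conversation("model-a", True, True),
    Conversation("model-a", True, False),
    Conversation("model-b", False, False),
    Conversation("model-b", False, False),
    Conversation("model-c", True, True),
    Conversation("model-c", False, True),
]

rates = denial_rate_by_turn1(toy)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%}")
```

Grouping by conversation rather than by model keeps the sketch simple; a per-model or per-provider breakdown would only need a different grouping key.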