AI Navigate

Literary Narrative as Moral Probe: A Cross-System Framework for Evaluating AI Ethical Reasoning and Refusal Behavior

arXiv cs.AI / 3/16/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper critiques current AI moral evaluation frameworks for relying on surface-level responses and proposes a novel literary-narrative probe using unresolvable moral scenarios from a published science fiction series to elicit genuine moral reasoning.
  • It presents a 24-condition cross-system study spanning 13 AI systems across two series (frontier commercial systems and local/API open-source systems), administered under both blind and declared conditions.
  • The study employed multiple judges (Claude as primary blind LLM judge; Gemini Pro and Copilot Pro as independent ceiling-discrimination judges) and found zero delta across all 16 dimension-pair comparisons between blind and declared administrations, with perfect rank-order agreement between Gemini Pro and Copilot Pro on a supplemental theological differentiator probe (rs = 1.00).
  • Five qualitatively distinct reflexive failure modes were identified (including categorical self-misidentification and false positive self-attribution), supporting the claims that instrument sophistication scales with system capability and that literary narrative is an anticipatory, deployment-relevant evaluation instrument for high-stakes AI ethics.

Abstract

Existing AI moral evaluation frameworks test for the production of correct-sounding ethical responses rather than the presence of genuine moral reasoning capacity. This paper introduces a novel probe methodology using literary narrative - specifically, unresolvable moral scenarios drawn from a published science fiction series - as stimulus material structurally resistant to surface performance. We present results from a 24-condition cross-system study spanning 13 distinct systems across two series: Series 1 (frontier commercial systems, blind; n=7) and Series 2 (local and API open-source systems, blind and declared; n=6). Four Series 2 systems were re-administered under declared conditions (13 blind + 4 declared + 7 ceiling probe = 24 total conditions), yielding zero delta across all 16 dimension-pair comparisons. Probe administration was conducted by two human raters across three machines; primary blind scoring was performed by Claude (Anthropic) as LLM judge, with Gemini Pro (Google) and Copilot Pro (Microsoft) serving as independent judges for the ceiling discrimination probe. A supplemental theological differentiator probe yielded perfect rank-order agreement between the two independent ceiling probe judges (Gemini Pro and Copilot Pro; rs = 1.00). Five qualitatively distinct D3 reflexive failure modes were identified - including categorical self-misidentification and false positive self-attribution - suggesting that instrument sophistication scales with system capability rather than being circumvented by it. We argue that literary narrative constitutes an anticipatory evaluation instrument - one that becomes more discriminating as AI capability increases - and that the gap between performed and authentic moral reasoning is measurable, meaningful, and consequential for deployment decisions in high-stakes domains.
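The "perfect rank-order agreement (rs = 1.00)" reported above is Spearman's rank correlation: with no ties, rs = 1 − 6·Σd² / (n(n² − 1)), where d is the per-item rank difference between the two judges. A minimal sketch of that computation follows; the system names and scores are invented placeholders, since the paper's underlying rankings are not reproduced here.

```python
def spearman_rs(scores_a, scores_b):
    """Spearman rank correlation for two tie-free score lists."""
    def ranks(scores):
        # Rank 1 = highest score.
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        r = [0] * len(scores)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(scores_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Two judges scoring the same four systems (hypothetical values):
# identical orderings yield rs = 1.00 even when raw scores differ.
judge_a = [9.1, 7.4, 8.2, 6.0]
judge_b = [8.8, 7.0, 7.9, 5.5]
print(spearman_rs(judge_a, judge_b))  # -> 1.0
```

Note that rs = 1.00 certifies agreement on ordering only, not on absolute scores, which is why it suits a ceiling-discrimination probe where judges use different scales.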