StoryScope: Investigating idiosyncrasies in AI fiction

arXiv cs.CL / 4/6/2026


Key Points

  • The paper introduces StoryScope, a pipeline that induces an interpretable feature space of discourse-level narrative choices (10 dimensions) to distinguish AI-generated fiction from human writing without relying on surface stylistic signals.
  • Using a parallel corpus of 10,272 writing prompts, each answered by a human author and five LLMs (61,608 stories, ~5,000 words each), the approach achieves 93.2% macro-F1 for human-vs-AI detection using narrative features alone and 68.4% macro-F1 for six-way authorship attribution.
  • A small set of 30 “core” narrative features captures most of the detection signal, with AI stories tending to over-explain themes and use tidy single-track plots, while human stories show more morally ambiguous protagonist choices and higher temporal complexity.
  • The authors also report per-model “fingerprint” narrative features that differentiate specific LLMs (e.g., Claude’s flat event escalation, GPT’s emphasis on dream sequences, and Gemini’s preference for external character description).
  • Overall, the findings suggest that underlying narrative construction patterns (not just writing style) can meaningfully separate human-authored original fiction from AI-generated text.

Abstract

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each answered by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots, while human stories frame protagonists' choices as more morally ambiguous and exhibit greater temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.
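The detection setup the abstract describes, a classifier over per-story narrative feature vectors evaluated with macro-F1, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature values, the choice of logistic regression, and the signal strength of the 30 "core" features are all assumptions; only the dimensionality (304 features per story) comes from the paper.

```python
# Hedged sketch: human-vs-AI detection as binary classification over
# narrative feature vectors, scored with macro-F1. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stories, n_features = 2000, 304  # 304 extracted features per story (paper)

# Hypothetical feature matrix: rows are stories, columns are narrative
# features (e.g., temporal-complexity or theme-explanation scores).
X = rng.normal(size=(n_stories, n_features))
y = rng.integers(0, 2, size=n_stories)  # 0 = human, 1 = AI
X[y == 1, :30] += 0.8  # assume a small "core" subset carries the signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
print(f"macro-F1: {macro_f1:.3f}")
```

Macro-F1 averages the per-class F1 scores, so it weights the human and AI classes equally regardless of how many stories each contributes; the six-way attribution number reported in the paper is the same metric averaged over six author classes.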