StoryScope: Investigating idiosyncrasies in AI fiction

arXiv cs.CL / 4/6/2026


Key Points

  • The paper introduces StoryScope, a pipeline that induces an interpretable feature space of discourse-level narrative choices (10 dimensions) to distinguish AI-generated fiction from human writing without relying on surface stylistic signals.
  • Using a parallel corpus of 10,272 writing prompts, each answered by a human author and five LLMs (61,608 stories, ~5,000 words each), the approach achieves 93.2% macro-F1 for human-vs-AI detection using narrative features alone and 68.4% macro-F1 for six-way authorship attribution.
  • A small set of 30 “core” narrative features captures most of the detection signal, with AI stories tending to over-explain themes and use tidy single-track plots, while human stories show more morally ambiguous protagonist choices and higher temporal complexity.
  • The authors also report per-model “fingerprint” narrative features that differentiate specific LLMs (e.g., Claude’s flat event escalation, GPT’s emphasis on dream sequences, and Gemini’s preference for external character description).
  • Overall, the findings suggest that underlying narrative construction patterns (not just writing style) can meaningfully separate human-authored original fiction from AI-generated text.

Abstract

As AI-generated fiction becomes increasingly prevalent, questions of authorship and originality are becoming central to how written work is evaluated. While most existing work in this space focuses on identifying surface-level signatures of AI writing, we ask instead whether AI-generated stories can be distinguished from human ones without relying on stylistic signals, focusing on discourse-level narrative choices such as character agency and chronological discontinuity. We propose StoryScope, a pipeline that automatically induces a fine-grained, interpretable feature space of discourse-level narrative features across 10 dimensions. We apply StoryScope to a parallel corpus of 10,272 writing prompts, each answered by a human author and five LLMs, yielding 61,608 stories, each ~5,000 words, and 304 extracted features per story. Narrative features alone achieve 93.2% macro-F1 for human vs. AI detection and 68.4% macro-F1 for six-way authorship attribution, retaining over 97% of the performance of models that include stylistic cues. A compact set of 30 core narrative features captures much of this signal: AI stories over-explain themes and favor tidy, single-track plots, while human stories frame protagonists' choices as more morally ambiguous and exhibit greater temporal complexity. Per-model fingerprint features enable six-way attribution: for example, Claude produces notably flat event escalation, GPT over-indexes on dream sequences, and Gemini defaults to external character description. We find that AI-generated stories cluster in a shared region of narrative space, while human-authored stories exhibit greater diversity. More broadly, these results suggest that differences in underlying narrative construction, not just writing style, can be used to separate human-written original works from AI-generated fiction.
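The detection setup the abstract describes, a classifier over per-story narrative feature vectors evaluated with macro-F1, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the feature values, the choice of logistic regression, and the signal strength of the 30 "core" features are all assumptions; only the dimensionality (304 features per story) comes from the paper.

```python
# Hedged sketch: human-vs-AI detection as binary classification over
# narrative feature vectors, scored with macro-F1. Synthetic data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_stories, n_features = 2000, 304  # 304 extracted features per story (paper)

# Hypothetical feature matrix: rows are stories, columns are narrative
# features (e.g., temporal-complexity or theme-explanation scores).
X = rng.normal(size=(n_stories, n_features))
y = rng.integers(0, 2, size=n_stories)  # 0 = human, 1 = AI
X[y == 1, :30] += 0.8  # assume a small "core" subset carries the signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
print(f"macro-F1: {macro_f1:.3f}")
```

Macro-F1 averages the per-class F1 scores, so it weights the human and AI classes equally regardless of how many stories each contributes; the six-way attribution number reported in the paper is the same metric averaged over six author classes.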