Spoiler Alert: Narrative Forecasting as a Metric for Tension in LLM Storytelling

arXiv cs.CL · April 14, 2026


Key Points

  • The paper argues that current LLM creativity benchmarks (like EQ-Bench) miss a critical dimension of compelling stories—narrative tension—and that judges/rubrics can incorrectly prefer AI-generated stories over top human fiction.
  • It introduces the “100-Endings” metric, which uses sentence-by-sentence prediction of how a story will end (100 times per position) and defines tension as the frequency with which the model’s predictions fail to match the true continuation.
  • The approach goes beyond mismatch rate by analyzing the sentence-level tension curve, including statistics such as inflection rate to capture twists and revelations.
  • In reported evaluation, 100-Endings ranks New Yorker short stories higher than zero-shot LLM outputs, and the metric is used to design an LLM story-generation pipeline with structural constraints.
  • The authors claim their constrained generation pipeline increases narrative tension per 100-Endings while retaining strong performance on the EQ-Bench leaderboard.
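The core of the 100-Endings procedure, as described above, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict_ending` stands in for an LLM sampler, and exact string comparison is a placeholder for whatever (likely semantic) matching criterion the authors use.

```python
def mismatch_rate(predict_ending, prefix_sentences, true_ending, n_samples=100):
    """Fraction of sampled endings that fail to match the true ending.

    `predict_ending` is a hypothetical sampler: given the story-so-far,
    it returns one candidate ending (e.g. by prompting an LLM).
    Exact string equality is a stand-in for the paper's match criterion.
    """
    prefix = " ".join(prefix_sentences)
    misses = sum(predict_ending(prefix) != true_ending for _ in range(n_samples))
    return misses / n_samples

def tension_curve(predict_ending, sentences, true_ending, n_samples=100):
    """Walk the story sentence by sentence; at each position, measure how
    often sampled endings miss the ground truth. High values = high tension."""
    return [
        mismatch_rate(predict_ending, sentences[:i], true_ending, n_samples)
        for i in range(1, len(sentences) + 1)
    ]
```

In this reading, a predictable story yields a curve that drops toward zero early, while a tense story keeps the mismatch rate high until late.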

Abstract

LLMs have so far failed both to generate consistently compelling stories and to recognize this failure: on the leading creative-writing benchmark (EQ-Bench), LLM judges rank zero-shot AI stories above New Yorker short stories, a gold standard for literary fiction. We argue that existing rubrics overlook a key dimension of compelling human stories: narrative tension. We introduce the 100-Endings metric, which walks through a story sentence by sentence: at each position, a model predicts how the story will end 100 times given only the text so far, and we measure tension as how often predictions fail to match the ground truth. Beyond the mismatch rate, the sentence-level curve yields complementary statistics, such as inflection rate, a geometric measure of how frequently the curve reverses direction, tracking twists and revelations. Unlike rubric-based judges, 100-Endings correctly ranks New Yorker stories far above LLM outputs. Grounded in narratological principles, we design a story-generation pipeline using structural constraints, including analysis of story templates, idea formulation, and narrative scaffolding. Our pipeline significantly increases narrative tension as measured by the 100-Endings metric, while maintaining performance on the EQ-Bench leaderboard.
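The abstract describes inflection rate as a geometric measure of how often the tension curve reverses direction. One plausible formalization (an assumption; the paper's exact formula may differ) counts sign changes in the curve's discrete differences:

```python
def inflection_rate(curve):
    """Fraction of consecutive step-pairs where a tension curve reverses
    direction (rising to falling or vice versa), a rough proxy for
    twists and revelations. Assumed definition for illustration only.
    """
    diffs = [b - a for a, b in zip(curve, curve[1:])]
    # Drop zero steps so flat plateaus neither hide nor create reversals.
    rising = [d > 0 for d in diffs if d != 0]
    reversals = sum(a != b for a, b in zip(rising, rising[1:]))
    return reversals / max(len(rising) - 1, 1)
```

Under this definition, a monotone curve scores 0 while a curve that zigzags at every step scores 1, so a story full of reversals (twists) is distinguished from one whose predictability simply decays smoothly.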