Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence

arXiv cs.CL / 27 March 2026


Key Points

  • The paper introduces a unified approach to measure narrative coherence in visually grounded stories by comparing human-written narratives with outputs from vision-language models (VLMs) using the Visual Writing Prompts corpus.
  • It defines a narrative coherence score based on multiple dimensions, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding.
  • Results show that VLM-generated narratives share broadly similar coherence “profiles” with one another, but these profiles differ systematically from human narratives in how discourse is organized across the visual story.
  • While individual coherence differences can be subtle, the study finds they become more apparent when the metrics are evaluated jointly.
  • The authors release their code publicly on GitHub to support replication and further coherence-driven evaluation.

Abstract

We study narrative coherence in visually grounded stories by comparing human-written narratives with those generated by vision-language models (VLMs) on the Visual Writing Prompts corpus. Using a set of metrics that capture different aspects of narrative coherence, including coreference, discourse relation types, topic continuity, character persistence, and multimodal character grounding, we compute a narrative coherence score. We find that VLMs show broadly similar coherence profiles that differ systematically from those of humans. In addition, differences for individual measures are often subtle, but they become clearer when considered jointly. Overall, our results indicate that, despite human-like surface fluency, model narratives exhibit systematic differences from those of humans in how they organise discourse across a visually grounded story. Our code is available at https://github.com/GU-CLASP/coherence-driven-humans.
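The finding that individual differences are subtle but become clearer when the metrics are considered jointly can be illustrated with a toy sketch. The dimension names below follow the paper, but the scores, the vector representation, and the Euclidean joint distance are all illustrative assumptions, not the paper's actual scoring method:

```python
# Hypothetical sketch: per-dimension coherence gaps between human and
# VLM narratives can each be small, yet add up to a clearer joint gap.
# Dimension names follow the paper; all numbers are made up.
from math import sqrt

DIMENSIONS = ["coreference", "discourse_relations", "topic_continuity",
              "character_persistence", "multimodal_grounding"]

def coherence_profile(scores):
    """Map a dict of per-dimension scores to an ordered vector."""
    return [scores[d] for d in DIMENSIONS]

def joint_distance(p, q):
    """Euclidean distance between two coherence profiles, treating the
    five metrics jointly rather than one at a time."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Illustrative mean scores in [0, 1] (not from the paper).
human = coherence_profile({"coreference": 0.82, "discourse_relations": 0.75,
                           "topic_continuity": 0.80, "character_persistence": 0.78,
                           "multimodal_grounding": 0.70})
vlm = coherence_profile({"coreference": 0.79, "discourse_relations": 0.71,
                         "topic_continuity": 0.77, "character_persistence": 0.74,
                         "multimodal_grounding": 0.66})

per_dim_gaps = [abs(a - b) for a, b in zip(human, vlm)]
print(max(per_dim_gaps))                      # each gap is small on its own
print(round(joint_distance(human, vlm), 3))   # joint gap ≈ 0.081, about twice the largest single gap
```

The point of the sketch is the comparison at the end: no single dimension separates the two profiles by much, but the joint distance over all five dimensions is noticeably larger than any individual gap, mirroring the paper's observation that the coherence differences emerge when the metrics are evaluated together.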