
Directing the Narrative: A Finetuning Method for Controlling Coherence and Style in Story Generation

arXiv cs.CV / 3/19/2026


Key Points

  • The paper proposes a two-stage framework for story generation that combines Group-Shared Attention (GSA) and Direct Preference Optimization (DPO) to improve coherence and stylistic consistency.
  • Group-Shared Attention enables lossless cross-sample information flow within attention layers to encode identity consistency across frames without relying on external encoders.
  • Direct Preference Optimization aligns generated outputs with human aesthetic and narrative standards by learning from holistic preference data rather than conflicting auxiliary losses.
  • On ViStoryBench, the approach achieves state-of-the-art results, with a +10.0 gain in Character Identity (CIDS) and a +18.7 gain in Style Consistency (CSD), while preserving high-fidelity generation.
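The summary above does not give implementation details for Group-Shared Attention, but the core idea of "cross-sample information flow within attention layers" can be sketched as letting every frame's queries attend to a key/value pool shared across all frames in the story group. The function below is a hypothetical minimal sketch of that idea (the name `group_shared_attention` and the single-head, unbatched shapes are assumptions, not the paper's actual architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def group_shared_attention(q, k, v):
    """Sketch of group-shared attention: every frame's queries attend to
    the keys/values of ALL frames in the story group, not just their own,
    so identity information can flow across frames without an external
    encoder.

    q, k, v: arrays of shape (frames, tokens, dim).
    Returns an array of shape (frames, tokens, dim).
    """
    f, t, d = q.shape
    # Pool keys/values across the whole group: (frames * tokens, dim).
    k_shared = k.reshape(f * t, d)
    v_shared = v.reshape(f * t, d)
    # Each frame's queries score against the shared pool.
    scores = q @ k_shared.T / np.sqrt(d)   # (frames, tokens, frames * tokens)
    weights = softmax(scores, axis=-1)
    return weights @ v_shared              # (frames, tokens, dim)
```

Because the shared pool simply concatenates every frame's tokens, no information is discarded before attention is computed, which matches the "lossless" framing in the summary.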

Abstract

Story visualization requires generating sequential imagery that aligns semantically with evolving narratives while maintaining rigorous consistency in character identity and visual style. However, existing methodologies often struggle with subject inconsistency and identity drift, particularly when depicting complex interactions or extended narrative arcs. To address these challenges, we propose a cohesive two-stage framework designed for robust and consistent story generation. First, we introduce Group-Shared Attention (GSA), a mechanism that fosters intrinsic consistency by enabling lossless cross-sample information flow within attention layers. This allows the model to structurally encode identity correspondence across frames without relying on external encoders. Second, we leverage Direct Preference Optimization (DPO) to align generated outputs with human aesthetic and narrative standards. Unlike conventional methods that rely on conflicting auxiliary losses, our approach simultaneously enhances visual fidelity and identity preservation by learning from holistic preference data. Extensive evaluations on the ViStoryBench benchmark demonstrate that our method establishes a new state-of-the-art, significantly outperforming strong baselines with gains of +10.0 in Character Identity (CIDS) and +18.7 in Style Consistency (CSD), all while preserving high-fidelity generation.
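The second stage relies on Direct Preference Optimization, which trains the model directly on pairs of preferred and rejected outputs instead of balancing auxiliary losses. A minimal sketch of the standard DPO objective for one preference pair is below (the function name and scalar log-likelihood interface are illustrative assumptions; the paper applies DPO to image generation, where the log-likelihood terms are model-specific):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_w / logp_l: policy log-likelihoods of the preferred ("winner")
    and rejected ("loser") outputs; ref_logp_w / ref_logp_l are the same
    quantities under a frozen reference model. beta controls how far the
    policy may drift from the reference.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy favors the winner
    # more strongly than the reference does.
    return math.log(1.0 + math.exp(-margin))
```

When the policy matches the reference the margin is zero and the loss is log 2; widening the gap in favor of the preferred output drives the loss toward zero, which is how holistic preference data can push fidelity and identity preservation at once.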