BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration

arXiv cs.CV / 4/21/2026

Key Points

  • The paper introduces BookAgent, a safety-aware multi-agent framework aimed at end-to-end synthesis of illustrated storybooks from a user draft rather than relying on fixed storyline sequences.
  • It jointly performs planning, scripting, illustration, and global repair to improve holistic multimodal grounding and coherence across the whole narrative.
  • BookAgent uses dynamic page-level calibration to align textual scripts with visual layouts, improving multimodal consistency at each page.
  • It also performs temporal, sequence-level verification and rectification to reduce global inconsistencies such as character identity errors and storytelling logic issues, while accounting for child-specific safety constraints.
  • The authors report that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, and they plan to release the implementation on GitHub.
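
The plan, script, illustrate, and page-level calibration steps described above can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions: the summary does not describe BookAgent's actual interfaces, so every name here (`Page`, `plan`, `script_page`, `illustrate`, `aligned`, `calibrate_page`, `build_book`) is hypothetical, and string manipulation stands in for the generative models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    script: str                  # text shown on this page
    layout: str = ""             # stand-in for a visual layout description
    calibration_rounds: int = 0  # how many re-illustration passes were needed


def plan(draft: str) -> List[str]:
    """Hypothetical planner: split a user draft into page-level beats."""
    return [s.strip() for s in draft.split(".") if s.strip()]


def script_page(beat: str) -> Page:
    """Hypothetical scripter: turn one beat into a page script."""
    return Page(script=beat)


def illustrate(page: Page) -> Page:
    """Hypothetical illustrator: derive a layout description from the script."""
    page.layout = f"scene: {page.script.lower()}"
    return page


def aligned(page: Page) -> bool:
    """Toy alignment check: every script word must appear in the layout."""
    return all(word.lower() in page.layout for word in page.script.split())


def calibrate_page(page: Page, max_rounds: int = 3) -> Page:
    """Re-illustrate until script and layout agree or the budget runs out."""
    while not aligned(page) and page.calibration_rounds < max_rounds:
        page = illustrate(page)
        page.calibration_rounds += 1
    return page


def build_book(draft: str) -> List[Page]:
    """Plan the draft into beats, script each beat, then calibrate each page."""
    pages = [script_page(beat) for beat in plan(draft)]
    return [calibrate_page(p) for p in pages]


book = build_book("Mia finds a kite. The kite flies high.")
for page in book:
    print(page.script, "|", page.layout, "|", page.calibration_rounds)
```

The point of the sketch is the bounded retry loop in `calibrate_page`: alignment is checked per page, and illustration is repeated only while the check fails. A real system would replace `aligned` with a multimodal consistency judge.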

Abstract

Recent advancements in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge, where prior works mainly decompose this task into separate stages, and thus, holistic multi-modal grounding remains limited. Besides, while safety alignment is studied for text- or image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a safety-aware multi-agent collaboration framework designed for high-quality, safety-aware visual narratives. Different from prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi-modal grounding, BookAgent dynamically calibrates page-level alignment between textual scripts and visual layouts. Furthermore, BookAgent calibrates holistic consistency from the temporal dimension, by verifying-then-rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi-modal creation. The implementation will be publicly released at https://github.com/bogao-code/BookAgent/tree/main.
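
The verify-then-rectify step over the whole sequence can be illustrated with a toy example. This is a sketch under loud assumptions, not the authors' method: character identity is reduced to a text descriptor per page, the canonical identity is chosen by majority vote, and child safety is approximated by a hypothetical word blocklist (`BANNED_TERMS`). The functions `verify`, `rectify`, and `flag_unsafe` are illustrative names only.

```python
from collections import Counter
from typing import Dict, List

# Each page maps a character name to the appearance descriptor used on that page.
PageCast = Dict[str, str]

# Hypothetical child-safety blocklist; a real system would use a learned judge.
BANNED_TERMS = {"knife", "blood"}


def verify(pages: List[PageCast]) -> Dict[str, str]:
    """Return the most frequent descriptor per character across all pages."""
    votes: Dict[str, Counter] = {}
    for cast in pages:
        for name, look in cast.items():
            votes.setdefault(name, Counter())[look] += 1
    return {name: counts.most_common(1)[0][0] for name, counts in votes.items()}


def rectify(pages: List[PageCast], canon: Dict[str, str]) -> List[int]:
    """Overwrite descriptors that disagree with the canonical one.

    Returns the indices of the pages that were repaired.
    """
    repaired: List[int] = []
    for i, cast in enumerate(pages):
        for name in cast:
            if cast[name] != canon[name]:
                cast[name] = canon[name]
                if i not in repaired:
                    repaired.append(i)
    return repaired


def flag_unsafe(scripts: List[str]) -> List[int]:
    """Return indices of page scripts containing a blocklisted word."""
    return [i for i, s in enumerate(scripts)
            if set(s.lower().split()) & BANNED_TERMS]


pages = [{"Mia": "red coat"}, {"Mia": "blue coat"}, {"Mia": "red coat"}]
canon = verify(pages)
print(rectify(pages, canon))
print(flag_unsafe(["Mia flies a kite", "Mia finds a knife"]))
```

Separating `verify` from `rectify` mirrors the abstract's wording: inconsistencies are first detected against a global reference, and only the offending pages are then repaired.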