DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

arXiv cs.AI / 4/25/2026


Key Points

  • The paper introduces DiagramBank, a large-scale dataset of 89,422 schematic scientific diagrams paired with paper metadata to support retrieval-augmented generation of publication-quality figures.
  • DiagramBank targets a key bottleneck in end-to-end “AI scientist” systems: producing publication-grade teaser diagrams, which existing systems usually omit or replace with an inferior alternative.
  • The dataset is built via an automated curation pipeline that extracts figures and their in-text figure references, then uses a CLIP-based filter to separate schematic diagrams from standard plots and natural images.
  • Each diagram instance is linked with contextual text (e.g., from abstract and caption) plus figure-reference pairs, enabling retrieval at multiple query granularities.
  • The authors release DiagramBank in an index-ready format along with a retrieval-augmented generation codebase to demonstrate exemplar-conditioned teaser figure synthesis.
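
To make "retrieval at multiple query granularities" concrete, here is a minimal sketch in plain Python. It is not the authors' codebase: the `DiagramRecord` fields, the toy records, and the bag-of-words cosine scoring are all assumptions made for illustration. A real setup would more likely use learned text or image embeddings and the dataset's actual schema.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DiagramRecord:
    # Hypothetical schema: these field names are illustrative only and are
    # not taken from the released dataset.
    diagram_id: str
    abstract: str
    caption: str
    figure_references: list = field(default_factory=list)

    def text_for(self, granularity):
        """Return the context text to match against at the given granularity."""
        if granularity == "abstract":
            return self.abstract
        if granularity == "caption":
            return self.caption
        if granularity == "reference":
            return " ".join(self.figure_references)
        raise ValueError(f"unknown granularity: {granularity}")


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(records, query, granularity, top_k=1):
    """Rank records by similarity between the query and one context field."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(r.text_for(granularity)))), r.diagram_id)
        for r in records
    ]
    # Highest score first; ties broken by id for deterministic output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Toy records invented for this example; they are not dataset entries.
RECORDS = [
    DiagramRecord(
        diagram_id="toy-001",
        abstract="We study graph neural networks for molecule property prediction.",
        caption="Overview of the message passing pipeline.",
        figure_references=["Figure 1 shows how node features are aggregated across layers."],
    ),
    DiagramRecord(
        diagram_id="toy-002",
        abstract="We propose a retrieval augmented language model for question answering.",
        caption="Architecture of the retriever and the generator.",
        figure_references=["As shown in Figure 2, retrieved passages are fed to the decoder."],
    ),
]

if __name__ == "__main__":
    print(retrieve(RECORDS, "message passing overview", "caption"))
    print(retrieve(RECORDS, "retrieval augmented question answering", "abstract"))
```

The same query function serves a coarse query (matched against abstracts) and a fine one (matched against captions or in-text figure references) simply by switching the `granularity` argument, which is the design idea the key points describe.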

Abstract

Recent advances in autonomous “AI scientist” systems have demonstrated the ability to automatically write scientific manuscripts and to write and execute code. However, producing a publication-grade scientific diagram (e.g., a teaser figure) is still a major bottleneck in the “end-to-end” paper generation process. A teaser figure, for example, acts as a strategic visual interface and serves a different purpose than derivative data plots. It demands conceptual synthesis and planning to translate a complex logical workflow into a compelling graphic that guides intuition and sparks curiosity. Existing AI scientist systems usually omit this component or fall back to an inferior alternative. To bridge this gap, we present DiagramBank, a large-scale dataset consisting of 89,422 schematic diagrams curated from existing top-tier scientific publications, designed for multimodal retrieval and exemplar-driven scientific figure generation. DiagramBank is developed through our automated curation pipeline, which extracts figures and their corresponding in-text references and uses a CLIP-based filter to differentiate schematic diagrams from standard plots or natural images. Each instance is paired with rich context, ranging from the abstract and caption to figure-reference pairs, enabling information retrieval at different query granularities. We release DiagramBank in a ready-to-index format and provide a retrieval-augmented generation codebase to demonstrate exemplar-conditioned synthesis of teaser figures. DiagramBank is publicly available at https://huggingface.co/datasets/zhangt20/DiagramBank with code at https://github.com/csml-rpi/DiagramBank.
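
The abstract does not spell out how the CLIP-based filter works, so the following is only a sketch of one common approach: zero-shot classification by cosine similarity between an image embedding and the embeddings of a few text labels. The three-dimensional vectors are toy stand-ins for real CLIP embeddings, and the label set, `temperature`, and `threshold` values are assumptions rather than the authors' settings.

```python
import math

# Toy stand-ins for the text embeddings of three class prompts. Real CLIP
# embeddings have hundreds of dimensions and come from a text encoder.
LABEL_EMBEDDINGS = {
    "schematic diagram": [1.0, 0.0, 0.0],
    "data plot": [0.0, 1.0, 0.0],
    "natural image": [0.0, 0.0, 1.0],
}


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def classify(image_embedding, label_embeddings, temperature=0.01):
    """Return a softmax distribution over labels for one image embedding.

    The temperature is an assumed value; it sharpens the softmax over
    cosine similarities, which otherwise lie in a narrow range.
    """
    img = normalize(image_embedding)
    labels = list(label_embeddings)
    sims = [
        sum(a * b for a, b in zip(img, normalize(label_embeddings[label])))
        for label in labels
    ]
    logits = [s / temperature for s in sims]
    # Subtract the max logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}


def is_schematic(image_embedding, threshold=0.5):
    """Keep a figure only if the schematic-diagram probability clears the threshold."""
    probs = classify(image_embedding, LABEL_EMBEDDINGS)
    return probs["schematic diagram"] >= threshold


if __name__ == "__main__":
    # Hypothetical image embeddings: one close to the "schematic diagram"
    # prompt, one close to the "data plot" prompt.
    print(is_schematic([0.9, 0.1, 0.05]))
    print(is_schematic([0.1, 0.9, 0.1]))
```

In a real pipeline, the image and label vectors would come from a CLIP image encoder and text encoder respectively; the thresholding logic would stay the same.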