Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models

arXiv cs.AI / 4/27/2026


Key Points

  • The paper proposes “Graph-to-Vision,” a benchmark to evaluate Vision-Language Models’ ability to perform joint reasoning across multiple graphs, an area not well covered by prior single-graph studies.
  • The benchmark spans four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous groupings with tasks that increase in complexity.
  • Evaluation uses a multi-dimensional scoring scheme covering graph parsing quality, reasoning consistency, and instruction-following accuracy, applied to several state-of-the-art VLMs (a toy scoring sketch follows this list).
  • The authors fine-tune multiple open-source VLMs and find consistent gains, suggesting the dataset effectively drives better multi-graph understanding.
  • Overall, the work lays groundwork for advancing cross-modal graph intelligence beyond traditional Graph Neural Networks (GNNs).
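
To make the benchmark structure and scoring scheme above concrete, here is a minimal Python sketch of what a multi-graph benchmark item and a composite score might look like. The class names, fields, and equal dimension weights are illustrative assumptions, not the paper's actual schema or weighting.

```python
from dataclasses import dataclass

# The four graph types covered by the benchmark.
GRAPH_TYPES = {"knowledge_graph", "flowchart", "mind_map", "route_map"}

@dataclass
class MultiGraphItem:
    """Hypothetical benchmark item: several rendered graphs plus a
    question that requires joint reasoning across them."""
    image_paths: list[str]   # one rendered image per graph
    graph_types: list[str]   # e.g. ["flowchart", "route_map"]
    question: str
    reference_answer: str

    @property
    def is_heterogeneous(self) -> bool:
        # A grouping is heterogeneous when it mixes graph types.
        return len(set(self.graph_types)) > 1

def composite_score(parsing: float, consistency: float, instruction: float,
                    weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Aggregate the three per-dimension scores (each in [0, 1]) into a
    single number via a weighted mean. Equal weights are an assumption;
    the paper may combine the dimensions differently."""
    return sum(w * s for w, s in zip(weights, (parsing, consistency, instruction)))

# Usage: a heterogeneous two-graph item scored on all three axes.
item = MultiGraphItem(
    image_paths=["g1.png", "g2.png"],
    graph_types=["flowchart", "route_map"],
    question="Which step in the flowchart corresponds to the detour on the map?",
    reference_answer="Step 3",
)
print(item.is_heterogeneous)           # True
print(composite_score(0.8, 0.6, 1.0))  # 0.8
```

Keeping per-dimension scores separate before aggregation matches the paper's stated goal: a model can parse graphs well yet fail at cross-graph reasoning, and a single scalar would hide that distinction.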

Abstract

Recent advances in Vision-Language Models (VLMs) have shown promising capabilities in interpreting visualized graph data, offering a new perspective for graph-structured reasoning beyond traditional Graph Neural Networks (GNNs). However, existing studies focus primarily on single-graph reasoning, leaving the critical challenge of multi-graph joint reasoning underexplored. In this work, we introduce the first comprehensive benchmark designed to evaluate and enhance the multi-graph reasoning abilities of VLMs. Our benchmark covers four common graph types (knowledge graphs, flowcharts, mind maps, and route maps) and supports both homogeneous and heterogeneous graph groupings with tasks of increasing complexity. We evaluate several state-of-the-art VLMs under a multi-dimensional scoring framework that assesses graph parsing, reasoning consistency, and instruction-following accuracy. Additionally, we fine-tune multiple open-source models and observe consistent improvements, confirming the effectiveness of our dataset. This work provides a principled step toward advancing multi-graph understanding and reveals new opportunities for cross-modal graph intelligence.