Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces PolyChartQA, a benchmark dataset for question answering over multi-chart images, designed to better reflect real-world needs to interpret multiple related charts together.
  • PolyChartQA includes 534 multi-chart images with 2,297 sub-charts and provides 2,694 question–answer pairs drawn from peer-reviewed computer science research publications.
  • The authors evaluate nine state-of-the-art multimodal language models on PolyChartQA, analyzing performance by question type, difficulty, question source, and structural properties of multi-charts.
  • Results indicate a 27.4% drop in LLM-based accuracy on human-authored questions versus model-generated questions, highlighting a gap in robustness to human-style QA.
  • The study also reports a 5.39% accuracy improvement using a proposed prompting method, suggesting practical prompt strategies can enhance multi-chart QA performance.
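The reported gaps above (e.g., the 27.4% drop on human-authored versus model-generated questions) come from slicing benchmark accuracy by question source. A minimal sketch of that kind of aggregation is below; the record fields (`source`, `correct`) are hypothetical, not the paper's actual evaluation schema.

```python
# Hedged sketch: aggregating QA accuracy by question source, in the spirit
# of PolyChartQA's per-source analysis. Field names are assumptions.
from collections import defaultdict

def accuracy_by_source(results):
    """results: list of dicts with hypothetical keys
    'source' ('human' or 'model') and 'correct' (bool)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["source"]] += 1
        hits[r["source"]] += r["correct"]  # bool counts as 0/1
    return {s: hits[s] / totals[s] for s in totals}

# Toy illustration (not real benchmark data):
results = [
    {"source": "human", "correct": True},
    {"source": "human", "correct": False},
    {"source": "model", "correct": True},
    {"source": "model", "correct": True},
]
print(accuracy_by_source(results))  # {'human': 0.5, 'model': 1.0}
```

Comparing the per-source values from such a table is what yields a "drop on human-authored questions" figure.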

Abstract

Charts are widely used to present complex information, and deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Yet question answering over multi-chart images remains under-explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications, along with 2,694 QA pairs. We evaluate nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% drop in LLM-based accuracy (L-Accuracy) on human-authored questions compared to MLM-generated questions, and a 5.39% L-Accuracy gain with our proposed prompting method.