Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

arXiv cs.CL / 4/24/2026


Key Points

  • The paper introduces PolyChartQA, a benchmark dataset for question answering over multi-chart images, designed to better reflect real-world needs to interpret multiple related charts together.
  • PolyChartQA includes 534 multi-chart images with 2,297 sub-charts and provides 2,694 question–answer pairs drawn from peer-reviewed computer science research publications.
  • The authors evaluate nine state-of-the-art multimodal language models on PolyChartQA, analyzing performance by question type, difficulty, question source, and structural properties of multi-charts.
  • Results indicate a 27.4% drop in LLM-based accuracy on human-authored questions versus model-generated questions, highlighting a gap in robustness to human-style QA.
  • The study also reports a 5.39% accuracy improvement using a proposed prompting method, suggesting practical prompt strategies can enhance multi-chart QA performance.
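The reported gaps above (e.g., the 27.4% drop on human-authored versus model-generated questions) come from slicing benchmark accuracy by question source. A minimal sketch of that kind of aggregation is below; the record fields (`source`, `correct`) are hypothetical, not the paper's actual evaluation schema.

```python
# Hedged sketch: aggregating QA accuracy by question source, in the spirit
# of PolyChartQA's per-source analysis. Field names are assumptions.
from collections import defaultdict

def accuracy_by_source(results):
    """results: list of dicts with hypothetical keys
    'source' ('human' or 'model') and 'correct' (bool)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r["source"]] += 1
        hits[r["source"]] += r["correct"]  # bool counts as 0/1
    return {s: hits[s] / totals[s] for s in totals}

# Toy illustration (not real benchmark data):
results = [
    {"source": "human", "correct": True},
    {"source": "human", "correct": False},
    {"source": "model", "correct": True},
    {"source": "model", "correct": True},
]
print(accuracy_by_source(results))  # {'human': 0.5, 'model': 1.0}
```

Comparing the per-source values from such a table is what yields a "drop on human-authored questions" figure.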

Abstract

Charts are widely used to present complex information, and deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Yet question answering over multi-chart images remains under-explored. We introduce PolyChartQA, a mid-scale dataset specifically designed for question answering over multi-chart images. PolyChartQA comprises 534 multi-chart images (with a total of 2,297 sub-charts) sourced from peer-reviewed computer science research publications, along with 2,694 QA pairs. We evaluate nine state-of-the-art Multimodal Language Models (MLMs) on PolyChartQA across question type, difficulty, question source, and key structural characteristics of multi-charts. Our results show a 27.4% drop in LLM-based accuracy (L-Accuracy) on human-authored questions compared to MLM-generated questions, and a 5.39% L-Accuracy gain with our proposed prompting method.