ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

arXiv cs.AI / 4/1/2026


Key Points

  • The paper introduces ChartDiff, a first-of-its-kind large-scale benchmark focused on cross-chart comparative summarization rather than single-chart understanding.
  • ChartDiff includes 8,541 annotated chart pairs spanning varied chart types, data sources, and visual styles, with summaries covering differences in trends, fluctuations, and anomalies.
  • Evaluation across general-purpose, chart-specialized, and pipeline-based vision-language models finds that frontier general-purpose models score highest on GPT-based quality, while specialized and pipeline methods earn higher ROUGE scores yet lower human-aligned ratings, exposing a mismatch between lexical overlap and actual summary quality (see the sketch after this list).
  • The study shows that multi-series chart comparisons remain difficult across all model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries.
  • Overall, the authors conclude that comparative chart reasoning remains a major challenge for current vision-language models and position ChartDiff as a benchmark for advancing research on multi-chart understanding.
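
The ROUGE-versus-quality mismatch is easy to reproduce in miniature. The sketch below uses the open-source rouge-score package and invented summaries rather than actual ChartDiff data: one candidate copies the reference almost verbatim but inverts a key fact, while the other paraphrases the reference correctly. The factually wrong candidate wins on lexical overlap.

```python
# Hedged illustration of why lexical overlap (ROUGE) can disagree with
# judged quality. Requires `pip install rouge-score`; the summaries are
# invented examples, not drawn from the ChartDiff benchmark.
from rouge_score import rouge_scorer

reference = (
    "Chart A shows a steady upward trend in revenue, while "
    "Chart B fluctuates sharply and dips in Q3."
)
candidates = {
    # Near-verbatim copy with one inverted fact: high overlap, wrong content.
    "lexical_but_wrong": (
        "Chart A shows a steady upward trend in revenue, while "
        "Chart B fluctuates sharply and peaks in Q3."
    ),
    # Faithful paraphrase: low overlap, correct content.
    "paraphrase_correct": (
        "Revenue rises consistently in the first chart; the second "
        "is volatile, with a notable third-quarter drop."
    ),
}

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
for name, candidate in candidates.items():
    scores = scorer.score(reference, candidate)
    print(name, {k: round(v.fmeasure, 3) for k, v in scores.items()})
# The near-verbatim but factually inverted candidate scores far higher
# on both ROUGE variants, mirroring the mismatch the paper reports.
```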

Abstract

Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
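
For context on the "GPT-based quality" scores, a common setup is an LLM-as-judge rubric. The sketch below assumes an OpenAI-style chat API and an invented one-question rubric; the paper's actual judge model, prompt, and scoring scale are not specified here.

```python
# Hedged sketch of an LLM-as-judge evaluation in the spirit of the
# "GPT-based quality" metric mentioned above. The model name, rubric,
# and prompt wording are illustrative assumptions, not the paper's
# actual protocol. Requires `pip install openai` and OPENAI_API_KEY.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_summary(reference: str, candidate: str, model: str = "gpt-4o") -> str:
    """Ask a chat model to grade a candidate comparative-chart summary."""
    prompt = (
        "You are grading a summary of the differences between two charts.\n\n"
        f"Reference summary:\n{reference}\n\n"
        f"Candidate summary:\n{candidate}\n\n"
        "Rate the candidate from 1 (poor) to 5 (excellent) for factual "
        "faithfulness and coverage of trends, fluctuations, and anomalies, "
        "then justify the score in one sentence."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Unlike ROUGE, a judge of this kind can reward a correct paraphrase and penalize a near-verbatim summary that inverts a trend, which is consistent with the divergence between lexical and human-aligned scores reported above.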