RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

arXiv cs.CL / 3/30/2026

📰 NewsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper introduces RealChart2Code, a new large-scale benchmark (2,800+ instances) for evaluating vision-language model chart-to-code generation using authentic real-world datasets with analytical intent.
It emphasizes two challenging settings that prior benchmarks often miss: generating charts from large-scale raw data and improving code through iterative multi-turn conversations.
An evaluation of 14 leading VLMs shows substantial performance drops versus simpler benchmarks, indicating difficulty with complex plot structures and faithful replication from real data.
The authors find a notable performance gap between proprietary models and open-weight models, and report that even state-of-the-art systems frequently fail on intricate multi-panel chart replication.
The benchmark and associated code are released publicly to support follow-on research into chart generation, grounding, and multi-step code refinement.

Abstract

Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce \textbf{\texttt{RealChart2Code}}, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on \texttt{RealChart2Code} reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at \url{https://github.com/Speakn0w/RealChart2Code}.