DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

arXiv cs.CL / 4/29/2026


Key Points

The article introduces DV-World, a new benchmark with 260 tasks aimed at evaluating data visualization (DV) agents under real-world professional conditions rather than overly constrained lab setups.
DV-World covers three areas: native spreadsheet/dashboard manipulation (DV-Sheet), adapting visual artifacts to new data and programming paradigms (DV-Evolution), and proactive intent alignment using a user simulator (DV-Interact).
It addresses limitations of prior benchmarks by avoiding code-sandbox confinement, supporting more realistic multi-step workflows, and challenging agents with ambiguous requirements instead of assuming perfect intent.
The proposed hybrid evaluation combines numerical precision via Table-value Alignment and semantic/visual judgment via MLLM-as-a-Judge with rubrics.
Initial experiments show state-of-the-art models score below 50% overall, highlighting major gaps in real-world DV capabilities and motivating future enterprise-ready development.

Abstract

Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and the assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet, for native spreadsheet manipulation, including chart and dashboard creation as well as diagnostic repair; DV-Evolution, for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms; and DV-Interact, for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at https://github.com/DA-Open/DV-World.
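
The abstract names Table-value Alignment as the numerical half of the hybrid evaluation but does not define it. As a rough illustration only, the sketch below assumes the check compares the values plotted by an agent against a ground-truth table, cell by cell, within a numeric tolerance. The function name `table_value_alignment`, the (series, category) keying, and the tolerance defaults are all hypothetical and are not taken from the DV-World code.

```python
from math import isclose


def table_value_alignment(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Return the fraction of expected cells that the chart data reproduces.

    Hypothetical metric: `expected` and `actual` map (series, category)
    keys to numbers. A cell counts as matched when the key is present in
    `actual` and the two values agree within the given tolerances.
    """
    if not expected:
        return 1.0  # nothing to check, so treat as fully aligned
    matched = 0
    for key, want in expected.items():
        got = actual.get(key)
        if got is not None and isclose(got, want, rel_tol=rel_tol, abs_tol=abs_tol):
            matched += 1
    return matched / len(expected)


# Illustrative data (invented for this sketch, not from the benchmark).
expected = {
    ("Revenue", "Q1"): 120.0,
    ("Revenue", "Q2"): 135.5,
    ("Cost", "Q1"): 80.0,
    ("Cost", "Q2"): 90.0,
}
actual = {
    ("Revenue", "Q1"): 120.0,
    ("Revenue", "Q2"): 135.5,
    ("Cost", "Q1"): 80.0,
    ("Cost", "Q2"): 95.0,  # one plotted value is wrong
}

score = table_value_alignment(expected, actual)
print(score)  # 0.75
```

A value-level check of this kind can catch charts that look plausible but plot the wrong numbers, which is presumably why the benchmark pairs it with a rubric-based MLLM judge for the semantic and visual side.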



