Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

arXiv cs.CL / 4/24/2026


Key Points

  • Vision-language models (VLMs) often misread chart values and hallucinate details because they interpret charts from pixels alone, cutting agents off from the chart’s underlying structured specification.
  • The paper proposes Introspective and Interactive Visual Grounding (IVG), combining spec-grounded introspection (querying deterministic evidence from the specification) with view-grounded interaction (adjusting the chart view to disambiguate visuals).
  • It introduces iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications, designed to reduce evaluation bias from the VLM itself.
  • Experiments show introspection improves data reconstruction fidelity, and that pairing it with interaction yields the best question-answering accuracy (0.81), including a +6.7% gain on overlapping geometries.
  • The authors also demonstrate IVG in deployed visualization agents that autonomously explore data and collaborate with human users in real time.
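To make the first mechanism concrete: a Plotly figure is ultimately a JSON specification, so an agent can answer value questions deterministically by reading the spec rather than estimating from rendered pixels. The sketch below is illustrative only (the spec contents, the `introspect_max` helper, and the question it answers are invented for the example, not taken from the paper's implementation):

```python
import json

# A minimal Plotly-style figure spec (assumed shape: a JSON document
# with a "data" list of traces). Exact values live here, not in pixels.
spec_json = """
{"data": [{"type": "bar",
           "x": ["A", "B", "C"],
           "y": [3.14, 3.17, 2.98]}],
 "layout": {"title": {"text": "Hypothetical readings"}}}
"""

def introspect_max(spec: dict) -> tuple:
    """Answer 'which bar is tallest?' from the spec, deterministically."""
    trace = spec["data"][0]
    x, y = trace["x"], trace["y"]
    i = max(range(len(y)), key=y.__getitem__)  # index of the largest y
    return (x[i], y[i])

spec = json.loads(spec_json)
print(introspect_max(spec))  # ('B', 3.17)
```

The bars at 3.14 and 3.17 are visually near-identical, which is exactly the case where pixel-only reading fails and a spec query does not.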

Abstract

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
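The second mechanism, view-grounded interaction, can be pictured as restricting the visible axis window (a zoom) so that overlapping marks become separable before the agent looks again. This is a minimal stdlib sketch, not the paper's implementation; the `zoom` helper, the trace layout, and the sample values are all assumptions made for illustration:

```python
# View-grounded interaction, sketched as a pure function: restrict the
# x-axis range (a "zoom") so crowded points become distinguishable.
# The trace dict mirrors the shape of a Plotly scatter spec.

def zoom(trace: dict, x_min: float, x_max: float) -> dict:
    """Return only the points that fall inside the zoomed x-window."""
    visible = [(x, y) for x, y in zip(trace["x"], trace["y"])
               if x_min <= x <= x_max]
    return {"x": [x for x, _ in visible], "y": [y for _, y in visible]}

# Three points crowd together near x = 10; zooming isolates that region
# so each point can be inspected individually.
trace = {"x": [1, 5, 9.9, 10.0, 10.1, 15],
         "y": [2, 4, 7.1, 7.0, 7.2, 3]}
print(zoom(trace, 9.5, 10.5))
# {'x': [9.9, 10.0, 10.1], 'y': [7.1, 7.0, 7.2]}
```

In an actual interactive chart the same effect would come from updating the view state (e.g. an axis-range change) and re-rendering, rather than filtering data by hand; the point is that the agent changes what is visible instead of guessing at overlapping pixels.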