Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

arXiv cs.CL / 4/24/2026


Key Points

  • Vision-language models (VLMs) often misread chart values and hallucinate details because they interpret charts from pixels alone, cutting agents off from the chart’s underlying structured specification.
  • The paper proposes Introspective and Interactive Visual Grounding (IVG), combining spec-grounded introspection (querying deterministic evidence from the specification) with view-grounded interaction (adjusting the chart view to disambiguate visuals).
  • It introduces iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications, designed to reduce evaluation bias from the VLM itself.
  • Experiments show introspection improves data reconstruction fidelity, and that pairing it with interaction yields the best question-answering accuracy (0.81), including a +6.7% gain on overlapping geometries.
  • The authors also demonstrate IVG in deployed visualization agents that autonomously explore data and collaborate with human users in real time.
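To make the first mechanism concrete: a Plotly figure is ultimately a JSON specification, so an agent can answer value questions deterministically by reading the spec rather than estimating from rendered pixels. The sketch below is illustrative only (the spec contents, the `introspect_max` helper, and the question it answers are invented for the example, not taken from the paper's implementation):

```python
import json

# A minimal Plotly-style figure spec (assumed shape: a JSON document
# with a "data" list of traces). Exact values live here, not in pixels.
spec_json = """
{"data": [{"type": "bar",
           "x": ["A", "B", "C"],
           "y": [3.14, 3.17, 2.98]}],
 "layout": {"title": {"text": "Hypothetical readings"}}}
"""

def introspect_max(spec: dict) -> tuple:
    """Answer 'which bar is tallest?' from the spec, deterministically."""
    trace = spec["data"][0]
    x, y = trace["x"], trace["y"]
    i = max(range(len(y)), key=y.__getitem__)  # index of the largest y
    return (x[i], y[i])

spec = json.loads(spec_json)
print(introspect_max(spec))  # ('B', 3.17)
```

The bars at 3.14 and 3.17 are visually near-identical, which is exactly the case where pixel-only reading fails and a spec query does not.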

Abstract

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive charts as static images, losing access to the structured specification that encodes exact values. We introduce Introspective and Interactive Visual Grounding (IVG), a framework that combines (1) spec-grounded introspection, which queries the underlying specification for deterministic evidence, with (2) view-grounded interaction, which manipulates the view to resolve visual ambiguity. To enable evaluation without VLM bias, we present iPlotBench, a benchmark of 500 interactive Plotly figures with 6,706 binary questions and ground-truth specifications. Experiments show that introspection improves data reconstruction fidelity, while the combination with interaction achieves the highest QA accuracy (0.81), with +6.7% gains on overlapping geometries. We further demonstrate IVG in deployed agents that explore data autonomously and collaborate with human users in real time.
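The second mechanism, view-grounded interaction, can be pictured as restricting the visible axis window (a zoom) so that overlapping marks become separable before the agent looks again. This is a minimal stdlib sketch, not the paper's implementation; the `zoom` helper, the trace layout, and the sample values are all assumptions made for illustration:

```python
# View-grounded interaction, sketched as a pure function: restrict the
# x-axis range (a "zoom") so crowded points become distinguishable.
# The trace dict mirrors the shape of a Plotly scatter spec.

def zoom(trace: dict, x_min: float, x_max: float) -> dict:
    """Return only the points that fall inside the zoomed x-window."""
    visible = [(x, y) for x, y in zip(trace["x"], trace["y"])
               if x_min <= x <= x_max]
    return {"x": [x for x, _ in visible], "y": [y for _, y in visible]}

# Three points crowd together near x = 10; zooming isolates that region
# so each point can be inspected individually.
trace = {"x": [1, 5, 9.9, 10.0, 10.1, 15],
         "y": [2, 4, 7.1, 7.0, 7.2, 3]}
print(zoom(trace, 9.5, 10.5))
# {'x': [9.9, 10.0, 10.1], 'y': [7.1, 7.0, 7.2]}
```

In an actual interactive chart the same effect would come from updating the view state (e.g. an axis-range change) and re-rendering, rather than filtering data by hand; the point is that the agent changes what is visible instead of guessing at overlapping pixels.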