VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

arXiv cs.CV · May 5, 2026


Key Points

  • Existing VLM benchmarks often test spatio-temporal understanding on overly simple, single-action videos with closed vocabularies, missing the open-ended, multi-entity, multi-action interactions found in real-world video understanding.
  • The paper introduces VISTA, a new interaction-aware benchmark that decomposes videos into entities, their actions, and relational dynamics to enable diagnostics across multiple spatio-temporal axes.
  • VISTA aggregates multiple datasets into a unified interaction-aware taxonomy and provides about 12K curated video-query pairs covering diverse scenes and complexities (a sketch of such an annotated pair follows this list).
  • The authors evaluate 11 state-of-the-art VLMs on VISTA and show how taxonomy-based analysis can expose spatio-temporal biases and failure modes that traditional aggregate metrics can hide.
  • By offering detailed, taxonomy-driven diagnostics, VISTA aims to guide improvements in model design, pretraining strategies, and evaluation protocols for video-language spatio-temporal reasoning.
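
To make the interaction-aware decomposition concrete, below is a minimal sketch of what one annotated video-query pair might look like. Every class and field name here is a hypothetical illustration of the idea, not VISTA's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# All names below are hypothetical; the paper's real annotation
# schema is not disclosed in this summary.

@dataclass
class Interaction:
    subject: str            # acting entity, e.g. "person_1"
    action: str             # open-vocabulary action, e.g. "hands over"
    object: Optional[str]   # entity acted upon, e.g. "dog", if any
    relation: Optional[str] # spatial/temporal relation, e.g. "behind", "after"

@dataclass
class VideoQueryPair:
    video_id: str
    query: str              # free-form question about the clip
    answer: str
    entities: list = field(default_factory=list)      # all entities in the clip
    interactions: list = field(default_factory=list)  # list of Interaction
    taxonomy_tags: dict = field(default_factory=dict) # e.g. {"axis": "temporal"}
```

Decomposing a clip this way is what allows accuracy to be sliced along relational, spatial, and temporal axes instead of being reported as one aggregate number.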

Abstract

Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets, and restricted entity types, failing to capture the free-form, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity, and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and a unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.
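
The diagnostic claim at the heart of the abstract, that aggregate metrics obscure axis-specific biases, amounts to grouping per-question correctness by taxonomy axis rather than averaging over the whole benchmark. A minimal sketch, assuming an illustrative per-question results format (the key names are not VISTA's actual output):

```python
from collections import defaultdict

def accuracy_by_axis(results):
    """Group per-question correctness by taxonomy axis.

    `results` is a list of dicts like {"axis": "temporal", "correct": True};
    the key names are illustrative, not VISTA's actual output format.
    """
    totals = defaultdict(lambda: [0, 0])  # axis -> [num_correct, num_total]
    for r in results:
        bucket = totals[r["axis"]]
        bucket[0] += int(r["correct"])
        bucket[1] += 1
    return {axis: correct / total for axis, (correct, total) in totals.items()}

# A model with 70% aggregate accuracy can look very different per axis:
results = (
    [{"axis": "spatial", "correct": True}] * 45
    + [{"axis": "spatial", "correct": False}] * 5
    + [{"axis": "temporal", "correct": True}] * 25
    + [{"axis": "temporal", "correct": False}] * 25
)
print(accuracy_by_axis(results))  # {'spatial': 0.9, 'temporal': 0.5}
```

In this toy example the aggregate accuracy is 70%, yet the per-axis breakdown shows strong spatial performance masking weak temporal reasoning, exactly the kind of bias a single headline number hides.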