AI Navigate

Seeking Universal Shot Language Understanding Solutions

arXiv cs.LG · March 20, 2026

Key Points

  • The paper introduces SLU-SUITE, a large-scale training and evaluation suite with 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions.
  • It analyzes the limitations of VLM-based shot language understanding (SLU) from both the model and data perspectives, motivating two universal SLU solutions: UniShot and AgentShots.
  • UniShot trains a single generalist model via dynamic-balanced data mixing, while AgentShots routes prompts to a cluster of dimension experts to maximize peak performance on each dimension (minimal sketches of both mechanisms appear below).
  • Experiments show the proposed models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by 22% on out-of-domain tasks.
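
The summary does not spell out the dynamic-balanced mixing rule, so the Python sketch below is an illustrative assumption rather than the authors' recipe: per-task sampling weights are periodically re-balanced toward tasks with lower recent accuracy, so lagging dimensions receive more training data. The class name, task names, and the inverse-accuracy heuristic are all hypothetical.

```python
import random

# Hypothetical sketch of dynamic-balanced data mixing: we assume sampling
# weights re-balanced inversely to each task's recent accuracy, so that
# lagging dimensions are upweighted in subsequent training batches.
class DynamicBalancedMixer:
    def __init__(self, task_datasets):
        self.task_datasets = task_datasets              # {task: list of QA pairs}
        self.weights = {t: 1.0 for t in task_datasets}  # start from a uniform mix

    def rebalance(self, task_accuracies):
        # Lower recent accuracy -> higher sampling weight for that task.
        for task, acc in task_accuracies.items():
            self.weights[task] = max(1.0 - acc, 0.05)   # floor keeps every task alive

    def sample_batch(self, batch_size):
        tasks = list(self.weights)
        picks = random.choices(tasks, weights=[self.weights[t] for t in tasks],
                               k=batch_size)
        return [random.choice(self.task_datasets[t]) for t in picks]

# Usage: re-balance after each evaluation pass, then draw the next batch.
mixer = DynamicBalancedMixer({"shot_size": ["qa_1", "qa_2"],
                              "camera_movement": ["qa_3", "qa_4"]})
mixer.rebalance({"shot_size": 0.82, "camera_movement": 0.55})  # movement lags
batch = mixer.sample_batch(4)
```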

Abstract

Shot language understanding (SLU) is crucial for cinematic analysis but remains challenging due to its diverse cinematographic dimensions and the subjectivity of expert judgment. While vision-language models (VLMs) have shown strong capability in general visual understanding, recent studies reveal judgment discrepancies between VLMs and film experts on SLU tasks. To address this gap, we introduce SLU-SUITE, a comprehensive training and evaluation suite containing 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. Using SLU-SUITE, we derive two new insights into VLM-based SLU: from the model side, we diagnose which modules form the key bottlenecks; from the data side, we quantify cross-dimensional influences among tasks. These findings motivate two complementary universal SLU solutions: UniShot, a balanced one-for-all generalist trained via dynamic-balanced data mixing, and AgentShots, a prompt-routed expert cluster that maximizes peak performance on each dimension. Extensive experiments show that our models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by 22% on out-of-domain tasks.
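
The abstract says only that AgentShots routes prompts to an expert cluster, so the sketch below fills in one plausible routing rule as an assumption: keyword matching over the six film-grounded dimensions. The dimension names and keyword lists are hypothetical (the summary does not name the dimensions), and a learned router could replace the keyword rule without changing the interface.

```python
# Hypothetical sketch of prompt routing in the AgentShots spirit. The keyword
# rule and the six dimension names below are illustrative assumptions, not
# the authors' published method.
DIMENSION_KEYWORDS = {
    "shot_size":       ["close-up", "wide shot", "medium shot"],
    "camera_angle":    ["angle", "high-angle", "low-angle", "eye-level"],
    "camera_movement": ["pan", "tilt", "dolly", "zoom", "tracking"],
    "lighting":        ["lighting", "key light", "backlight", "silhouette"],
    "composition":     ["framing", "rule of thirds", "symmetry"],
    "color":           ["color", "palette", "saturation"],
}

def route(question, experts, fallback):
    """Send the question to the first dimension expert whose keywords match."""
    q = question.lower()
    for dim, keywords in DIMENSION_KEYWORDS.items():
        if any(k in q for k in keywords):
            return experts[dim](question)
    return fallback(question)  # no dimension matched: fall back to a generalist

# Usage with stub callables standing in for fine-tuned VLM checkpoints.
experts = {dim: (lambda d: lambda q: f"[{d} expert] answer")(dim)
           for dim in DIMENSION_KEYWORDS}
print(route("Is this a low-angle or eye-level shot?", experts,
            fallback=lambda q: "[generalist] answer"))
```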