SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

arXiv cs.CV / 4/29/2026


Key Points

  • The paper introduces SIV-Bench, a new video benchmark designed to evaluate the social interaction abilities of multimodal large language models (MLLMs) end-to-end across social scene understanding (SSU), social state reasoning (SSR), and social dynamics prediction (SDP); a schematic evaluation loop is sketched after this list.
  • The benchmark includes 2,792 video clips and 5,455 human–LLM collaboratively generated question–answer pairs, spanning varied relationship types, video lengths, genres, presentation styles, and linguistic/cultural contexts.
  • Experiments on leading MLLMs show they perform comparatively well on SSU but remain notably weak on SSR and SDP.
  • The authors identify systematic confusion in relation inference as a key bottleneck, and attribute failures more broadly to misalignment with human thought and insufficient reasoning depth.
  • They also find that audio and subtitles improve performance on the reasoning-intensive SSR and SDP tasks, and they release the dataset and code for future research.

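To make the three-way task split concrete, here is a minimal sketch of a per-task scoring loop over SIV-Bench-style question-answer pairs. The file name, record fields ("task", "video", "question", "options", "answer"), and the multiple-choice format are assumptions for illustration only; the dataset and code released at the project website define the actual schema.

```python
import json
from collections import defaultdict

def evaluate(qa_path, predict):
    """Score per-task accuracy.

    `predict` is any callable mapping (video, question, options) to the
    letter of the chosen option, e.g. a thin wrapper around an MLLM.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(qa_path, encoding="utf-8") as f:
        for item in json.load(f):           # one record per QA pair (assumed layout)
            task = item["task"]             # e.g. "SSU", "SSR", or "SDP"
            pred = predict(item["video"], item["question"], item["options"])
            total[task] += 1
            correct[task] += int(pred == item["answer"])
    return {t: correct[t] / total[t] for t in total}

# Example: a trivial baseline that always answers "A".
if __name__ == "__main__":
    print(evaluate("sivbench_qa.json", lambda video, question, options: "A"))
```
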
Abstract

Understanding social interaction, which encompasses perceiving numerous subtle multimodal cues, inferring unobservable mental states and relations, and dynamically predicting others' behavior, is foundational to human-machine interaction. Despite rapid advances in Multimodal Large Language Models (MLLMs), the rich and multifaceted nature of social interaction has hindered the development of benchmarks that holistically evaluate and guide their social interaction abilities. Based on social relation theory, which is widely regarded as a foundational framework for understanding social behavior, we present SIV-Bench, a novel video benchmark for systematically evaluating MLLMs' capabilities across Social Scene Understanding (SSU), Social State Reasoning (SSR), and Social Dynamics Prediction (SDP). SIV-Bench features 2,792 originally collected video clips and 5,455 meticulously generated question-answer pairs derived from a human-LLM collaborative pipeline. It covers 14 typical relationships and diverse video lengths, genres, presentation styles, and linguistic and cultural backgrounds. Our comprehensive experiments show that leading MLLMs perform relatively well on SSU but remain weak on SSR and SDP, with systematic confusion in relation inference as a key bottleneck. An in-depth analysis of the reasoning process attributes MLLMs' suboptimal performance to misalignment with human thought and insufficient reasoning depth. Moreover, we find that audio and subtitles aid the reasoning-intensive SSR and SDP tasks. Together, SIV-Bench offers a unified testbed to measure progress, expose limitations, and guide future research toward more socially intelligent MLLMs. We release the dataset and code at our project website: https://kfq20.github.io/sivbench.
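
The abstract's "human-LLM collaborative pipeline" suggests a draft-then-review workflow for question generation. The sketch below is a guess at that general shape, not the authors' actual pipeline: an LLM drafts a candidate question from a clip description and a human reviewer gates it. All names, prompts, and formats here are assumptions.

```python
from typing import Callable, Optional

def build_qa(clip_desc: str,
             llm: Callable[[str], str],
             human_ok: Callable[[str], bool]) -> Optional[str]:
    """Return an LLM-drafted QA string that a human reviewer approved, else None."""
    draft = llm(
        "Write one multiple-choice question (options A-D plus the correct "
        f"letter) about the social interaction in this clip:\n{clip_desc}"
    )
    return draft if human_ok(draft) else None

# Example with stand-in callables: a canned "LLM" and an auto-approving reviewer.
if __name__ == "__main__":
    fake_llm = lambda prompt: "Q: Who apologizes first? A) the host ... Answer: A"
    print(build_qa("Two colleagues argue, then reconcile over coffee.",
                   fake_llm, human_ok=lambda draft: "Answer:" in draft))
```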