SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
arXiv cs.AI / 3/18/2026
Key Points
- The paper introduces SocialOmni, a new benchmark to evaluate social interactivity in omni-modal models across speaker identification, interruption timing, and natural interruption generation.
- It comprises 2,000 perception samples and a 209-instance diagnostic set with strict temporal and contextual constraints, plus controlled audio-visual inconsistency scenarios to test robustness.
- Evaluations of 12 leading omni-modal LLMs reveal substantial variance in social-interaction capabilities and a decoupling between perceptual accuracy and interruption quality.
- The results indicate that understanding-centric metrics alone cannot characterize conversational social competence, highlighting the need to bridge perception and interaction in future omni-modal language models (OLMs).
- The diagnostics from SocialOmni offer actionable signals to guide research and development toward tighter integration of perception and interaction in omni-modal models.
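The decoupling between perceptual accuracy and interruption quality can be made concrete by scoring the two abilities separately. The sketch below is illustrative only and is not the paper's actual evaluation protocol: the field names, the timing-tolerance window, and the toy data are all assumptions introduced here to show how a model can score well on speaker identification while timing its interruptions poorly.

```python
# Illustrative sketch (NOT SocialOmni's actual protocol): score speaker
# identification and interruption timing as separate metrics, so the two
# can diverge. Field names and the tolerance window are assumptions.

def perception_accuracy(samples):
    """Fraction of samples where the predicted speaker matches the gold label."""
    correct = sum(1 for s in samples if s["pred_speaker"] == s["gold_speaker"])
    return correct / len(samples)

def interruption_timing_score(samples, tolerance_s=0.5):
    """Fraction of predicted interruption onsets within tolerance of gold."""
    hits = sum(
        1 for s in samples
        if abs(s["pred_interrupt_t"] - s["gold_interrupt_t"]) <= tolerance_s
    )
    return hits / len(samples)

# Toy data: speaker labels are mostly right, but two interruptions land
# far from the gold onset time.
samples = [
    {"pred_speaker": "A", "gold_speaker": "A",
     "pred_interrupt_t": 3.1, "gold_interrupt_t": 3.0},
    {"pred_speaker": "B", "gold_speaker": "B",
     "pred_interrupt_t": 7.9, "gold_interrupt_t": 6.0},
    {"pred_speaker": "A", "gold_speaker": "A",
     "pred_interrupt_t": 12.4, "gold_interrupt_t": 10.0},
    {"pred_speaker": "B", "gold_speaker": "A",
     "pred_interrupt_t": 15.1, "gold_interrupt_t": 15.0},
]

print(perception_accuracy(samples))        # → 0.75
print(interruption_timing_score(samples))  # → 0.5
```

Reporting the two numbers side by side, rather than averaging them into one score, is what surfaces the perception-interaction gap the paper emphasizes.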