SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models
arXiv cs.AI · March 18, 2026
Key Points
- The paper introduces SocialOmni, a new benchmark for evaluating social interactivity in omni-modal models across three tasks: speaker identification, interruption timing, and natural interruption generation.
- It comprises 2,000 perception samples and a 209-instance diagnostic set with strict temporal and contextual constraints, plus controlled audio-visual inconsistency scenarios to test robustness (a hypothetical sample schema and scoring loop are sketched after this list).
- Evaluations of 12 leading omni-modal LLMs reveal substantial variance in social-interaction capability and show that perceptual accuracy does not translate directly into interruption quality.
- The results indicate that understanding-centric metrics alone cannot characterize conversational social competence, underscoring the need to bridge perception and interaction in future omni-modal language models (OLMs).
- SocialOmni's diagnostics offer actionable signals for building OLMs that more tightly integrate perception and interaction.
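The summary does not specify SocialOmni's data format, but a benchmark with per-task gold labels is naturally scored with a per-task accuracy loop. The sketch below is a minimal illustration under assumptions: the `SocialOmniSample` schema (its `task`, `context`, and `gold` fields) and the string-match scorer are invented here, not taken from the paper, whose actual evaluation may use different fields and metrics (e.g., human or LLM judging for generated interruptions).

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class SocialOmniSample:
    """One benchmark instance (hypothetical schema, for illustration only)."""
    clip_id: str   # audio-visual clip identifier
    task: str      # "speaker_id", "interrupt_timing", or "interrupt_gen"
    context: str   # dialogue transcript preceding the decision point
    gold: str      # reference answer (speaker name, timing bucket, etc.)

def evaluate(samples, model_answer):
    """Compute per-task accuracy; model_answer maps a sample to a predicted string."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s.task] += 1
        # Naive scoring: case-insensitive exact match against the gold label.
        if model_answer(s).strip().lower() == s.gold.strip().lower():
            correct[s.task] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy usage with a trivial constant baseline.
if __name__ == "__main__":
    data = [
        SocialOmniSample("clip_001", "speaker_id", "A: hi. B: hello.", "B"),
        SocialOmniSample("clip_002", "interrupt_timing", "A: so, as I was...", "after_pause"),
    ]
    scores = evaluate(data, model_answer=lambda s: "B")
    print(scores)  # e.g. {'speaker_id': 1.0, 'interrupt_timing': 0.0}
```

Exact string match is only a stand-in: timing and generation tasks would realistically need tolerance windows or judge models, which is exactly where the gap between perceptual accuracy and interruption quality reported above would surface.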