RefereeBench: Are Video MLLMs Ready to Be Multi-Sport Referees?
arXiv cs.CL / 4/20/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces RefereeBench, a large-scale, human-annotated benchmark to evaluate whether multimodal LLMs can act as automatic sports referees across 11 sports using 925 curated videos and 6,475 QA pairs.
- It assesses five key officiating abilities—foul existence, classification, reasoning, entity perception, and temporal grounding—to test rule-grounded, multimodal decision-making rather than generic video understanding.
- Evaluations of leading models (including Doubao-Seed-1.8 and Gemini-3-Pro) show only about 60% accuracy, and the best open-source result (Qwen3-VL) reaches only about 47%, indicating limited reliability.
- Analysis finds models are better at detecting incidents and entities, but they commonly fail on rule application and temporal grounding and often over-call fouls on normal clips.
- The benchmark is positioned as evidence that future MLLMs must better integrate domain knowledge with multimodal understanding to enable trustworthy AI-assisted officiating and broader multimodal decision-making.
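The paper reports per-ability results (strong on incident and entity detection, weak on rule application and temporal grounding), which implies accuracy is broken down by officiating ability. The sketch below shows one way such a breakdown could be computed; the field names (`ability`, `gold`, `pred`) and records are illustrative assumptions, not the benchmark's actual data schema.

```python
from collections import defaultdict

# Hypothetical QA-pair records; RefereeBench's real schema is not shown in
# the summary, so these fields and values are invented for illustration.
qa_pairs = [
    {"ability": "foul_existence", "gold": "yes", "pred": "yes"},
    {"ability": "foul_existence", "gold": "no", "pred": "yes"},  # over-called foul
    {"ability": "temporal_grounding", "gold": "00:12-00:15", "pred": "00:30-00:33"},
    {"ability": "entity_perception", "gold": "player 7", "pred": "player 7"},
]

def per_ability_accuracy(pairs):
    """Exact-match accuracy, grouped by officiating ability."""
    correct, total = defaultdict(int), defaultdict(int)
    for p in pairs:
        total[p["ability"]] += 1
        if p["pred"] == p["gold"]:
            correct[p["ability"]] += 1
    return {ability: correct[ability] / total[ability] for ability in total}

print(per_ability_accuracy(qa_pairs))
```

On this toy data the breakdown would surface exactly the failure modes the paper describes, e.g. a false-positive foul call dragging down `foul_existence` while `entity_perception` stays perfect.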