EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

arXiv cs.CV · April 15, 2026

Key Points

  • The paper introduces EgoEsportsQA, a new egocentric video question-answering benchmark designed to test perception and rule-bound reasoning in high-velocity, information-dense esports video settings.
  • EgoEsportsQA contains 1,745 QA pairs curated from professional first-person shooter matches using a scalable six-stage pipeline, and questions are organized with a two-dimensional taxonomy spanning cognitive sub-tasks and esports-knowledge sub-tasks (a hypothetical item schema is sketched after this list).
  • Evaluations of state-of-the-art Video-LLMs show limited performance, with the best reported accuracy reaching only 71.58%, highlighting substantial weaknesses in fine-grained tactical reasoning.
  • Analysis indicates models are stronger at basic visual perception than at deeper tactical reasoning, and they do better on macro-progression than on micro-operations.
  • Ablation and further investigation suggest the dataset can both reveal architectural limitations of current Video-LLMs and provide guidance for improving downstream esports-focused applications.
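
As a concrete illustration of the two-dimensional tagging mentioned above, here is a minimal Python sketch of what a single benchmark item might look like. The QAItem schema, its field names, the multiple-choice format, and the placeholder sub-task identifiers are all assumptions for illustration; only the counts (11 cognitive sub-tasks and 6 esports-knowledge sub-tasks) come from the paper.

```python
from dataclasses import dataclass

# Placeholder sub-task identifiers: the abstract gives only the counts
# (11 cognitive sub-tasks, 6 esports-knowledge sub-tasks), not their names.
COGNITIVE_SUBTASKS = {f"cognitive_{i}" for i in range(1, 12)}  # 11 sub-tasks
KNOWLEDGE_SUBTASKS = {f"knowledge_{i}" for i in range(1, 7)}   # 6 sub-tasks

@dataclass
class QAItem:
    """One QA pair tagged along both taxonomy axes (hypothetical schema)."""
    video_id: str            # egocentric match clip the question is grounded in
    question: str
    options: list[str]       # candidate answers (multiple-choice format assumed)
    answer: str
    cognitive_subtask: str   # one of the 11 cognitive-capability sub-tasks
    knowledge_subtask: str   # one of the 6 esports-knowledge sub-tasks

    def __post_init__(self) -> None:
        # Enforce that every item carries a label on both axes.
        assert self.cognitive_subtask in COGNITIVE_SUBTASKS
        assert self.knowledge_subtask in KNOWLEDGE_SUBTASKS
```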

Abstract

While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58% accuracy. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.
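
To make the per-axis breakdown concrete, here is a minimal sketch of how accuracy could be aggregated along one taxonomy dimension at a time, reusing the hypothetical QAItem sketched earlier. This illustrates the evaluation idea only; it is not the authors' released scoring code, and the accuracy_by_axis helper is an assumed name.

```python
from collections import defaultdict

def accuracy_by_axis(items, predictions, axis="cognitive_subtask"):
    """Per-sub-task accuracy along one taxonomy axis (hypothetical helper).

    items:       sequence of objects carrying `answer` plus the axis attribute
                 (e.g. the hypothetical QAItem sketched earlier).
    predictions: dict mapping item index -> the model's chosen answer string.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for i, item in enumerate(items):
        key = getattr(item, axis)  # e.g. "cognitive_3" or "knowledge_1"
        total[key] += 1
        if predictions.get(i) == item.answer:
            correct[key] += 1
    return {k: correct[k] / total[k] for k in sorted(total)}
```

Under this view, the paper's headline 71.58% would correspond to the best model's overall accuracy across all 1,745 items, while the reported perception-versus-reasoning and macro-versus-micro gaps would show up as spread across these per-sub-task scores.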