
TennisExpert: Towards Expert-Level Analytical Sports Video Understanding

arXiv cs.CV / 3/17/2026


Key Points

  • TennisVL is introduced as a large-scale benchmark for tennis understanding with 200+ matches (471.9 hours) and 40,000+ rally clips, emphasizing expert analytical commentary rather than descriptive narration.
  • TennisExpert is proposed as a multimodal framework that integrates a video semantic parser, which extracts key match elements such as scores, shot sequences, ball bounces, and player locations, with a memory-augmented model built on Qwen3-VL-8B.
  • Hierarchical memory modules capture both short- and long-term temporal context. In the reported experiments, TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and shows improved ability to capture tactical context and match dynamics.
  • Target applications include professional analysis, automated coaching, and real-time commentary, which motivates a system that is both accurate and efficient enough for real-time deployment.

Abstract

Tennis is one of the most widely followed sports, generating extensive broadcast footage with strong potential for professional analysis, automated coaching, and real-time commentary. However, automatic tennis understanding remains underexplored due to two key challenges: (1) the lack of large-scale benchmarks with fine-grained annotations and expert-level commentary, and (2) the difficulty of building accurate yet efficient multimodal systems suitable for real-time deployment. To address these challenges, we introduce TennisVL, a large-scale tennis benchmark comprising over 200 professional matches (471.9 hours) and 40,000+ rally-level clips. Unlike existing commentary datasets that focus on descriptive play-by-play narration, TennisVL emphasizes expert analytical commentary capturing tactical reasoning, player decisions, and match momentum. Furthermore, we propose TennisExpert, a multimodal tennis understanding framework that integrates a video semantic parser with a memory-augmented model built on Qwen3-VL-8B. The parser extracts key match elements (e.g., scores, shot sequences, ball bounces, and player locations), while hierarchical memory modules capture both short- and long-term temporal context. Experiments show that TennisExpert consistently outperforms strong proprietary baselines, including GPT-5, Gemini, and Claude, and demonstrates improved ability to capture tactical context and match dynamics.
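
The abstract describes the parser-plus-memory design only at a high level, so the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical `RallyRecord` as the parser's per-rally output and a hypothetical `HierarchicalMemory` that keeps the last few rallies verbatim (short-term) alongside running match totals (long-term), then renders both as text that could be supplied to a vision-language model together with the current clip. All class, field, and method names here are assumptions introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RallyRecord:
    """Hypothetical parser output for one rally (field names are assumptions)."""
    score: str        # score when the rally started, e.g. "30-15"
    shots: list[str]  # ordered shot types, e.g. ["serve", "forehand"]
    winner: str       # name of the player who won the rally


class HierarchicalMemory:
    """Two-level memory sketch: recent rallies plus running match totals."""

    def __init__(self, short_capacity: int = 3):
        # Short-term memory: a bounded queue that drops the oldest rally.
        self.short_term = deque(maxlen=short_capacity)
        # Long-term memory: aggregate counts over the whole match so far.
        self.rallies_won = {}
        self.total_rallies = 0

    def update(self, rally: RallyRecord) -> None:
        """Record one parsed rally in both memory levels."""
        self.short_term.append(rally)
        self.rallies_won[rally.winner] = self.rallies_won.get(rally.winner, 0) + 1
        self.total_rallies += 1

    def context(self) -> str:
        """Render both memory levels as plain text for a model prompt."""
        totals = ", ".join(
            f"{player}: {count}" for player, count in sorted(self.rallies_won.items())
        )
        recent = "; ".join(
            f"[{r.score}] {' > '.join(r.shots)} (won by {r.winner})"
            for r in self.short_term
        )
        return (
            f"Match so far ({self.total_rallies} rallies), rallies won: {totals}\n"
            f"Recent rallies: {recent}"
        )


if __name__ == "__main__":
    memory = HierarchicalMemory(short_capacity=2)
    memory.update(RallyRecord("0-0", ["serve", "return", "forehand"], "Player A"))
    memory.update(RallyRecord("15-0", ["serve", "backhand"], "Player B"))
    memory.update(RallyRecord("15-15", ["serve"], "Player A"))
    print(memory.context())
```

Separating the two levels keeps the prompt short: the bounded queue preserves shot-by-shot detail only for the most recent rallies, while the aggregate counts stand in for the rest of the match. A real system would likely store richer long-term state than simple win counts.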