SportSkills: Physical Skill Learning from Sports Instructional Videos

arXiv cs.CV · March 27, 2026


Key Points

  • The paper introduces SportSkills, a large-scale in-the-wild sports instructional video dataset focused on physical skill learning rather than generic human activity recognition.
  • SportSkills contains 360k+ instructional videos and 630k+ visual demonstrations across 55 sports, paired with instructional narrations that explain the know-how behind actions.
  • Experiments show that training on SportSkills improves fine-grained understanding of physical actions, with representation gains of up to 4x over the same model trained on traditional activity-centric datasets.
  • The authors also propose mistake-conditioned instructional video retrieval, enabling models to map a user’s execution and query to relevant clips for improvement (i.e., actionable feedback).
  • Evaluations by professional coaches indicate the retrieval approach substantially improves the personalization of visual instructions for a user's query.
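To make the retrieval task concrete, here is a minimal sketch of how mistake-conditioned instructional video retrieval might be scored: the user's execution video and text query are each embedded, fused into a single condition vector, and matched against candidate clip embeddings by cosine similarity. The fusion rule, function names, and embedding dimensions are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def fuse(execution_emb: np.ndarray, query_emb: np.ndarray) -> np.ndarray:
    """Combine the execution-video and query embeddings.

    Illustrative assumption: a simple average followed by L2
    normalization; the paper does not specify this fusion rule.
    """
    v = (execution_emb + query_emb) / 2.0
    return v / np.linalg.norm(v)

def retrieve(execution_emb: np.ndarray,
             query_emb: np.ndarray,
             clip_embs: np.ndarray,
             top_k: int = 3) -> np.ndarray:
    """Rank candidate instructional clips by cosine similarity
    to the fused (execution, query) condition vector."""
    cond = fuse(execution_emb, query_emb)
    clips = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    scores = clips @ cond          # cosine similarities
    return np.argsort(-scores)[:top_k]

# Toy usage with random stand-in "embeddings"
rng = np.random.default_rng(0)
exec_emb = rng.normal(size=64)
query_emb = rng.normal(size=64)
clip_embs = rng.normal(size=(10, 64))
print(retrieve(exec_emb, query_emb, clip_embs))
```

In practice the embeddings would come from a video-language model trained on SportSkills; the sketch only shows the retrieval interface ("here's my execution and my question; rank the clips").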

Abstract

Current large-scale video datasets focus on general human activity, but lack depth of coverage on fine-grained activities needed to address physical skill learning. We introduce SportSkills, the first large-scale sports dataset geared towards physical skill learning with in-the-wild video. SportSkills has more than 360k instructional videos containing more than 630k visual demonstrations paired with instructional narrations explaining the know-how behind the actions from 55 varied sports. Through a suite of experiments, we show that SportSkills unlocks the ability to understand fine-grained differences between physical actions. Our representation achieves gains of up to 4x with the same model trained on traditional activity-centric datasets. Crucially, building on SportSkills, we introduce the first large-scale task formulation of mistake-conditioned instructional video retrieval, bridging representation learning and actionable feedback generation (e.g., "here's my execution of a skill; which video clip should I watch to improve it?"). Formal evaluations by professional coaches show our retrieval approach significantly advances the ability of video models to personalize visual instructions for a user query.