MambaKick: Early Penalty Direction Prediction from HAR Embeddings

arXiv cs.CV / 4/21/2026

Key Points

  • The paper introduces MambaKick, a framework for predicting soccer penalty kick shot direction under strict time constraints by using contact-centered short video segments.
  • Instead of reconstructing kinematics or using handcrafted biomechanical features, it reuses pretrained human action recognition (HAR) spatiotemporal embeddings and feeds them into a lightweight temporal predictor based on Mamba (selective state-space models).
  • The method also incorporates simple contextual metadata such as field side and footedness to reduce ambiguity in real-world footage.
  • Across multiple HAR backbones, MambaKick improves or matches strong embedding baselines, reaching up to 53.1% accuracy for three classes and 64.5% for two classes.
  • The authors suggest the approach supports practical, low-latency intention prediction for sports video, and they plan to release code on GitHub.

Abstract

Penalty kicks in soccer are decided under extreme time constraints, where goalkeepers benefit from anticipating shot direction from the kicker's motion before or around ball contact. In this paper, MambaKick is presented as a learning-based framework for penalty direction prediction that leverages pretrained human action recognition (HAR) embeddings extracted from contact-centered short video segments and combines them with a lightweight temporal predictor. Rather than relying on explicit kinematic reconstruction or handcrafted biomechanical features, the approach reuses transferable spatiotemporal representations and uses selective state-space models (Mamba) for efficient sequence aggregation. Simple contextual metadata (e.g., field side and footedness) are also considered as complementary cues that may reduce ambiguity in real-world footage. Across a range of HAR backbones, MambaKick consistently improves or matches strong embedding baselines, achieving up to 53.1% accuracy for three classes and 64.5% for two classes under the proposed methodology. Overall, the results indicate that combining pretrained HAR representations with efficient state-space temporal modeling is a practical direction for low-latency intention prediction in real-world sports video. The code will be available on GitHub: https://github.com/hvelesaca/MambaKick/
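
To make the pipeline described in the abstract more concrete, the following is a minimal sketch of the general idea: a sequence of per-frame embeddings is aggregated by a simplified selective state-space recurrence, the result is pooled over time, concatenated with contextual metadata, and passed to a linear classifier. This is not the authors' implementation and does not use the actual Mamba kernel. The function names (`selective_scan`, `predict_direction`, `init_params`), the embedding and state sizes, the single shared step size, the mean pooling, the numeric encoding of field side and footedness, and the random untrained weights are all assumptions chosen for illustration. The abstract does not name the three classes, so the sketch only returns three unlabeled probabilities.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softplus(z):
    # Numerically stable softplus; keeps the step size non-negative.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(frames, params):
    """Simplified diagonal selective state-space scan (illustrative only).

    frames: list of T embedding vectors, each of length D.
    Returns a list of T output vectors of length D.
    """
    dim = len(frames[0])
    a = params["a"]  # positive decay rates, one per state slot
    state = [[0.0] * len(a) for _ in range(dim)]
    outputs = []
    for x in frames:
        # Step size and projections depend on the current input:
        # this input dependence is the "selective" part.
        delta = softplus(dot(params["w_delta"], x))
        b = [dot(row, x) for row in params["w_b"]]
        c = [dot(row, x) for row in params["w_c"]]
        y = []
        for d in range(dim):
            for i, a_i in enumerate(a):
                decay = math.exp(-delta * a_i)  # in (0, 1]
                state[d][i] = decay * state[d][i] + delta * b[i] * x[d]
            y.append(dot(c, state[d]))
        outputs.append(y)
    return outputs


def predict_direction(frames, metadata, params):
    """Return class probabilities for one clip (hypothetical head)."""
    outputs = selective_scan(frames, params)
    # Mean-pool over time (an assumption; other aggregations are possible).
    pooled = [sum(col) / len(outputs) for col in zip(*outputs)]
    features = pooled + list(metadata)
    logits = [dot(row, features) + bias
              for row, bias in zip(params["w_out"], params["b_out"])]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(dim, state_size, meta_dim, num_classes, seed=0):
    """Random, untrained weights, used only so the sketch runs."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "a": [float(i + 1) for i in range(state_size)],
        "w_delta": mat(1, dim)[0],
        "w_b": mat(state_size, dim),
        "w_c": mat(state_size, dim),
        "w_out": mat(num_classes, dim + meta_dim),
        "b_out": [0.0] * num_classes,
    }


# Stand-in for HAR embeddings of a contact-centered clip: 16 frames, 8 dims.
rng = random.Random(1)
frames = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
metadata = [1.0, 0.0]  # hypothetical encoding: field side, footedness
params = init_params(dim=8, state_size=4, meta_dim=2, num_classes=3)
probs = predict_direction(frames, metadata, params)
print(probs)
```

In a real system the random `frames` would be replaced by embeddings from a pretrained HAR backbone and the weights would be learned. The recurrence processes one frame at a time with a fixed-size state, which is the property that makes state-space aggregation attractive for low-latency prediction.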