From Skeletons to Pixels: Few-Shot Precise Event Spotting via Representation and Prediction Distillation

arXiv cs.CV / 4/28/2026


Key Points

  • The paper addresses Precise Event Spotting (PES) in fast sports like tennis, where accurate frame-level event localization is difficult due to motion blur, fine-grained action differences, and scarce annotations.
  • It proposes two few-shot distillation approaches: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively reweights teacher predictions on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level method that distills robust “skeleton” knowledge into visual representations via annealed pseudo-labeling.
  • Both methods rely on multimodal distillation to improve generalization when labeled data is limited, focusing on transferring useful information across modalities.
  • Experiments on F3Set-Tennis(sub) under k-clip few-shot settings show consistent gains over single-modality baselines and previous PES methods, and AMD-FED also performs robustly on Figure Skating.
  • The results indicate that representation-level multimodal distillation—especially skeleton-to-visual transfer—can be particularly effective for few-shot precise event spotting.
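To make the AWD idea concrete, here is a minimal sketch of confidence-weighted prediction distillation on unlabeled frames. The paper's exact weighting function is not given in this summary, so the teacher's maximum class probability is used as an illustrative adaptive weight, and all names here are hypothetical:

```python
import math

def awd_distill_loss(teacher_probs, student_probs, eps=1e-8):
    """Sketch of prediction-level distillation with adaptive weighting.

    Each unlabeled frame's KL(teacher || student) term is reweighted by
    the teacher's confidence (its max probability). The confidence-based
    weight is an assumption for illustration, not the paper's exact rule.
    """
    total, weight_sum = 0.0, 0.0
    for t, s in zip(teacher_probs, student_probs):
        w = max(t)  # adaptive weight: trust confident teacher frames more
        kl = sum(ti * math.log((ti + eps) / (si + eps))
                 for ti, si in zip(t, s))
        total += w * kl
        weight_sum += w
    return total / max(weight_sum, eps)
```

With this weighting, frames where the teacher is uncertain (near-uniform predictions) contribute less to the student's distillation loss, which is one plausible way to suppress noisy supervision on unlabeled clips.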

Abstract

Precise Event Spotting (PES) is essential in fast-paced sports such as tennis, where fine-grained events occur within very short temporal windows. Accurate frame-level localization is challenging because of motion blur, subtle action differences, and limited annotated data. We study two complementary distillation strategies for few-shot PES: Adaptive Weight Distillation (AWD), a prediction-level method that adaptively weights teacher supervision on unlabeled data, and Annealed Multimodal Distillation for Few-Shot Event Detection (AMD-FED), a representation-level framework that transfers robust skeleton knowledge into visual modalities through annealed pseudo-labeling. Both methods use multimodal distillation to improve generalization under limited supervision. We evaluate them on F3Set-Tennis(sub) under few-shot k-clip settings, where they consistently outperform single-modality baselines and prior PES approaches. After observing the stronger performance of representation-level distillation on tennis, we further validate AMD-FED on a second sports dataset, Figure Skating, where it also shows robust performance in the k-clip scenario. These results highlight the effectiveness of multimodal distillation, especially representation-level transfer, for few-shot precise event spotting.
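The annealed pseudo-labeling behind AMD-FED can be illustrated with a simple weight schedule: the influence of the skeleton teacher's pseudo-labels decays as training progresses, gradually shifting the visual student from teacher supervision toward the few labeled clips. The cosine form and the function names below are illustrative assumptions, not the paper's exact schedule:

```python
import math

def annealed_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Cosine decay of the pseudo-label weight from w_start to w_end.

    The schedule shape is an assumption for illustration; the paper may
    use a different annealing curve.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))

def total_loss(step, total_steps, supervised_loss, distill_loss):
    """Hypothetical blend: annealed skeleton-distillation term plus the
    complementary supervised term on the labeled k clips."""
    w = annealed_weight(step, total_steps)
    return (1.0 - w) * supervised_loss + w * distill_loss
```

Early in training the student leans almost entirely on the skeleton teacher's pseudo-labels; by the end, the few ground-truth clips dominate, which matches the intuition that annealing protects the student from teacher noise once it has absorbed the transferable structure.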