CEZSAR: A Contrastive Embedding Method for Zero-Shot Action Recognition

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces CEZSAR, a zero-shot action recognition (ZSAR) method that uses contrastive learning to classify action classes not seen during training.
  • It targets two core challenges in ZSAR—semantic gaps between text-derived label representations and visual features, and domain shift caused by differences between unknown test sets and training data.
  • CEZSAR learns a joint embedding space by encoding videos and sentences and aligning videos with their natural-language descriptions.
  • To improve training, the authors propose an automatic negative sampling strategy that creates additional unpaired (visual appearance with unrelated descriptions) examples for contrastive learning.
  • Experiments report state-of-the-art performance on UCF-101 and Kinetics-400 across multiple split settings, and the code is released on GitHub.
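The automatic negative sampling strategy described above can be sketched as follows. The paper's exact procedure is not detailed here, so this is a minimal illustration under one assumption: each training example is a (video, description, class) triple, and an "unpaired" negative is formed by matching a video with a description drawn from a different class.

```python
import random

def sample_negatives(triples, num_neg=1, seed=0):
    """Generate unpaired (video, description) examples for contrastive
    training by pairing each video with a description from a *different*
    class. `triples` is a list of (video_id, description, class_label)
    tuples -- a hypothetical format, not the paper's actual pipeline.
    Returns (video_id, description, 0) tuples, where 0 marks "unpaired".
    """
    rng = random.Random(seed)
    negatives = []
    for video_id, _, label in triples:
        # Only descriptions from other classes are valid negatives.
        candidates = [d for _, d, l in triples if l != label]
        for _ in range(num_neg):
            negatives.append((video_id, rng.choice(candidates), 0))
    return negatives
```

In this setup the positives (matched video–description pairs) and the generated negatives together form the contrastive training set.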

Abstract

This paper proposes a novel Zero-Shot Action Recognition (ZSAR) method based on contrastive learning. In ZSAR, we aim to classify examples from classes that were missing during training. Two well-known problems remain in ZSAR: the semantic gap and the domain shift. A semantic gap occurs because label representations come from the textual domain (e.g., language models) and must be associated with visual representations (e.g., CNNs, RNNs, transformer-based models). This multimodal nature implies that the semantic properties of the two spaces are not identical. The domain shift, on the other hand, arises from differences between the training and test sets and is inherent to ZSAR, since the test set is unknown. One of the most promising ways to address both issues is learning joint embedding spaces. Therefore, we propose a new model that encodes videos and sentences in a joint embedding space, trained by aligning videos with their natural-language descriptions. We design an automatic negative sampling procedure to augment the training dataset and generate unpaired data, i.e., visual appearances with unrelated descriptions. Our results are state-of-the-art on the UCF-101 and Kinetics-400 datasets under several split configurations. Our code is available at https://github.com/valterlej/cezsar.
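The joint embedding described in the abstract is trained by pulling matched video–description pairs together and pushing unmatched ones apart. The abstract does not state the exact objective, so the sketch below uses a generic symmetric InfoNCE-style contrastive loss over L2-normalized embeddings as a stand-in; the matrix layout (row i of each input is a matched pair) and the temperature value are assumptions, not the paper's specification.

```python
import numpy as np

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning video and text embeddings
    in a joint space. Row i of `video_emb` and `text_emb` is assumed to
    be a matched pair; all other rows in the batch act as negatives.
    A generic sketch, not the paper's exact objective.
    """
    # L2-normalize so the dot product is cosine similarity.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature
    idx = np.arange(len(v))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()    # diagonal = matched pairs

    # Average video-to-text and text-to-video directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Well-aligned pairs place the largest similarity on the diagonal, driving the loss toward zero; the unpaired examples produced by the negative sampling procedure populate the off-diagonal terms.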