Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation

arXiv cs.CV / 4/14/2026


Key Points

  • The paper tackles Video Unsupervised Domain Adaptation (VUDA) for action recognition, where models trained on labeled source data must adapt to unlabeled target video domains.
  • It argues that common failures stem from static, low-information backgrounds that increase domain shift, and from prior methods ignoring computational efficiency constraints.
  • The proposed Learnable Motion-Focused Tokenization (LMFT) converts frames into patch tokens while learning to drop low-motion, redundant tokens (often background) and keep motion-rich tokens tied to actions (see the sketch after this list).
  • Experiments on three standard VUDA benchmarks across 21 domain adaptation settings report state-of-the-art performance along with substantial reductions in computational overhead.
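
To make the token-selection idea concrete, here is a minimal sketch of scoring patch tokens by inter-frame motion and keeping only the most motion-rich ones. It is not the paper's method: LMFT learns which tokens to discard, whereas this sketch uses a simple frame-difference heuristic, and the function name, patch size, and keep ratio are illustrative assumptions.

```python
import torch

def motion_focused_token_selection(frames, patch_size=16, keep_ratio=0.5):
    """Hypothetical sketch: score patch tokens by inter-frame motion and keep
    the most motion-rich ones. The real LMFT learns the drop decision; here a
    frame-difference heuristic stands in for that learned scoring.

    frames: tensor of shape (T, C, H, W) for one video clip.
    Returns indices of kept patch tokens for each frame transition.
    """
    T, C, H, W = frames.shape
    # Split each frame into non-overlapping patches -> (T, num_patches, C*patch_size*patch_size)
    patches = frames.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch_size * patch_size)

    # Motion score per patch: mean absolute difference from the same patch in the previous frame.
    motion = (patches[1:] - patches[:-1]).abs().mean(dim=-1)  # (T-1, num_patches)

    # Keep the top-k motion-rich tokens; low-motion (often background) tokens are dropped.
    k = max(1, int(keep_ratio * motion.shape[1]))
    keep_idx = motion.topk(k, dim=1).indices  # (T-1, k)
    return keep_idx

clip = torch.randn(8, 3, 224, 224)            # 8-frame clip, 14x14 = 196 patches per frame
kept = motion_focused_token_selection(clip)
print(kept.shape)                              # torch.Size([7, 98]) with keep_ratio=0.5
```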

Abstract

Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
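
The efficiency claim follows from dropping tokens before the backbone processes them. As a rough illustration only (the ratios below are assumptions, not the paper's measurements), if the backbone is a ViT-style transformer whose self-attention cost grows quadratically with token count, keeping a fraction of the tokens yields a super-linear saving on the attention side (MLP layers scale roughly linearly):

```python
def attention_cost_ratio(keep_ratio):
    """Back-of-envelope: self-attention cost scales ~quadratically with the
    number of tokens, so pruning tokens saves more than proportionally."""
    return keep_ratio ** 2

for r in (1.0, 0.75, 0.5, 0.25):
    print(f"keep {r:.0%} of tokens -> ~{attention_cost_ratio(r):.0%} of original attention cost")
```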