CAKE: Real-time Action Detection via Motion Distillation and Background-aware Contrastive Learning

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces CAKE, a real-time Online Action Detection (OAD) framework designed to address the twin issues of high compute cost and weak modeling of discriminative temporal dynamics versus background motion.
  • Instead of computing optical flow explicitly, CAKE uses a motion knowledge distillation approach that transfers flow-like motion cues into an RGB model.
  • It proposes a Dynamic Motion Adapter (DMA) that suppresses static background noise and highlights pixel changes, effectively approximating optical-flow information without its overhead.
  • The framework adds Floating Contrastive Learning to better separate informative motion dynamics from temporal background signals.
  • Experiments on TVSeries, THUMOS'14, and Kinetics-400 report strong mean Average Precision (mAP) improvements over state-of-the-art methods using the same backbone, while running at over 72 FPS on a single CPU, supporting deployment in resource-constrained settings.
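The paper does not publish the DMA's internals, but its stated goal, suppressing static background noise while highlighting pixel changes as a cheap stand-in for optical flow, can be illustrated with a minimal temporal-differencing sketch. Everything here (the function name, the gating threshold) is a hypothetical illustration, not the authors' implementation:

```python
import numpy as np

def dynamic_motion_cue(frames, gate_threshold=0.05):
    """Cheap flow-like motion cue from temporal differencing.

    frames: (T, H, W) grayscale clip with values in [0, 1].
    Returns per-step motion maps in which static background
    (changes below gate_threshold) is zeroed out.
    """
    diffs = np.abs(np.diff(frames, axis=0))               # (T-1, H, W) pixel changes
    gated = np.where(diffs > gate_threshold, diffs, 0.0)  # suppress static noise
    return gated

# Toy clip: a bright 2x2 square moving one pixel per frame on a static background.
T, H, W = 4, 8, 8
frames = np.zeros((T, H, W))
for t in range(T):
    frames[t, 2:4, t:t + 2] = 1.0

motion = dynamic_motion_cue(frames)
print(motion.shape)         # (3, 8, 8): one motion map per frame transition
print(motion[0].sum() > 0)  # True: the moving square's edges light up
```

An actual adapter would learn this gating end-to-end inside the network rather than thresholding raw pixels, but the sketch shows why differencing approximates flow-like cues at a fraction of optical flow's cost.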

Abstract

Online Action Detection (OAD) systems face two primary challenges: high computational cost and insufficient modeling of discriminative temporal dynamics against background motion. Optical flow provides strong motion cues but incurs significant computational overhead. We propose CAKE, an OAD framework that uses flow-based distillation to transfer motion knowledge into RGB models. We propose a Dynamic Motion Adapter (DMA) to suppress static background noise and emphasize pixel changes, effectively approximating optical flow without explicit computation. The framework also integrates a Floating Contrastive Learning strategy to distinguish informative motion dynamics from temporal background. Experiments on the TVSeries, THUMOS'14, and Kinetics-400 datasets demonstrate the effectiveness of our model. CAKE achieves standout mAP compared with state-of-the-art methods while using the same backbone. Our model operates at over 72 FPS on a single CPU, making it highly suitable for resource-constrained systems.
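The abstract's "Floating Contrastive Learning" is not specified in detail here, but contrastive separation of motion features from background features is commonly built on an InfoNCE-style objective: pull an RGB motion feature toward its flow-teacher counterpart, push it away from background-clip features. The sketch below is a generic InfoNCE loss under that assumption; the function name and temperature are illustrative, not taken from the paper:

```python
import numpy as np

def contrastive_motion_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull `anchor` (RGB motion feature) toward
    `positive` (e.g. a flow-teacher feature) and push it away from
    `negatives` (e.g. background-clip features).

    All inputs are L2-normalized 1-D feature vectors.
    """
    def sim(a, b):
        return float(np.dot(a, b)) / temperature  # scaled cosine similarity

    pos = np.exp(sim(anchor, positive))
    neg = sum(np.exp(sim(anchor, n)) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

anchor = unit(rng.normal(size=16))
positive = unit(anchor + 0.1 * rng.normal(size=16))        # near the anchor
negatives = [unit(rng.normal(size=16)) for _ in range(8)]  # unrelated features

loss = contrastive_motion_loss(anchor, positive, negatives)
print(loss)  # small when the positive is close and negatives are random
```

In training, minimizing such a loss alongside the distillation objective would encourage the RGB model to embed informative motion distinctly from temporal background, which is the separation the abstract describes.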