Appearance-free Action Recognition: Zero-shot Generalization in Humans and a Two-Pathway Model

arXiv cs.CV / 4/21/2026


Key Points

  • The study investigates whether humans can perform zero-shot action recognition on appearance-free transformations of real-world videos, where static body shape cues are removed.
  • In a lab experiment with 22 participants, people trained on naturalistic UCF5 videos still recognized actions above chance level on two appearance-free variants (AFD5 dense-noise motion and random-dot motion videos), though accuracy dropped.
  • The authors propose a two-pathway 3D CNN model with separate RGB (form) and optical-flow (motion) streams plus a coherence-gating mechanism inspired by Gestalt “common-fate” grouping.
  • The model reproduces humans' generalization behavior on both appearance-free datasets and outperforms contemporary video classification models; motion cues prove critical for zero-shot appearance-free generalization, while form cues help on naturalistic videos.

Abstract

Action recognition is a fundamental ability for social species, yet its underlying computations are not well understood. Classical psychophysical studies using simplified stimuli have shown that humans can perceive body motion even when relevant shape cues are degraded. Recent work using real-world action videos and their appearance-free counterparts (which preserve motion but lack static shape cues) included explicit training of humans and models on the appearance-free videos. Whether humans and vision models generalize in a zero-shot manner to appearance-free transformations of real-world action videos is not yet known. To measure this generalization in humans, we conducted a laboratory-based psychophysics experiment. Twenty-two participants were trained to recognize five action categories using naturalistic videos (UCF5 dataset), and were tested zero-shot on two types of appearance-free transformations: (i) dense-noise motion videos from an existing dataset (AFD5) and (ii) random-dot appearance-free videos. We find that participants recognize actions in both types of appearance-free videos well above chance, albeit with reduced accuracy compared to naturalistic videos. To model this behavior, we developed a two-pathway 3D CNN-based model combining an RGB (form) stream and an optical-flow (motion) stream, including a coherence-gating mechanism inspired by Gestalt common-fate grouping. Our model generalizes to both appearance-free datasets and outperforms contemporary video classification models, narrowing the gap to human performance. We find that the motion pathway is critical for generalization to appearance-free videos, while the form pathway improves performance on naturalistic videos. Our findings highlight the importance of motion-based representations for generalization to appearance-free videos, and support the use of multi-stream architectures to model video-based action recognition.
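The paper does not publish its coherence-gating equations in this summary, but the core idea can be illustrated with a minimal sketch: compute a "common-fate" coherence score from an optical-flow field (how much of the flow energy is explained by a single shared direction), then use that score to gate the late fusion of the form and motion pathways. All function names, the flow representation, and the linear-gating rule below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def flow_coherence(flow):
    """Illustrative Gestalt 'common-fate' coherence of an optical-flow field.

    flow: array of shape (N, 2) -- per-pixel (dx, dy) motion vectors.
    Returns a scalar in [0, 1]: 1.0 when all vectors point the same way,
    near 0.0 when motions cancel out (no shared fate).
    """
    mags = np.linalg.norm(flow, axis=1)   # per-vector speed
    total = mags.sum()
    if total == 0.0:
        return 0.0
    mean_vec = flow.sum(axis=0)           # vector sum of all motions
    # Ratio of coherent (summed) motion to total motion energy.
    return float(np.linalg.norm(mean_vec) / total)

def gated_fusion(form_logits, motion_logits, coherence):
    """Coherence-gated late fusion of the two pathways (assumed linear gate).

    High coherence -> weight the motion stream more; low coherence ->
    fall back on the form (RGB) stream.
    """
    g = float(np.clip(coherence, 0.0, 1.0))
    return g * np.asarray(motion_logits) + (1.0 - g) * np.asarray(form_logits)

# Fully coherent flow: every pixel moves right -> coherence 1.0,
# so the fused prediction follows the motion pathway entirely.
coherent = np.tile(np.array([1.0, 0.0]), (8, 1))
g = flow_coherence(coherent)
fused = gated_fusion(form_logits=np.array([2.0, -1.0]),
                     motion_logits=np.array([-1.0, 3.0]),
                     coherence=g)
```

A learned gate (e.g. a small network over local flow statistics) would replace the fixed linear rule in a trained model; the sketch only shows why coherent dot motion, as in the random-dot stimuli, can drive the motion pathway even when the form stream carries no shape information.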