Do vision models perceive illusory motion in static images like humans?

arXiv cs.CV / 4/14/2026


Key Points

  • The paper investigates whether DNN-based vision/optical-flow models can perceive illusory motion from static images, specifically testing the Rotating Snakes illusion against human motion perception.
  • Most evaluated optical-flow models fail to produce flow fields consistent with human perception, revealing a significant mismatch between how machines and humans process such illusions.
  • Under simulated saccadic eye-movement conditions, only a human-inspired Dual-Channel model shows the expected rotational motion, with the best correspondence occurring during the saccade simulation.
  • Ablation studies suggest that both luminance signals and higher-order color/feature-based motion cues matter, and that recurrent attention is critical for integrating local cues to form the illusion-consistent motion interpretation.
  • The findings point to a gap between current motion-estimation systems and human visual motion processing, offering design directions for more human-aligned computer vision models.
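The saccade simulation mentioned above can be approximated by turning the single static stimulus into a two-frame input via a small global shift, loosely mimicking the retinal displacement produced by an eye movement. This is a minimal sketch under that assumption; the shift size, the wrap-around behavior of `np.roll`, and the function name `saccade_pair` are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

def saccade_pair(img, shift=(4, 0)):
    """Build a two-frame input for an optical-flow model from one static image.

    A small global shift of the image stands in for the retinal displacement
    of a saccadic eye movement (illustrative assumption, not the paper's
    exact procedure). Wrap-around at the borders comes from np.roll.

    img:   (H, W) or (H, W, C) array.
    shift: (dy, dx) displacement in pixels.
    Returns (frame1, frame2) to feed to a flow estimator.
    """
    dy, dx = shift
    frame2 = np.roll(img, shift=(dy, dx), axis=(0, 1))
    return img, frame2
```

A flow model fed such a pair should, for an ordinary image, report roughly uniform translation; the question the paper asks is whether, for Rotating Snakes, any model instead reports the rotational motion humans see.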

Abstract

Understanding human motion processing is essential for building reliable, human-centered computer vision systems. Although deep neural networks (DNNs) achieve strong performance in optical flow estimation, they remain less robust than humans and rely on fundamentally different computational strategies. Visual motion illusions provide a powerful probe into these mechanisms, revealing how human and machine vision align or diverge. While recent DNN-based motion models can reproduce dynamic illusions such as reverse-phi, it remains unclear whether they can perceive illusory motion in static images, exemplified by the Rotating Snakes illusion. We evaluate several representative optical flow models on Rotating Snakes and show that most fail to generate flow fields consistent with human perception. Under simulated conditions mimicking saccadic eye movements, only the human-inspired Dual-Channel model exhibits the expected rotational motion, with the closest correspondence emerging during the saccade simulation. Ablation analyses further reveal that both luminance-based and higher-order color/feature-based motion signals contribute to this behavior and that a recurrent attention mechanism is critical for integrating local cues. Our results highlight a substantial gap between current optical-flow models and human visual motion processing, and offer insights for developing future motion-estimation systems with improved correspondence to human perception and human-centric AI.
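Checking whether a predicted flow field "exhibits the expected rotational motion" can be made concrete by projecting each flow vector onto the tangential direction around the image center. The sketch below assumes an `(H, W, 2)` flow array of `(dx, dy)` vectors, as dense flow estimators typically output; the scalar score and its sign convention are a hypothetical metric for illustration, not necessarily the paper's evaluation protocol.

```python
import numpy as np

def rotational_score(flow):
    """Mean tangential component of a dense flow field around the image center.

    Each flow vector is projected onto the local unit tangent of a circle
    centered on the image; a large positive mean indicates coherent rotation
    in one direction, while uniform translation averages out to near zero.
    This is an illustrative metric, not the paper's exact one.

    flow: (H, W, 2) array of (dx, dy) vectors.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    dx, dy = xs - w / 2.0, ys - h / 2.0          # offsets from center
    r = np.hypot(dx, dy) + 1e-8                  # avoid divide-by-zero at center
    tx, ty = dy / r, -dx / r                     # unit tangential direction
    return float(np.mean(flow[..., 0] * tx + flow[..., 1] * ty))
```

On a synthetic purely rotational field the score approaches 1, while a uniform-translation field scores near 0, so thresholding this value is one simple way to flag illusion-consistent predictions.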