GenMatter: Perceiving Physical Objects with Generative Matter Models

arXiv cs.AI / 4/27/2026

Key Points

  • The paper introduces “GenMatter,” a generative model designed to perceive physical objects by jointly modeling motion cues and appearance features in a unified framework.
  • It hierarchically represents the scene with particles (small Gaussians) and groups those particles into clusters corresponding to coherently and independently moveable physical entities (see the data-structure sketch after this list).
  • The authors develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling that recovers stable particle motions and object groupings (a sampling sketch follows the abstract below).
  • GenMatter is evaluated in three settings (2D random-dot kinematograms, camouflaged rotating objects, and naturalistic RGB videos), showing robust object perception, 3D structure recovery from motion, and stable object-level tracking and scene understanding.
  • The work grounds motion-based perception in principles of human vision, aiming to cover the diverse input conditions where biological vision succeeds but existing computer-vision systems do not.
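
To make the representation concrete, here is a minimal sketch (not the authors' code) of the hierarchical scene state the key points describe: each particle is a small Gaussian carrying a position, spatial extent, a low-level motion cue, and a high-level appearance feature, and a cluster id groups particles into independently moveable entities. All type, field, and function names (`Particles`, `cluster_velocity`, etc.) are illustrative assumptions, written here in JAX.

```python
from typing import NamedTuple
import jax.numpy as jnp

class Particles(NamedTuple):
    """One Gaussian 'particle' of local matter per row (field names are illustrative)."""
    mean: jnp.ndarray        # (N, 3) Gaussian centers: positions of local matter
    scale: jnp.ndarray       # (N,)   spatial extent of each small Gaussian
    velocity: jnp.ndarray    # (N, 3) low-level motion cue per particle
    appearance: jnp.ndarray  # (N, D) high-level appearance feature per particle
    cluster: jnp.ndarray     # (N,)   id of the moveable entity each particle belongs to

def cluster_velocity(p: Particles, k: int) -> jnp.ndarray:
    """Mean velocity of cluster k, i.e., one coherently moving chunk of matter."""
    mask = (p.cluster == k)[:, None]                 # (N, 1) membership mask
    return (p.velocity * mask).sum(0) / jnp.maximum(mask.sum(), 1)
```

The two-level structure is the point: the same arrays support reasoning about local matter (rows) and about whole objects (groups of rows sharing a cluster id).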

Abstract

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.
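
For intuition about the inference step, below is a minimal sketch of one parallelized block Gibbs sweep under strong simplifying assumptions: an isotropic-Gaussian likelihood on particle velocities and a Gaussian prior on per-cluster motion, which make both blocks conjugate and fully parallel across particles and clusters. The paper's actual model and updates are richer; every name and parameter here (`gibbs_sweep`, `obs_var`, `prior_var`) is a hypothetical stand-in.

```python
import jax
import jax.numpy as jnp

def gibbs_sweep(key, velocity, assign, K, obs_var=0.01, prior_var=1.0):
    """One sweep: resample per-cluster motions, then all assignments in parallel.

    velocity: (N, 3) observed particle velocities; assign: (N,) cluster ids.
    """
    k_motion, k_assign = jax.random.split(key)

    # Block 1: per-cluster motions from a conjugate Gaussian posterior,
    # given the current particle-to-cluster assignments.
    onehot = jax.nn.one_hot(assign, K)                        # (N, K)
    counts = onehot.sum(0)[:, None]                           # (K, 1)
    sums = onehot.T @ velocity                                # (K, 3)
    post_var = 1.0 / (counts / obs_var + 1.0 / prior_var)     # (K, 1)
    post_mean = post_var * sums / obs_var                     # (K, 3)
    motions = post_mean + jnp.sqrt(post_var) * jax.random.normal(k_motion, (K, 3))

    # Block 2: every particle's assignment, resampled in parallel from the
    # categorical over per-cluster velocity likelihoods.
    sq_err = ((velocity[:, None, :] - motions[None, :, :]) ** 2).sum(-1)  # (N, K)
    assign = jax.random.categorical(k_assign, -0.5 * sq_err / obs_var)    # (N,)
    return assign, motions
```

Because each block is a single vectorized draw rather than a per-particle loop, a sweep maps directly onto accelerator hardware, which is presumably what makes the parallelized block design hardware-friendly.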