PhysLayer: Language-Guided Layered Animation with Depth-Aware Physics

arXiv cs.CV / 4/28/2026


Key Points

  • PhysLayer is introduced as a framework for language-guided, depth-aware layered animation from static images, aiming to fix physically implausible motion and limited dynamical control in existing image-to-video methods.
  • The method uses a language-guided scene understanding module (built on vision foundation models) to decompose scenes into depth-based layers using object composition, material properties, and physical parameters.
  • It introduces a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, improving realistic interactions without full 3D reconstruction.
  • A physics-guided video synthesis module combines simulated object trajectories with scene-aware relighting to produce temporally coherent, text-aligned video outputs.
  • Experiments report improvements in CLIP-Similarity (+2.2%), FID (+9.3%), and Motion-FID (+3%), along with large gains in human ratings for physical plausibility (+24%) and text-video alignment (+35%).
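The depth-aware simulation described in the bullets above can be pictured with a small toy sketch. This is an illustrative assumption of how 2D rigid-body dynamics might be extended with a depth coordinate and perspective scaling, not the paper's actual implementation; the names `Layer`, `step`, and the focal length `F` are invented here:

```python
from dataclasses import dataclass

@dataclass
class Layer:
    x: float; y: float; z: float      # image-plane position plus depth z
    vx: float; vy: float; vz: float   # 2D velocity extended with depth motion

F = 1.0  # assumed pinhole focal length used for perspective scaling

def step(layer: Layer, dt: float, g: float = 9.81) -> float:
    """Advance one rigid-body step; return the layer's perspective-consistent scale."""
    layer.vy += g * dt            # gravity acts in the image plane
    layer.x += layer.vx * dt
    layer.y += layer.vy * dt
    layer.z += layer.vz * dt      # depth motion, without full 3D reconstruction
    return F / layer.z            # on-screen size shrinks as the layer recedes
```

Under this toy model, a layer with positive depth velocity (`vz > 0`) moves away from the camera, so its returned scale decreases frame by frame.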

Abstract

Existing image-to-video generation methods often produce physically implausible motions and lack precise control over object dynamics. While prior approaches have incorporated physics simulators, they remain confined to 2D planar motions and fail to capture depth-aware spatial interactions. We introduce PhysLayer, a novel framework enabling language-guided, depth-aware layered animation of static images. PhysLayer consists of three key components: First, a language-guided scene understanding module that utilizes vision foundation models to decompose scenes into depth-based layers by analyzing object composition, material properties, and physical parameters. Second, a depth-aware layered physics simulation that extends 2D rigid-body dynamics with depth motion and perspective-consistent scaling, enabling more realistic object interactions without requiring full 3D reconstruction. Third, a physics-guided video synthesis module that integrates simulated trajectories with scene-aware relighting for temporally coherent results. Experimental results demonstrate improvements in CLIP-Similarity (+2.2%), FID score (+9.3%), and Motion-FID (+3%), with human evaluation showing enhanced physical plausibility (+24%) and text-video alignment (+35%). Our approach provides a practical balance between physical realism and computational efficiency for controllable image animation.
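One consequence of keeping per-layer depth values is that the synthesis stage can composite layers back to front so that nearer objects occlude farther ones. The abstract does not spell out how PhysLayer orders its layers, so the following is only a hedged sketch of the standard painter's-algorithm ordering one would expect; `render_order` and the layer dictionaries are hypothetical names:

```python
def render_order(layers):
    """Return layer indices sorted far-to-near (painter's algorithm),
    so drawing in this order lets nearer layers occlude farther ones."""
    return sorted(range(len(layers)), key=lambda i: layers[i]["z"], reverse=True)

# Toy scene: depth z in arbitrary units, larger z = farther from the camera.
layers = [
    {"name": "ball", "z": 2.0},
    {"name": "wall", "z": 5.0},
    {"name": "leaf", "z": 1.0},
]
print([layers[i]["name"] for i in render_order(layers)])  # prints ['wall', 'ball', 'leaf']
```

Drawing `wall` first, then `ball`, then `leaf` yields the correct occlusion without any explicit 3D geometry, which matches the paper's stated goal of avoiding full 3D reconstruction.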