Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations

arXiv cs.RO / 4/10/2026


Key Points

  • The paper introduces GraspDreamer, which enables zero-shot functional robotic grasping in open-world settings by using human demonstrations synthesized via visual generative models (VGMs).
  • It argues that VGMs trained on large-scale internet human data contain latent priors about human interaction with the physical world, reducing the need for labor-intensive real-world data collection.
  • The method combines these synthesized demonstrations with embodiment-specific action optimization, enabling functional grasping with minimal additional effort.
  • Experiments on public benchmarks and real-robot evaluations show GraspDreamer improves data efficiency and generalization over prior approaches across different robot hands.
  • The work also demonstrates extensions to downstream manipulation tasks and the ability to generate data that can support visuomotor policy learning.

Abstract

Building generalist robots capable of performing functional grasping in everyday, open-world environments remains a significant challenge due to the vast diversity of objects and tasks. Existing methods are either constrained to narrow object/task sets or rely on prohibitively large-scale data collection to capture real-world variability. In this work, we present GraspDreamer, an alternative approach that leverages human demonstrations synthesized by visual generative models (VGMs), e.g., video generation models, to enable zero-shot functional grasping without labor-intensive data collection. The key idea is that VGMs pre-trained on internet-scale human data implicitly encode generalized priors about how humans interact with the physical world, which can be combined with embodiment-specific action optimization to enable functional grasping with minimal effort. Extensive experiments on public benchmarks with different robot hands demonstrate the superior data efficiency and generalization performance of GraspDreamer compared to previous methods. Real-world evaluations further validate its effectiveness on real robots. Additionally, we showcase that GraspDreamer can (1) be naturally extended to downstream manipulation tasks and (2) generate data to support visuomotor policy learning.
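
To make the pipeline described in the abstract more concrete, below is a minimal sketch of how "generate a human demonstration, then optimize embodiment-specific actions" could be wired together. The abstract does not specify implementation details, so every function name, the 15-DoF hand, the keypoint format, and the finite-difference optimizer here are illustrative assumptions rather than the paper's actual method.

```python
# Hypothetical sketch of a GraspDreamer-style pipeline, based only on the abstract:
# (1) a visual generative model synthesizes a human grasp demonstration,
# (2) hand keypoints are extracted from it,
# (3) embodiment-specific optimization retargets them to a robot hand.
# All names and shapes are illustrative assumptions, not the paper's API.
import numpy as np

def synthesize_human_demo(scene_image: np.ndarray, task_prompt: str) -> np.ndarray:
    """Placeholder for a video-generation-model call plus keypoint extraction.

    Would return per-frame human fingertip positions, shape (T, 5, 3).
    Here we fabricate a static grasp pose purely for illustration.
    """
    rng = np.random.default_rng(0)
    grasp_frame = rng.uniform(-0.05, 0.05, size=(1, 5, 3))
    return np.tile(grasp_frame, (10, 1, 1))

def robot_fingertips(joint_angles: np.ndarray) -> np.ndarray:
    """Toy forward kinematics mapping joint angles to 5 fingertip positions.

    A real system would use the specific robot hand's kinematic model.
    """
    return 0.1 * np.sin(joint_angles).reshape(5, 3)

def retarget_grasp(target_tips: np.ndarray, steps: int = 200, lr: float = 0.5) -> np.ndarray:
    """Embodiment-specific action optimization (assumed form): fit joint angles
    so the robot fingertips match the human fingertips from the generated demo,
    via finite-difference gradient descent on a squared-distance cost."""
    q = np.zeros(15)  # assumed 15-DoF dexterous hand, for illustration only

    def cost(qq: np.ndarray) -> float:
        return float(np.sum((robot_fingertips(qq) - target_tips) ** 2))

    eps = 1e-4
    for _ in range(steps):
        grad = np.zeros_like(q)
        for i in range(len(q)):
            dq = np.zeros_like(q)
            dq[i] = eps
            grad[i] = (cost(q + dq) - cost(q - dq)) / (2 * eps)
        q -= lr * grad
    return q

if __name__ == "__main__":
    demo = synthesize_human_demo(np.zeros((224, 224, 3)), "grasp the mug by its handle")
    final_tips = demo[-1]              # fingertip targets at the grasp frame
    joints = retarget_grasp(final_tips)
    print("optimized joint angles:", np.round(joints, 3))
```

In the paper's framing, the generative model supplies the "what a human would do" prior, while the optimization stage handles the transfer to a particular robot hand; swapping the toy kinematics above for a real hand model is what would make the same demonstrations usable across different embodiments.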