WARPED: Wrist-Aligned Rendering for Robot Policy Learning from Egocentric Human Demonstrations

arXiv cs.RO / 4/14/2026

Key Points

  • The paper introduces WARPED, a framework that synthesizes realistic wrist-aligned (robot-like) observations from egocentric human demonstration videos to train visuomotor policies.
  • It enables training from monocular RGB data alone: demonstrations are recorded with an egocentric RGB camera, the scene is initialized with vision foundation models, hand–object interactions are tracked, and the resulting trajectories are retargeted to a robot end-effector.
  • WARPED generates photo-realistic wrist-view inputs using Gaussian Splatting, allowing policies to be trained directly on these synthesized observations rather than relying on specialized multiview/depth hardware.
  • Experiments on five tabletop manipulation tasks show success rates comparable to policies trained from teleoperated demonstrations while reducing human data collection time by 5–8×.
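
To make the "wrist-aligned" idea concrete, here is a minimal sketch of how a virtual wrist-camera pose could be derived from a retargeted end-effector pose before a view is rendered. It is not taken from the paper: the function names, the fixed camera mount offset `ee_from_cam`, and the example numbers are all hypothetical, and the rendering step itself (Gaussian Splatting) is left out.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def invert_rigid(t):
    """Invert a rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def translation(x, y, z):
    """Pure translation as a 4x4 transform."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]

def rot_z(theta):
    """Rotation about the z axis by theta radians as a 4x4 transform."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

def wrist_camera_pose(world_from_ee, ee_from_cam):
    """Compose the end-effector pose with a fixed camera mount offset."""
    return mat_mul(world_from_ee, ee_from_cam)

# Hypothetical retargeted end-effector pose for one frame (metres, radians).
world_from_ee = mat_mul(translation(0.4, 0.0, 0.2), rot_z(math.pi / 2))
# Hypothetical fixed mount: camera offset expressed in the end-effector frame.
ee_from_cam = translation(0.0, 0.05, -0.08)

world_from_cam = wrist_camera_pose(world_from_ee, ee_from_cam)
# A renderer typically wants the inverse (world-to-camera) view matrix.
cam_from_world = invert_rigid(world_from_cam)
cam_position = [round(world_from_cam[i][3], 6) for i in range(3)]
print(cam_position)
```

Repeating this composition for every frame of a retargeted trajectory would give one camera pose per synthesized wrist-view image. Axis conventions differ between renderers, so an extra fixed rotation may be needed in practice.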

Abstract

Recent advancements in learning from human demonstration have shown promising results in addressing the scalability and high cost of data collection required to train robust visuomotor policies. However, existing approaches are often constrained by a reliance on multiview camera setups, depth sensors, or custom hardware and are typically limited to policy execution from third-person or egocentric cameras. In this paper, we present WARPED, a framework designed to synthesize realistic wrist-view observations from human demonstration videos to facilitate the training of visuomotor policies using only monocular RGB data. With data collected from an egocentric RGB camera, our system leverages vision foundation models to initialize the interactive scene. A hand-object interaction pipeline is then employed to track the hand and manipulated object and retarget the trajectories to a robotic end-effector. Lastly, photo-realistic wrist-view observations are synthesized via Gaussian Splatting to directly train a robotic policy. We demonstrate that WARPED achieves success rates comparable to policies trained on teleoperated demonstration data for five tabletop manipulation tasks, while requiring 5-8x less data collection time.
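
The retargeting step mentioned in the abstract can be illustrated with a simple heuristic. This is a sketch under stated assumptions, not the paper's method: it assumes a parallel-jaw gripper, tracked 3D thumb and index fingertip positions in metres, and a hypothetical maximum opening `max_width`.

```python
import math

def retarget_grasp(thumb_tip, index_tip, max_width=0.08):
    """Map thumb and index fingertip positions to a gripper command.

    Returns (grasp_center, gripper_width). The center is the fingertip
    midpoint; the width is the fingertip distance, clamped to max_width.
    All names and the default max_width are illustrative assumptions.
    """
    width = math.dist(thumb_tip, index_tip)
    center = tuple((a + b) / 2.0 for a, b in zip(thumb_tip, index_tip))
    return center, min(width, max_width)

# Hypothetical tracked fingertips for one video frame.
center, width = retarget_grasp((0.0, 0.0, 0.0), (0.03, 0.04, 0.0))
print(center, round(width, 4))
```

A complete retargeting pipeline would also need the hand orientation to set the end-effector rotation, plus temporal smoothing of the tracked trajectory. Both are omitted here.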