DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
arXiv cs.CV / 4/23/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The DeVI framework addresses a key limitation of synthetic human–object interaction videos by converting primarily 2D, text-conditioned generative cues into physically plausible control for dexterous agents.
- It combines 3D human tracking with robust 2D object tracking through a hybrid reward to mitigate generative imprecision and improve fidelity for physics-based imitation.
- Unlike approaches that depend on high-quality 3D kinematic demonstrations, DeVI only requires the generated video, enabling zero-shot generalization to unseen objects and varied interaction types.
- Experiments report that DeVI outperforms methods that imitate 3D human–object interaction demonstrations, especially in dexterous hand–object interaction modeling, and that it performs well in multi-object scenarios and with diverse text-driven actions.
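The hybrid reward described above weighs a 3D human-tracking term against a 2D object-tracking term so that neither the imprecise generative 3D cues nor the flat 2D signal dominates. The paper does not publish its exact formulation, so the sketch below is purely illustrative: the function names, the exponential error-to-reward mapping, and the weights `w_human` / `w_obj` are all assumptions, not DeVI's actual implementation.

```python
import numpy as np

def hybrid_reward(pose_err_3d: float, obj_err_2d: float,
                  w_human: float = 0.6, w_obj: float = 0.4,
                  scale: float = 10.0) -> float:
    """Illustrative hybrid imitation reward (NOT DeVI's actual formula).

    pose_err_3d: tracking error between the agent's 3D pose and the
                 pose estimated from the generated video.
    obj_err_2d:  reprojection error between the simulated object and
                 its 2D track in the video.
    Errors are mapped to (0, 1] rewards via an exponential kernel,
    then combined with fixed weights.
    """
    r_human = np.exp(-scale * pose_err_3d)  # 3D human-tracking term
    r_obj = np.exp(-scale * obj_err_2d)     # 2D object-tracking term
    return w_human * r_human + w_obj * r_obj

# Zero error on both terms yields the maximum reward of 1.0;
# any tracking error decays the reward smoothly toward 0.
```

Weighting the robust 2D object signal alongside the noisier 3D human estimate is one plausible way to realize the "mitigate generative imprecision" claim: the 2D term anchors object fidelity even when the lifted 3D cues drift.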