Disentangled Robot Learning via Separate Forward and Inverse Dynamics Pretraining
arXiv cs.RO · April 21, 2026
Key Points
- The paper proposes DeFI, a framework for disentangled robot learning that separates 2D visual forward dynamics (future-state prediction) from 3D inverse dynamics (action inference).
- It introduces two specialized pretrained components: GFDM, which forecasts future states from diverse human and robot videos, and GIDM, which learns latent actions from unlabeled video transitions via self-supervised learning.
- The approach integrates GFDM and GIDM into a unified architecture for end-to-end fine-tuning on downstream robotic tasks.
- Experiments on CALVIN ABC-D and SimplerEnv show state-of-the-art results, including an average task length of 4.51 on CALVIN, 51.2% success on SimplerEnv-Fractal, and 81.3% success in real-world deployment.
- By decoupling video generation and action prediction, DeFI aims to overcome entangled training limitations and better leverage large-scale action-free web video data.
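The disentangled design above can be sketched as two independently pretrainable modules that are composed only at fine-tuning time. The module names (GFDM, GIDM) follow the article; the shapes, the linear parameterization, and the `policy` composition are illustrative assumptions, not the paper's implementation.

```python
# Sketch of DeFI's two-module split: a forward model predicting future
# observations and an inverse model inferring latent actions from transitions.
# All dimensions and the linear maps are placeholder assumptions.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM = 8, 3

class GFDM:
    """Forward dynamics: predict the next observation from the current one."""
    def __init__(self):
        self.W = rng.normal(scale=0.1, size=(OBS_DIM, OBS_DIM))

    def predict(self, obs):
        return obs @ self.W

class GIDM:
    """Inverse dynamics: infer a latent action from an observation transition."""
    def __init__(self):
        self.W = rng.normal(scale=0.1, size=(2 * OBS_DIM, LATENT_DIM))

    def infer(self, obs, next_obs):
        return np.concatenate([obs, next_obs]) @ self.W

def policy(gfdm, gidm, obs):
    # At fine-tuning time the pretrained modules are composed:
    # forecast a future state, then read off the latent action that reaches it.
    return gidm.infer(obs, gfdm.predict(obs))

obs = rng.normal(size=OBS_DIM)
z = policy(GFDM(), GIDM(), obs)
print(z.shape)  # latent action vector
```

Because each module is pretrained on its own objective (video prediction vs. self-supervised action inference), action-free web video can supervise both sides before any robot action labels are needed.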