CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
arXiv cs.RO · March 30, 2026
Key Points
- The paper proposes CoMo, an unsupervised method for learning continuous latent motion representations from large-scale internet videos to support scalable robot learning.
- It addresses limitations of prior discrete latent-motion approaches, which can induce shortcut learning (e.g., over-extracting static backgrounds) and also suffer from information loss and difficulty modeling fine-grained dynamics.
- CoMo introduces an early temporal-difference (Td) mechanism to make shortcut learning harder and to strengthen motion cues in the learned latents.
- It adds temporal contrastive learning (Tcl), using small positive temporal offsets and reversed-direction negatives to encourage latents to focus on meaningful foreground motion.
- Experiments in simulation and the real world show strong zero-shot generalization: CoMo produces effective pseudo-action labels for unseen videos and improves co-trained robot policies under both diffusion and autoregressive policy architectures.
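The two mechanisms above can be sketched in a few lines. This is a minimal numpy illustration, not the paper's implementation: the function names, the use of raw feature differences as motion latents, and the choice to model reversed-direction negatives as negated latents are all simplifying assumptions for exposition.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize vectors so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def temporal_difference(feats):
    """Early temporal difference: subtract consecutive per-frame features
    (shape (T, D)) so static background content cancels and the residual
    emphasizes motion. Returns (T-1, D) motion cues."""
    return feats[1:] - feats[:-1]

def temporal_contrastive_loss(motion, temperature=0.1):
    """Toy InfoNCE-style loss: each motion latent's positive is its
    small-offset temporal neighbor; negatives are time-reversed motions,
    approximated here as negated latents (an assumption of this sketch)."""
    z = l2_normalize(motion)             # (M, D)
    anchors, positives = z[:-1], z[1:]   # small positive temporal offset
    negatives = -z                       # reversed-direction motions
    pos = np.sum(anchors * positives, axis=-1, keepdims=True)  # (M-1, 1)
    neg = anchors @ negatives.T                                # (M-1, M)
    logits = np.concatenate([pos, neg], axis=-1) / temperature
    logits -= logits.max(axis=-1, keepdims=True)               # stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=-1))
    return -log_prob.mean()
```

Note how the difference operator alone already removes any feature component that is constant across frames, which is why it makes background-copying shortcuts harder before the contrastive term is even applied.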