CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning

arXiv cs.RO / 3/30/2026


Key Points

  • The paper proposes CoMo, an unsupervised method for learning continuous latent motion representations from large-scale internet videos to support scalable robot learning.
  • It addresses limitations of prior discrete latent-motion approaches, which can induce shortcut learning (e.g., over-extracting static backgrounds) and also suffer from information loss and difficulty modeling fine-grained dynamics.
  • CoMo introduces an early temporal-difference (Td) mechanism to make shortcut learning harder and to strengthen motion cues in the learned latents.
  • It adds temporal contrastive learning (Tcl), using small positive temporal offsets and reversed-direction negatives to encourage latents to focus on meaningful foreground motion.
  • Experiments in simulation and the real world show strong zero-shot generalization, enabling CoMo to produce effective pseudo action labels for unseen videos and improve co-trained robot policies across both diffusion and auto-regressive architectures.
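
The early temporal-difference (Td) idea above can be illustrated with a minimal sketch. This is not CoMo's actual implementation; the function name, shapes, and the toy video are all illustrative. The point it demonstrates is that subtracting consecutive frames before encoding cancels static background pixels, so the signal that reaches the encoder is dominated by motion:

```python
import numpy as np

def early_temporal_difference(frames):
    """Illustrative Td step: difference consecutive frames before encoding.

    Static background cancels to zero; only regions that change between
    frames survive, which strengthens motion cues in downstream latents.

    frames: array of shape (T, H, W, C), float values.
    Returns an array of shape (T-1, H, W, C) of frame differences.
    """
    return frames[1:] - frames[:-1]

# Toy video: a static (all-zero) background with a bright 2x2 square
# moving one column to the right per frame.
video = np.zeros((3, 8, 8, 1), dtype=np.float32)
video[0, 2:4, 2:4] = 1.0
video[1, 2:4, 3:5] = 1.0
video[2, 2:4, 4:6] = 1.0

diffs = early_temporal_difference(video)
# The background differences are exactly zero; only the columns the
# square left (-1) and entered (+1) are non-zero.
```

Because the background contributes nothing after differencing, a model that tried to "shortcut" by memorizing static scenery has nothing to latch onto, which is the intuition behind making shortcut learning harder.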

Abstract

Unsupervised learning of latent motion from Internet videos is crucial for robot learning. Existing discrete methods generally mitigate the shortcut learning caused by extracting excessive static backgrounds through vector quantization with a small codebook size. However, they suffer from information loss and struggle to capture more complex and fine-grained dynamics. Moreover, there is an inherent gap between the distribution of discrete latent motion and continuous robot actions, which hinders the joint learning of a unified policy. We propose CoMo, which aims to learn more precise continuous latent motion from Internet-scale videos. CoMo employs an early temporal difference (Td) mechanism to increase the difficulty of shortcut learning and explicitly enhance motion cues. Additionally, to ensure the latent motion better captures meaningful foregrounds, we further propose a temporal contrastive learning (Tcl) scheme. Specifically, positive pairs are constructed with a small future-frame temporal offset, while negative pairs are formed by directly reversing the temporal direction. The proposed Td and Tcl work synergistically to keep the latent motion focused on the foreground and to reinforce motion cues. Critically, CoMo exhibits strong zero-shot generalization, enabling it to generate effective pseudo action labels for unseen videos. Extensive simulated and real-world experiments show that policies co-trained with CoMo pseudo action labels achieve superior performance with both diffusion and auto-regressive architectures.
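
The Tcl pairing rule in the abstract can be sketched as follows. This is a hedged illustration, not the paper's implementation: the function name and shapes are made up, and it assumes latent motions are per-step quantities (e.g., derived from frame differences) that flip sign when the temporal direction of the clip is reversed. Positives are latents a small future offset ahead; negatives are the direction-reversed motions:

```python
import numpy as np

def tcl_pairs(latents, offset=1):
    """Illustrative Tcl pair construction.

    For each anchor latent motion z_t:
      - positive: z_{t+offset}, a small future temporal offset
        (nearby motions of the same foreground should be similar);
      - negative: the temporally reversed motion, modeled here as -z_t
        (assuming latent motion negates under time reversal).

    latents: array of shape (T, D), per-step latent motions.
    Returns (anchors, positives, negatives), each of shape (T-offset, D).
    """
    anchors = latents[:-offset]
    positives = latents[offset:]
    negatives = -anchors  # reversed temporal direction
    return anchors, positives, negatives

# Usage: 5 steps of 2-D latent motion.
z = np.arange(10, dtype=np.float32).reshape(5, 2)
a, p, n = tcl_pairs(z, offset=1)
```

Pulling anchors toward their positives while pushing them from the reversed-direction negatives forces the latent to encode the direction of foreground motion, since a static-background latent would be identical to its own reversal and could not separate the two.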