ABot-M0: VLA Foundation Model for Robotic Manipulation with Action Manifold Learning

arXiv cs.RO / 4/15/2026


Key Points

  • ABot-M0 proposes a framework for general-purpose embodied robotic agents by building a systematic data curation pipeline that turns heterogeneous raw robot data into unified, efficient representations.
  • The work introduces UniACT-dataset, created from six public datasets, containing 6M+ trajectories and 9,500+ hours spanning diverse robot morphologies and task scenarios, with unified pre-training to improve cross-platform generalization.
  • It advances an “Action Manifold Hypothesis,” arguing that feasible robot actions lie on a low-dimensional smooth manifold constrained by physics and tasks, and implements Action Manifold Learning (AML) using a DiT backbone to predict clean, continuous action sequences.
  • For modular perception, ABot-M0 uses a dual-stream design combining VLM semantics with geometric priors and plug-and-play multi-view 3D modules to strengthen spatial reasoning while limiting typical VLM 3D weaknesses.
  • The authors report additive, component-wise benefits and state that code and pipelines will be released for reproducibility and further research.
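The geometric intuition behind the Action Manifold Hypothesis above can be made concrete with a toy sketch: if feasible actions occupy a low-dimensional subspace of the full action space, then a projection learned from demonstrations pulls noisy action candidates back onto the feasible set. The paper's AML uses a DiT backbone; the linear PCA-style projection below is only a hypothetical stand-in (all dimensions and names are illustrative) chosen to show the manifold-projection idea in a few lines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 32-D action space whose feasible actions lie on a 4-D
# linear subspace (a stand-in for the low-dimensional action manifold).
ACTION_DIM, MANIFOLD_DIM = 32, 4
basis, _ = np.linalg.qr(rng.normal(size=(ACTION_DIM, MANIFOLD_DIM)))

# "Demonstration" actions: points on the manifold plus small sensor noise.
latents = rng.normal(size=(1000, MANIFOLD_DIM))
demos = latents @ basis.T + 0.01 * rng.normal(size=(1000, ACTION_DIM))

# Estimate the manifold from data via PCA (top principal directions).
_, _, vt = np.linalg.svd(demos - demos.mean(axis=0), full_matrices=False)
est_basis = vt[:MANIFOLD_DIM].T  # shape (ACTION_DIM, MANIFOLD_DIM)

def project(action: np.ndarray) -> np.ndarray:
    """Project a raw action onto the estimated feasible subspace."""
    return est_basis @ (est_basis.T @ action)

# A noisy candidate action is pulled back toward the feasible set.
noisy = basis @ rng.normal(size=MANIFOLD_DIM) + 0.5 * rng.normal(size=ACTION_DIM)
clean = project(noisy)

# Residual of the projected action w.r.t. the *true* subspace is small.
residual = np.linalg.norm(clean - basis @ (basis.T @ clean))
```

In AML the projection is nonlinear and learned end-to-end, predicting clean, continuous action chunks directly rather than iteratively denoising them; the sketch only illustrates why restricting predictions to the feasible manifold can stabilize the policy.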

Abstract

Building general-purpose embodied agents across diverse hardware remains a central challenge in robotics, often framed as the "one-brain, many-forms" paradigm. Progress is hindered by fragmented data, inconsistent representations, and misaligned training objectives. We present ABot-M0, a framework that builds a systematic data curation pipeline while jointly optimizing model architecture and training strategies, enabling end-to-end transformation of heterogeneous raw data into unified, efficient representations. From six public datasets, we clean, standardize, and balance samples to construct UniACT-dataset, a large-scale dataset with over 6 million trajectories and 9,500 hours of data, covering diverse robot morphologies and task scenarios. Unified pre-training improves knowledge transfer and generalization across platforms and tasks, supporting general-purpose embodied intelligence. To improve action prediction efficiency and stability, we propose the Action Manifold Hypothesis: effective robot actions lie not in the full high-dimensional space but on a low-dimensional, smooth manifold governed by physical laws and task constraints. Based on this, we introduce Action Manifold Learning (AML), which uses a DiT backbone to predict clean, continuous action sequences directly. This shifts learning from denoising to projection onto feasible manifolds, improving decoding speed and policy stability. ABot-M0 supports modular perception via a dual-stream mechanism that integrates VLM semantics with geometric priors and multi-view inputs from plug-and-play 3D modules such as VGGT and Qwen-Image-Edit, enhancing spatial understanding without modifying the backbone and mitigating standard VLM limitations in 3D reasoning. Experiments show components operate independently with additive benefits. We will release all code and pipelines for reproducibility and future research.
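As a rough illustration of the curation stage the abstract describes (clean, standardize, balance), the sketch below normalizes heterogeneous trajectories with different action dimensions and lengths into a fixed-size representation, drops degenerate clips, and caps over-represented sources. All names, target sizes, and thresholds here are hypothetical; the actual UniACT pipeline is not specified at this level of detail.

```python
import numpy as np

# Hypothetical unification targets: 7-D actions, 50 timesteps,
# at most 2 trajectories kept per source corpus (for balancing).
TARGET_DIM, TARGET_LEN, MAX_PER_SOURCE = 7, 50, 2

def standardize(traj: np.ndarray) -> np.ndarray:
    """Resample a (T, D) trajectory to (TARGET_LEN, TARGET_DIM).

    Extra action dims are truncated, missing ones zero-padded; time is
    linearly interpolated onto a common grid.
    """
    t, d = traj.shape
    padded = np.zeros((t, TARGET_DIM))
    padded[:, : min(d, TARGET_DIM)] = traj[:, :TARGET_DIM]
    src = np.linspace(0.0, 1.0, t)
    dst = np.linspace(0.0, 1.0, TARGET_LEN)
    return np.stack(
        [np.interp(dst, src, padded[:, j]) for j in range(TARGET_DIM)], axis=1
    )

def curate(raw: dict) -> list:
    """Clean (drop degenerate clips), standardize, and balance per source."""
    unified = []
    for source, trajs in raw.items():
        kept = [standardize(tr) for tr in trajs if len(tr) >= 2]
        unified.extend(kept[:MAX_PER_SOURCE])  # cap over-represented sources
    return unified

# Heterogeneous toy corpora: different lengths and action dims per robot.
raw = {
    "arm_6dof": [np.random.rand(30, 6) for _ in range(4)],
    "gripper_2dof": [np.random.rand(80, 2), np.random.rand(1, 2)],
}
dataset = curate(raw)
print(len(dataset), dataset[0].shape)  # → 3 (50, 7)
```

After this step every trajectory shares one shape regardless of the source robot's morphology, which is the property that lets a single model pre-train across all six corpora.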