Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

arXiv cs.RO / 3/26/2026

💬 Opinion · Signals & Early Trends · Models & Research

Key Points

  • The paper introduces Xiaomi-Robotics-0, an open-sourced vision-language-action (VLA) model designed for high-performance, real-time robot control.
  • It uses a training approach that pre-trains on large-scale cross-embodiment robot trajectories and vision-language data while mitigating catastrophic forgetting to preserve visual-semantic knowledge.
  • Post-training techniques target asynchronous execution to reduce inference latency during real-robot rollouts.
  • The deployment strategy aligns the timesteps of consecutive predicted action chunks to produce continuous, seamless real-time behavior.
  • Experiments show state-of-the-art results in simulation benchmarks and strong performance on two demanding bimanual real-robot manipulation tasks, with fast rollouts on a consumer-grade GPU; code and checkpoints are open-sourced via the project site.
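The asynchronous-execution idea in the bullets above, predicting the next action chunk in the background while the current one is still being executed, can be sketched as below. `fake_policy` and `async_rollout` are hypothetical stand-ins for illustration, not the paper's actual interface:

```python
import threading
import time

def fake_policy(obs):
    """Stand-in for VLA inference: returns a chunk of 4 'actions'.
    (Hypothetical; the real model's interface is not specified here.)"""
    time.sleep(0.05)  # simulate inference latency
    return [obs + i for i in range(4)]

def async_rollout(n_chunks=3, hz=100):
    """Hide inference latency by overlapping it with execution:
    while the robot steps through the current chunk at the control
    rate, a background thread predicts the next chunk."""
    executed = []
    chunk = fake_policy(0)  # bootstrap with a first chunk
    for _ in range(n_chunks):
        result = {}
        t = threading.Thread(
            target=lambda: result.update(next=fake_policy(chunk[-1])))
        t.start()                # inference runs concurrently...
        for a in chunk:          # ...while actions execute at `hz`
            executed.append(a)
            time.sleep(1.0 / hz)
        t.join()                 # next chunk is ready (or nearly so)
        chunk = result["next"]
    return executed
```

With 4 actions at 100 Hz (40 ms per chunk) and 50 ms of simulated latency, the executor only stalls for the ~10 ms remainder at each chunk boundary instead of the full inference time; with a longer chunk horizon the stall disappears entirely.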

Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques to train the VLA model for asynchronous execution, mitigating inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 can roll out quickly and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io
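One common way to realize the chunk-timestep alignment described in the abstract is to drop the leading actions of a newly arrived chunk that the robot has already stepped past during inference, so consecutive chunks splice together without a jump. This is an illustrative sketch under that assumption, not the paper's exact procedure; `align_chunk` is a hypothetical name:

```python
def align_chunk(new_chunk, t_predicted, t_now):
    """Align a freshly predicted action chunk to the current timestep.

    The chunk was predicted from an observation taken at `t_predicted`,
    but by the time it arrives the robot is already at `t_now`. The
    first (t_now - t_predicted) actions correspond to timesteps that
    have already been executed from the previous chunk, so they are
    skipped; execution continues seamlessly from the remainder.
    (Illustrative sketch, not the paper's exact alignment rule.)
    """
    skip = t_now - t_predicted
    return new_chunk[skip:]
```

For example, a 5-action chunk predicted from the observation at timestep 10 that arrives at timestep 13 has its first three actions discarded, and the rollout resumes at the action intended for timestep 13.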