SPEAR-1: Scaling Beyond Robot Demonstrations via 3D Understanding

arXiv cs.RO / 4/28/2026


Key Points

  • The paper argues that robotic foundation models (RFMs) generalize poorly because they are often fine-tuned from internet-trained 2D vision-language models that lack inherent 3D spatial reasoning needed for embodied control.
  • Instead of scaling expensive robot data, it proposes enriching easy-to-collect non-robotic image data with 3D annotations and upgrading a pretrained VLM with 3D understanding.
  • The authors train SPEAR-VLM, a 3D-aware VLM that predicts 3D object coordinates from a single 2D image, and then build SPEAR-1 by combining grounded 3D perception with language-instructed embodied control.
  • SPEAR-1 is trained on ~45M frames from 24 Open X-Embodiment datasets and reportedly matches or exceeds state-of-the-art models (e.g., π0-FAST and π0.5) while requiring about 20× fewer robot demonstrations.
  • The model weights and the 3D-annotated datasets are released publicly to support further research and replication.

Abstract

Robotic Foundation Models (RFMs) hold great promise as generalist, end-to-end systems for robot control. Yet their ability to generalize across new environments, tasks, and embodiments remains limited. We argue that a major bottleneck lies in their foundations: most RFMs are built by fine-tuning internet-pretrained Vision-Language Models (VLMs). However, these VLMs are trained on 2D image-language tasks and lack the 3D spatial reasoning inherently required for embodied control in the 3D world. Bridging this gap directly with large-scale robotic data is costly and difficult to scale. Instead, we propose to enrich easy-to-collect non-robotic image data with 3D annotations and enhance a pretrained VLM with 3D understanding capabilities. Following this strategy, we train SPEAR-VLM, a 3D-aware VLM that infers object coordinates in 3D space from a single 2D image. Building on SPEAR-VLM, we introduce our main contribution, **SPEAR-1**: a robotic foundation model that integrates grounded 3D perception with language-instructed embodied control. Trained on ~45M frames from 24 Open X-Embodiment datasets, SPEAR-1 outperforms or matches state-of-the-art models such as π0-FAST and π0.5, while using 20× fewer robot demonstrations. This carefully engineered training strategy unlocks new VLM capabilities and, as a consequence, boosts the reliability of embodied control beyond what is achievable with robotic data alone. We make our model weights and 3D-annotated datasets publicly available at https://spear.insait.ai.
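The paper does not specify how its 3D annotations are produced, but the kind of 3D grounding it describes (mapping a 2D image location to a point in 3D space) is conventionally expressed via the pinhole camera model. The sketch below is purely illustrative and not taken from SPEAR-1: given a pixel, a metric depth value, and the camera intrinsics (`fx`, `fy`, `cx`, `cy` are assumed example values), it back-projects the pixel into camera-frame 3D coordinates, the type of target a 3D-aware VLM could be trained to predict.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into 3D camera coordinates.

    Pinhole camera model: u = fx * X / Z + cx, v = fy * Y / Z + cy,
    solved for (X, Y) given Z = depth. All quantities are hypothetical
    illustrations, not values or APIs from the SPEAR-1 paper.
    """
    x = (u - cx) * depth / fx  # horizontal offset from optical axis, in meters
    y = (v - cy) * depth / fy  # vertical offset from optical axis, in meters
    return (x, y, depth)       # Z axis points along the camera's viewing direction

# Example: a pixel in a 640x480 image with the principal point at the center
# and an object 2 m away from the camera.
point_3d = backproject(u=400, v=300, depth=2.0,
                       fx=500.0, fy=500.0, cx=320.0, cy=240.0)
# point_3d == (0.32, 0.24, 2.0)
```

Annotating ordinary 2D images this way (e.g. with depth from an off-the-shelf estimator) is what makes the non-robotic data "easy to collect" relative to teleoperated robot demonstrations.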