From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

arXiv cs.RO / 4/7/2026


Key Points

  • The paper introduces FSD (From Seeing to Doing), a vision-language model designed to improve generalization for robotic manipulation in unseen scenarios and novel tasks.
  • Unlike typical Vision-Language-Action approaches, FSD generates intermediate representations via spatial relationship reasoning, providing fine-grained guidance for physical manipulation (see the sketch after this list).
  • The method uses a hierarchical training data pipeline and a self-consistency mechanism to align spatial coordinates with visual signals, aiming to reduce failures caused by limited and heterogeneous embodied datasets.
  • Experiments validate strong performance on eight benchmarks for general spatial reasoning and embodied reference, as well as on VABench, a more challenging benchmark the authors propose.
  • For robot manipulation, the authors report significant zero-shot gains: a 40.6% success rate in SimplerEnv and a 72% success rate across eight real-world tasks, beating the strongest baseline by 30%.
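
The "seeing → doing" split in the key points can be pictured as a two-stage loop: a reasoner emits an intermediate spatial representation, and a separate executor turns it into motion. Below is a minimal sketch of that structure, assuming the intermediate representation takes the form of named 2D keypoints; all names (`SpatialPlan`, `reason_spatial_plan`, `execute_plan`) are hypothetical illustrations, not FSD's actual interfaces, and the stub bodies stand in for the real VLM query and controller.

```python
# Hypothetical sketch of a "seeing -> doing" two-stage pipeline.
from dataclasses import dataclass


@dataclass
class SpatialPlan:
    """Intermediate representation: task-relevant points in image space."""
    grasp_xy: tuple[float, float]  # where to grasp, in pixels
    place_xy: tuple[float, float]  # where to place, in pixels


def reason_spatial_plan(image, instruction: str) -> SpatialPlan:
    """Stage 1 ("seeing"): spatial-relationship reasoning over the image.
    A real system would query the VLM here; this stub returns fixed points."""
    return SpatialPlan(grasp_xy=(312.0, 188.0), place_xy=(96.0, 240.0))


def execute_plan(plan: SpatialPlan) -> list[str]:
    """Stage 2 ("doing"): convert image-space points into a primitive
    action sequence for the manipulator (placeholder skill names)."""
    return [
        f"move_above {plan.grasp_xy}", "grasp",
        f"move_above {plan.place_xy}", "release",
    ]


if __name__ == "__main__":
    plan = reason_spatial_plan(image=None, instruction="put the cup on the plate")
    print(execute_plan(plan))
```

The point of the split is the interface: because stage 2 consumes only image-space coordinates, the spatial reasoner can be retrained or swapped without touching the low-level controller, which is what lets a general VLM transfer to unseen scenes.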

Abstract

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical training data pipeline with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference, as well as on our proposed, more challenging benchmark VABench. We also verify FSD's zero-shot capabilities in robot manipulation, demonstrating significant improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
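
The abstract does not spell out how the self-consistency mechanism aligns coordinates with visual signals, so the following is only one plausible reading, sketched under assumptions: sample several coordinate predictions for the same query and accept a result only when the samples agree, rejecting scattered answers. `predict_point` is a hypothetical stand-in for the model call, and the centroid-based aggregation rule is an assumption, not the paper's stated mechanism.

```python
# A minimal sketch of sampling-based self-consistency for 2D keypoints.
import math
import random
from typing import Callable, Optional

Point = tuple[float, float]


def self_consistent_point(
    predict_point: Callable[[], Point],
    n_samples: int = 5,
    max_spread: float = 20.0,
) -> Optional[Point]:
    """Sample n predictions; return the one nearest their centroid,
    or None if any sample strays more than max_spread pixels from it."""
    samples = [predict_point() for _ in range(n_samples)]
    cx = sum(p[0] for p in samples) / n_samples
    cy = sum(p[1] for p in samples) / n_samples
    dists = [math.hypot(x - cx, y - cy) for x, y in samples]
    if max(dists) > max_spread:
        return None  # predictions inconsistent; caller should re-query
    return samples[min(range(n_samples), key=dists.__getitem__)]


if __name__ == "__main__":
    # Toy stochastic predictor standing in for the VLM.
    random.seed(0)
    noisy = lambda: (100 + random.gauss(0, 3), 240 + random.gauss(0, 3))
    print(self_consistent_point(noisy))
```

Under this reading, disagreement between samples acts as a cheap visual-grounding check: coordinates the model cannot reproduce consistently are treated as ungrounded and discarded rather than passed to the manipulation stage.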