From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

arXiv cs.RO / 4/7/2026


Key Points

  • The paper introduces FSD (From Seeing to Doing), a vision-language model designed to improve generalization for robotic manipulation in unseen scenarios and novel tasks.
  • Unlike typical Vision-Language-Action approaches, FSD generates intermediate representations via spatial relationship reasoning, providing fine-grained guidance for physical manipulation (see the sketch after this list).
  • The method uses a hierarchical training data pipeline and a self-consistency mechanism to align spatial coordinates with visual signals, aiming to reduce failures caused by limited and heterogeneous embodied datasets.
  • Experiments validate strong performance on eight benchmarks for general spatial reasoning and embodied reference, as well as on VABench, a more challenging benchmark the authors propose.
  • For robot manipulation, the authors report significant zero-shot gains: a 40.6% success rate in SimplerEnv and a 72% success rate across eight real-world tasks, beating the strongest baseline by 30%.
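
The "seeing → doing" split in the key points can be pictured as a two-stage loop: a reasoner emits an intermediate spatial representation, and a separate executor turns it into motion. Below is a minimal sketch of that structure, assuming the intermediate representation takes the form of named 2D keypoints; all names (`SpatialPlan`, `reason_spatial_plan`, `execute_plan`) are hypothetical illustrations, not FSD's actual interfaces, and the stub bodies stand in for the real VLM query and controller.

```python
# Hypothetical sketch of a "seeing -> doing" two-stage pipeline.
from dataclasses import dataclass


@dataclass
class SpatialPlan:
    """Intermediate representation: task-relevant points in image space."""
    grasp_xy: tuple[float, float]  # where to grasp, in pixels
    place_xy: tuple[float, float]  # where to place, in pixels


def reason_spatial_plan(image, instruction: str) -> SpatialPlan:
    """Stage 1 ("seeing"): spatial-relationship reasoning over the image.
    A real system would query the VLM here; this stub returns fixed points."""
    return SpatialPlan(grasp_xy=(312.0, 188.0), place_xy=(96.0, 240.0))


def execute_plan(plan: SpatialPlan) -> list[str]:
    """Stage 2 ("doing"): convert image-space points into a primitive
    action sequence for the manipulator (placeholder skill names)."""
    return [
        f"move_above {plan.grasp_xy}", "grasp",
        f"move_above {plan.place_xy}", "release",
    ]


if __name__ == "__main__":
    plan = reason_spatial_plan(image=None, instruction="put the cup on the plate")
    print(execute_plan(plan))
```

The point of the split is the interface: because stage 2 consumes only image-space coordinates, the spatial reasoner can be retrained or swapped without touching the low-level controller, which is what lets a general VLM transfer to unseen scenes.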

Abstract

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical training data pipeline with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD's capabilities in both "seeing" and "doing," achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference, as well as on our proposed, more challenging benchmark VABench. We also verify FSD's zero-shot capabilities in robot manipulation, demonstrating significant improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.
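
The abstract does not spell out how the self-consistency mechanism aligns coordinates with visual signals, so the following is only one plausible reading, sketched under assumptions: sample several coordinate predictions for the same query and accept a result only when the samples agree, rejecting scattered answers. `predict_point` is a hypothetical stand-in for the model call, and the centroid-based aggregation rule is an assumption, not the paper's stated mechanism.

```python
# A minimal sketch of sampling-based self-consistency for 2D keypoints.
import math
import random
from typing import Callable, Optional

Point = tuple[float, float]


def self_consistent_point(
    predict_point: Callable[[], Point],
    n_samples: int = 5,
    max_spread: float = 20.0,
) -> Optional[Point]:
    """Sample n predictions; return the one nearest their centroid,
    or None if any sample strays more than max_spread pixels from it."""
    samples = [predict_point() for _ in range(n_samples)]
    cx = sum(p[0] for p in samples) / n_samples
    cy = sum(p[1] for p in samples) / n_samples
    dists = [math.hypot(x - cx, y - cy) for x, y in samples]
    if max(dists) > max_spread:
        return None  # predictions inconsistent; caller should re-query
    return samples[min(range(n_samples), key=dists.__getitem__)]


if __name__ == "__main__":
    # Toy stochastic predictor standing in for the VLM.
    random.seed(0)
    noisy = lambda: (100 + random.gauss(0, 3), 240 + random.gauss(0, 3))
    print(self_consistent_point(noisy))
```

Under this reading, disagreement between samples acts as a cheap visual-grounding check: coordinates the model cannot reproduce consistently are treated as ungrounded and discarded rather than passed to the manipulation stage.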