From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
arXiv cs.RO / 4/7/2026
Key Points
- The paper introduces FSD (From Seeing to Doing), a vision-language model designed to improve generalization for robotic manipulation in unseen scenarios and novel tasks.
- Unlike typical Vision-Language-Action (VLA) approaches that map observations and instructions directly to actions, FSD first generates intermediate representations via spatial relationship reasoning, providing fine-grained guidance for physical manipulation (see the first sketch after this list).
- The method uses a hierarchical training data pipeline and a self-consistency mechanism that aligns predicted spatial coordinates with visual signals, aiming to reduce failures caused by limited and heterogeneous embodied datasets (see the second sketch after this list).
- Experiments show strong performance on eight benchmarks for general spatial reasoning and embodied reference, as well as on the more challenging VABench.
- For robot manipulation, the authors report significant zero-shot gains: a 40.6% success rate in SimplerEnv and a 72% success rate across eight real-world tasks, exceeding the strongest baseline by 30%.
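The article gives no implementation details, so the following is a minimal Python sketch of what a two-stage "seeing to doing" pipeline of this kind could look like: a vision-language model reasons over the scene and emits an intermediate spatial representation (here, named 2D keypoints), which then conditions a low-level manipulation policy. All names (`SpatialVLM`, `ManipulationPolicy`, `Keypoint`) are hypothetical stand-ins, not the paper's actual API.

```python
# Hypothetical sketch of an FSD-style two-stage pipeline: a VLM first reasons
# about spatial relationships and emits an intermediate representation
# (2D keypoints with roles), which then conditions the low-level policy.
# Every class and method name here is illustrative, not the paper's API.
from dataclasses import dataclass


@dataclass
class Keypoint:
    name: str   # e.g. "grasp_point", "place_target"
    x: float    # pixel column in the input image
    y: float    # pixel row in the input image


class SpatialVLM:
    """Stand-in for the vision-language model that performs
    spatial-relationship reasoning over the scene image."""

    def predict_keypoints(self, image, instruction: str) -> list[Keypoint]:
        # In the real system this is a learned model; here we return
        # a fixed answer so the sketch runs end to end.
        return [Keypoint("grasp_point", 212.0, 148.0),
                Keypoint("place_target", 340.0, 190.0)]


class ManipulationPolicy:
    """Stand-in low-level controller conditioned on the keypoints."""

    def act(self, image, keypoints: list[Keypoint]) -> list[float]:
        # Map the predicted pixel targets to a (stubbed) action vector.
        grasp = next(k for k in keypoints if k.name == "grasp_point")
        return [grasp.x, grasp.y, 0.0]  # e.g. target pose parameters


def run_step(image, instruction: str) -> list[float]:
    vlm, policy = SpatialVLM(), ManipulationPolicy()
    keypoints = vlm.predict_keypoints(image, instruction)  # seeing -> reasoning
    return policy.act(image, keypoints)                    # reasoning -> doing


print(run_step(image=None, instruction="put the red block in the bowl"))
```

The split into two explicit stages is just to make the "seeing → reasoning → doing" flow visible; in the actual model the intermediate representation is decoded by the VLM itself.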
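The self-consistency mechanism is described only at a high level. A common generic recipe for such checks is to sample the model several times and keep only coordinate predictions that agree; the sketch below illustrates that recipe, with `sample_coordinate` as a hypothetical stand-in for one stochastic VLM decode. It is not the paper's exact mechanism.

```python
# Hedged sketch of a self-consistency check over spatial coordinates:
# sample the model several times, keep predictions that agree within a
# pixel tolerance, and return their median as the aligned coordinate.
# This is a generic self-consistency recipe, not the paper's mechanism.
import random
import statistics


def sample_coordinate(seed: int) -> tuple[float, float]:
    # Stand-in for one stochastic VLM decode; replace with real model calls.
    rng = random.Random(seed)
    return (212.0 + rng.gauss(0, 3.0), 148.0 + rng.gauss(0, 3.0))


def self_consistent_coordinate(n_samples: int = 8, tol: float = 10.0):
    samples = [sample_coordinate(s) for s in range(n_samples)]
    mx = statistics.median(x for x, _ in samples)
    my = statistics.median(y for _, y in samples)
    # Keep only samples that agree with the consensus within the tolerance.
    inliers = [(x, y) for x, y in samples
               if abs(x - mx) <= tol and abs(y - my) <= tol]
    if len(inliers) < n_samples // 2:
        return None  # predictions disagree; flag for re-query or fallback
    return (statistics.median(x for x, _ in inliers),
            statistics.median(y for _, y in inliers))


print(self_consistent_coordinate())
```

Returning `None` when fewer than half the samples agree models the useful failure mode: a coordinate that the model cannot predict consistently is rejected rather than passed to the robot.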