GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation

Microsoft Research Blog / 3/27/2026


Key Points

  • The article describes challenges in using vision-language models (VLMs) for robot manipulation: over long horizons, models struggle both to choose which actions to take and to determine where to take them.
  • It notes that many existing systems decouple planning from execution by having a VLM output a natural-language plan that a separate model then converts into executable actions, a handoff where failures often arise.
  • It introduces “GroundedPlanBench,” a spatially grounded approach and dataset intended to improve long-horizon task planning by tying plans to spatial information rather than relying only on language-level instructions.
  • The focus is on advancing end-to-end grounding for robot manipulation, improving the reliability of action selection and spatial placement.

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks […]
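The two-stage split described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the contrast the post draws, not code from the paper: every class, function, and coordinate here is an assumption made for illustration. In the decoupled design, a second model must infer *where* each language step applies; in a spatially grounded plan, each step already carries its target location.

```python
from dataclasses import dataclass


@dataclass
class GroundedStep:
    """One plan step with the 3D position it applies to.

    Hypothetical structure for illustration only.
    """
    action: str                          # e.g. "pick" or "place"
    target: tuple[float, float, float]   # where the action is applied


def decoupled_pipeline(language_plan: list[str]) -> list[GroundedStep]:
    """Stage 2 of a decoupled system: a separate model must guess
    coordinates for each natural-language step. That guessing step is
    where the failures described above tend to arise."""
    steps = []
    for instruction in language_plan:
        # A real system would run a grounding model here; we stub it
        # with a fixed placeholder to show the interface.
        guessed_target = (0.0, 0.0, 0.0)
        steps.append(GroundedStep(action=instruction.split()[0],
                                  target=guessed_target))
    return steps


def grounded_plan() -> list[GroundedStep]:
    """Spatially grounded planning: each step carries the location it
    refers to, so no second translation model is needed."""
    return [
        GroundedStep("pick", (0.42, -0.10, 0.05)),
        GroundedStep("place", (0.30, 0.25, 0.05)),
    ]
```

The design difference is in the interface: the decoupled pipeline passes free-form strings between models and recovers spatial targets afterward, while the grounded plan keeps action and location together from the start.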

The post GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation appeared first on Microsoft Research.