Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

arXiv cs.RO / 5/6/2026


Key Points

  • The paper introduces an end-to-end, language-guided grasping pipeline for mobile legged manipulators operating in cluttered scenes where occlusions cause partial observations and unreliable depth.
  • It links open-vocabulary target selection from a natural-language command to safe real-robot grasp execution, combining RGB grounding (open-vocabulary detection and promptable instance segmentation) with object-centric point-cloud extraction from RGB-D.
  • To handle occlusion-related geometric failures, the method applies back-projected depth compensation and a two-stage point-cloud completion process before generating grasp candidates.
  • It then produces and filters 6-DoF grasp candidates with collision checking and safety-oriented heuristics focused on reachability, approach feasibility, and clearance.
  • Experiments on a quadruped robot with an arm in two cluttered tabletop setups show 90% overall success (9/10) versus 30% (3/10) for a view-dependent baseline, highlighting robustness to partial observations.
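The object-centric point-cloud extraction mentioned in the key points reduces to back-projecting the depth pixels selected by the instance mask through the pinhole camera model. A minimal sketch of that step is below; the function name, inputs, and toy values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into an object-centric point cloud.

    depth : (H, W) metric depth in metres (0 = missing measurement).
    mask  : (H, W) bool array from the instance segmenter (hypothetical input).
    fx, fy, cx, cy : pinhole camera intrinsics.
    Returns an (N, 3) array of points in the camera frame.
    """
    v, u = np.nonzero(mask & (depth > 0))   # valid pixels inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx                   # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# toy example: 4x4 depth image with a 2x2 object mask
depth = np.ones((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
pts = mask_to_point_cloud(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(pts.shape)  # (4, 3)
```

The same back-projection is what makes the paper's depth-compensation step possible: pixels with missing depth inside the mask can be identified in image space before the two-stage completion fills in the geometry.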

Abstract

Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
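As a rough illustration of the safety-oriented candidate selection the abstract describes, a heuristic scorer over collision-filtered 6-DoF candidates might look like the sketch below. All dictionary keys, weights, and thresholds are hypothetical, chosen only to show the reachability / approach-feasibility / clearance trade-off, not the paper's values.

```python
import numpy as np

def select_grasp(candidates, reach_max=0.85, min_clearance=0.03):
    """Rank collision-free 6-DoF grasp candidates with simple safety heuristics.

    Each candidate is a dict with (hypothetical) keys:
      'position'  : (3,) grasp centre in the robot base frame,
      'approach'  : (3,) unit approach direction,
      'clearance' : min distance (m) to surrounding clutter after collision checks.
    """
    best, best_score = None, -np.inf
    for g in candidates:
        reach = np.linalg.norm(g['position'])
        if reach > reach_max or g['clearance'] < min_clearance:
            continue  # unreachable or too close to clutter: discard
        # prefer top-down-ish approaches (negative z) and generous clearance,
        # lightly penalising grasps far from the base (illustrative weights)
        top_down = -g['approach'][2]
        score = 1.0 * top_down + 5.0 * g['clearance'] - 0.5 * reach
        if score > best_score:
            best, best_score = g, score
    return best

cands = [
    {'position': np.array([0.5, 0.0, 0.2]),
     'approach': np.array([0.0, 0.0, -1.0]), 'clearance': 0.05},
    {'position': np.array([1.2, 0.0, 0.2]),   # beyond reach_max: filtered out
     'approach': np.array([0.0, 0.0, -1.0]), 'clearance': 0.10},
]
chosen = select_grasp(cands)
print(chosen['clearance'])  # 0.05: the reachable candidate wins
```

In the paper's pipeline such a scorer would run only after collision checking, so the heuristics arbitrate among already-feasible grasps rather than substituting for geometric validation.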