Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

arXiv cs.RO / 5/6/2026


Key Points

  • The paper introduces an end-to-end, language-guided grasping pipeline for mobile legged manipulators operating in cluttered scenes where occlusions cause partial observations and unreliable depth.
  • It links open-vocabulary target selection from a natural-language command to safe real-robot grasp execution, combining RGB grounding (open-vocabulary detection and promptable instance segmentation) with object-centric point-cloud extraction from RGB-D.
  • To handle occlusion-related geometric failures, the method applies back-projected depth compensation and a two-stage point-cloud completion process before generating grasp candidates.
  • It then produces and filters 6-DoF grasp candidates with collision checking and safety-oriented heuristics focused on reachability, approach feasibility, and clearance.
  • Experiments on a quadruped robot with an arm in two cluttered tabletop setups show 90% overall success (9/10) versus 30% (3/10) for a view-dependent baseline, highlighting robustness to partial observations.
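The object-centric point-cloud extraction mentioned in the key points reduces to back-projecting the depth pixels selected by the instance mask through the pinhole camera model. A minimal sketch of that step is below; the function name, inputs, and toy values are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def mask_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into an object-centric point cloud.

    depth : (H, W) metric depth in metres (0 = missing measurement).
    mask  : (H, W) bool array from the instance segmenter (hypothetical input).
    fx, fy, cx, cy : pinhole camera intrinsics.
    Returns an (N, 3) array of points in the camera frame.
    """
    v, u = np.nonzero(mask & (depth > 0))   # valid pixels inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx                   # standard pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# toy example: 4x4 depth image with a 2x2 object mask
depth = np.ones((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
pts = mask_to_point_cloud(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(pts.shape)  # (4, 3)
```

The same back-projection is what makes the paper's depth-compensation step possible: pixels with missing depth inside the mask can be identified in image space before the two-stage completion fills in the geometry.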

Abstract

Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions that lead to partial observations, unreliable depth estimates, and the need for collision-free, execution-feasible approaches. In this paper we present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable instance segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a quadruped robot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% overall success rate (9/10) against 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
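As a rough illustration of the safety-oriented candidate selection the abstract describes, a heuristic scorer over collision-filtered 6-DoF candidates might look like the sketch below. All dictionary keys, weights, and thresholds are hypothetical, chosen only to show the reachability / approach-feasibility / clearance trade-off, not the paper's values.

```python
import numpy as np

def select_grasp(candidates, reach_max=0.85, min_clearance=0.03):
    """Rank collision-free 6-DoF grasp candidates with simple safety heuristics.

    Each candidate is a dict with (hypothetical) keys:
      'position'  : (3,) grasp centre in the robot base frame,
      'approach'  : (3,) unit approach direction,
      'clearance' : min distance (m) to surrounding clutter after collision checks.
    """
    best, best_score = None, -np.inf
    for g in candidates:
        reach = np.linalg.norm(g['position'])
        if reach > reach_max or g['clearance'] < min_clearance:
            continue  # unreachable or too close to clutter: discard
        # prefer top-down-ish approaches (negative z) and generous clearance,
        # lightly penalising grasps far from the base (illustrative weights)
        top_down = -g['approach'][2]
        score = 1.0 * top_down + 5.0 * g['clearance'] - 0.5 * reach
        if score > best_score:
            best, best_score = g, score
    return best

cands = [
    {'position': np.array([0.5, 0.0, 0.2]),
     'approach': np.array([0.0, 0.0, -1.0]), 'clearance': 0.05},
    {'position': np.array([1.2, 0.0, 0.2]),   # beyond reach_max: filtered out
     'approach': np.array([0.0, 0.0, -1.0]), 'clearance': 0.10},
]
chosen = select_grasp(cands)
print(chosen['clearance'])  # 0.05: the reachable candidate wins
```

In the paper's pipeline such a scorer would run only after collision checking, so the heuristics arbitrate among already-feasible grasps rather than substituting for geometric validation.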