SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces

arXiv cs.RO / 4/21/2026


Key Points

  • SpaceDex addresses the difficulty of generalizable dexterous grasping in tiered workspaces by explicitly handling occlusion, narrow clearances, and height-dependent constraints that are often ignored in prior methods.
  • The system uses a hierarchical approach: a Vision-Language Model planner infers user intent, reasons about spatial relationships across multiple camera views, and outputs bounding boxes to enable zero-shot segmentation and mask tracking.
  • For control, SpaceDex introduces an arm-hand Feature Separation Network that decouples arm trajectory planning from hand grasp-mode selection to reduce interference between reaching and grasping behaviors.
  • The full controller combines multi-view perception, fingertip tactile sensing, and a small set of recovery demonstrations to improve robustness under partial observability and unexpected contact.
  • In 100 real-world trials across 30+ unseen objects (four categories), SpaceDex achieves a 63.0% success rate versus 39.0% for a strong tabletop baseline, showing clear gains in constrained 3D settings.
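To put the reported gap in perspective, the two success rates can be compared directly. The following is a minimal sketch that uses only the 63.0% and 39.0% figures quoted above; the function name `compare_success_rates` is illustrative and does not come from the paper.

```python
def compare_success_rates(method_rate: float, baseline_rate: float) -> dict:
    """Compare two success rates given as fractions in [0, 1]."""
    difference = method_rate - baseline_rate
    return {
        # Absolute gap, expressed in percentage points.
        "absolute_gain_pp": round(difference * 100, 1),
        # Gap relative to the baseline's own success rate, in percent.
        "relative_gain_pct": round(difference / baseline_rate * 100, 1),
    }


# Rates as reported in the summary: SpaceDex 63.0%, tabletop baseline 39.0%.
comparison = compare_success_rates(0.63, 0.39)
print(comparison)
```

Under these numbers the gap is 24 percentage points, which is roughly a 61.5% improvement relative to the baseline's own success rate.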

Abstract

Generalizable grasping with high-degree-of-freedom (DoF) dexterous hands remains challenging in tiered workspaces, where occlusion, narrow clearances, and height-dependent constraints are substantially stronger than in open tabletop scenes. Most existing methods are evaluated in relatively unoccluded settings and typically do not explicitly model the distinct control requirements of arm navigation and hand articulation under spatial constraints. We present SpaceDex, a hierarchical framework for dexterous manipulation in constrained 3D environments. At the high level, a Vision-Language Model (VLM) planner parses user intent, reasons about occlusion and height relations across multiple camera views, and generates target bounding boxes for zero-shot segmentation and mask tracking. This stage provides structured spatial guidance for downstream control instead of relying on single-view target selection. At the low level, we introduce an arm-hand Feature Separation Network that decouples global trajectory control for the arm from geometry-aware grasp mode selection for the hand, reducing feature interference between reaching and grasping objectives. The controller further integrates multi-view perception, fingertip tactile sensing, and a small set of recovery demonstrations to improve robustness to partial observability and off-nominal contacts. In 100 real-world trials involving over 30 unseen objects across four categories, SpaceDex achieves a 63.0% success rate, compared with 39.0% for a strong tabletop baseline. These results indicate that combining hierarchical spatial planning with arm-hand representation decoupling improves dexterous grasping performance in spatially constrained environments.
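The arm-hand decoupling described in the abstract can be illustrated with a toy policy in which the arm and the hand each get their own private feature branch. This is only a sketch of the general idea, not the authors' Feature Separation Network: the class name `ArmHandSeparatedPolicy`, the layer sizes, the 7-dimensional arm command, the four grasp modes, and the randomly initialized weights are all assumptions made for illustration.

```python
import math
import random


def make_mlp(rng, in_dim, hidden_dim, out_dim):
    """Create private weights for one two-layer branch (hypothetical sizes)."""
    w1 = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(in_dim)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(out_dim)] for _ in range(hidden_dim)]
    return {"w1": w1, "w2": w2}


def matvec(x, w):
    """Multiply a length-n vector by an n x m weight matrix."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def run_mlp(params, x):
    """Forward pass: linear layer, tanh, linear layer."""
    hidden = [math.tanh(v) for v in matvec(x, params["w1"])]
    return matvec(hidden, params["w2"])


class ArmHandSeparatedPolicy:
    """Toy policy with no hidden features shared between arm and hand."""

    def __init__(self, obs_dim=16, arm_dim=7, num_grasp_modes=4, seed=0):
        rng = random.Random(seed)
        # Each branch owns its weights, so the two objectives cannot interfere.
        self.arm_branch = make_mlp(rng, obs_dim, 32, arm_dim)
        self.hand_branch = make_mlp(rng, obs_dim, 32, num_grasp_modes)

    def act(self, obs):
        arm_command = run_mlp(self.arm_branch, obs)   # continuous arm output
        mode_scores = run_mlp(self.hand_branch, obs)  # one score per grasp mode
        grasp_mode = max(range(len(mode_scores)), key=mode_scores.__getitem__)
        return arm_command, grasp_mode


policy = ArmHandSeparatedPolicy()
obs = [0.1 * i for i in range(16)]  # placeholder observation features
arm_command, grasp_mode = policy.act(obs)
print(len(arm_command), grasp_mode)
```

Because the branches share no parameters, changing the hand branch leaves the arm output untouched, which is the property the abstract attributes to decoupling reaching from grasping. A real system would train both branches on perception and tactile features rather than use random weights.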