DextER: Language-driven Dexterous Grasp Generation with Embodied Reasoning

arXiv cs.RO / 4/28/2026



Key Points

DextER is a language-driven model for generating dexterous multi-finger grasps that explicitly reasons about physical contact and hand–object interactions rather than directly mapping visual inputs to grasp parameters.
The method introduces an embodiment-aware intermediate representation: it predicts which finger links contact which parts of the object surface, autoregressively generating contact tokens first and grasp configuration tokens second.
Experiments on DexGYS show strong performance, reaching a 67.14% success rate and outperforming prior state of the art by 3.83 percentage points.
DextER also reports a 96.4% improvement in intention alignment and supports steerable generation: specifying part of the contact pattern gives finer control over the synthesized grasp.
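
The success-rate figures above mix an absolute rate with a gain in percentage points, so a short arithmetic check may help. The sketch below derives the prior state-of-the-art success rate implied by the two reported numbers and converts the gain into a relative percentage. The function names are hypothetical, and the implied baseline is a derived value, not one stated in the summary. The 96.4% intention-alignment figure is left out because its underlying metric and baseline are not given here.

```python
# Reported in the summary: DextER success rate and its gain over prior work.
DEXTER_SUCCESS = 67.14  # percent
GAIN_PP = 3.83          # percentage points


def implied_baseline(success: float, gain_pp: float) -> float:
    """Prior success rate implied by an absolute gain in percentage points."""
    return round(success - gain_pp, 2)


def relative_gain_percent(success: float, gain_pp: float) -> float:
    """The same gain expressed relative to the implied baseline, in percent."""
    baseline = success - gain_pp
    return round(100.0 * gain_pp / baseline, 2)


if __name__ == "__main__":
    print(implied_baseline(DEXTER_SUCCESS, GAIN_PP))       # implied prior rate
    print(relative_gain_percent(DEXTER_SUCCESS, GAIN_PP))  # relative gain
```

Under these numbers the implied prior success rate is 63.31%, and the 3.83 percentage-point gain corresponds to a relative improvement of roughly 6%.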

Abstract

Language-driven dexterous grasp generation requires models to understand task semantics, 3D geometry, and complex hand-object interactions. While vision-language models have been applied to this problem, existing approaches directly map observations to grasp parameters without intermediate reasoning about physical interactions. We present DextER, Dexterous Grasp Generation with Embodied Reasoning, which introduces contact-based embodied reasoning for multi-finger manipulation. Our key insight is that predicting which hand links make contact, and where on the object surface, provides an embodiment-aware intermediate representation that bridges task semantics and physical constraints. DextER autoregressively generates embodied contact tokens specifying these link-to-surface contacts, followed by grasp tokens encoding the hand configuration. On DexGYS, DextER achieves a 67.14% success rate, outperforming the state of the art by 3.83 percentage points, with a 96.4% improvement in intention alignment. We also demonstrate steerable generation through partial contact specification, providing fine-grained control over grasp synthesis.
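
The generation order described in the abstract (contact tokens first, grasp tokens second, with optional user-specified contacts) can be illustrated with a toy decoding loop. This is a minimal sketch under loud assumptions, not the DextER implementation: the vocabulary sizes, the one-patch-per-link token layout, the `NO_CONTACT` token, and the random `toy_next_token` stand-in for a learned model are all hypothetical.

```python
import random

# All sizes below are illustrative assumptions, not values from the paper.
NUM_LINKS = 5          # hand links considered, one contact token per link
NUM_PATCHES = 32       # discretized object-surface regions
NO_CONTACT = NUM_PATCHES  # extra token id meaning "this link touches nothing"
NUM_GRASP_TOKENS = 6   # tokens encoding the hand configuration
NUM_GRASP_BINS = 100   # quantization bins per grasp token


def toy_next_token(prefix, vocab_size, rng):
    """Stand-in for a learned model: ignores the prefix and samples uniformly."""
    return rng.randrange(vocab_size)


def generate_grasp(instruction, fixed_contacts=None, seed=0):
    """Decode contact tokens, then grasp tokens, into one growing prefix.

    fixed_contacts maps link index -> patch id; those tokens are forced
    instead of sampled, which is one simple way to steer generation.
    """
    rng = random.Random(seed)
    fixed_contacts = fixed_contacts or {}
    prefix = [("instruction", instruction)]

    # Stage 1: one contact token per link, forced where the user specified it.
    contacts = {}
    for link in range(NUM_LINKS):
        if link in fixed_contacts:
            patch = fixed_contacts[link]
        else:
            patch = toy_next_token(prefix, NUM_PATCHES + 1, rng)
        contacts[link] = patch
        prefix.append(("contact", link, patch))

    # Stage 2: grasp tokens, decoded with the full contact plan in the prefix.
    grasp = []
    for _ in range(NUM_GRASP_TOKENS):
        token = toy_next_token(prefix, NUM_GRASP_BINS, rng)
        grasp.append(token)
        prefix.append(("grasp", token))

    return contacts, grasp, prefix


if __name__ == "__main__":
    contacts, grasp, _ = generate_grasp(
        "hold the mug by its handle", fixed_contacts={0: 7, 3: 12}
    )
    print(contacts)
    print(grasp)
```

In this sketch, steering amounts to prefix forcing: the specified contacts are written into the sequence as if the model had produced them, and every later token is decoded with them in context. A real model would condition on the prefix rather than ignore it, as the random stand-in does here.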