Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation

arXiv cs.RO / 5/6/2026


Key Points

  • The paper introduces a deep learning framework to improve how quadruped robots equipped with arms grasp objects, emphasizing both precision and adaptability.
  • It uses a sim-to-real pipeline in the Genesis simulation environment to generate a large synthetic dataset of grasp attempts, producing pixel-wise grasp-quality maps as ground truth.
  • A custom CNN with a U-Net-like architecture is trained on multi-modal onboard sensing inputs (RGB, depth, segmentation masks, and surface normal maps) to output a grasp-quality heatmap.
  • The method was validated on a four-legged robot that successfully completed a loco-manipulation task end-to-end, including navigation, perception, grasp pose prediction, and executing a precise grasp.
  • The work argues that training with advanced simulated sensing can scale effectively while reducing the need for costly physical data collection.
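As a concrete illustration of the multi-modal input described above, the four sensing modalities can be stacked channel-wise into a single tensor before being fed to the network. This is a minimal sketch, not the authors' code; the image resolution and the resulting 8-channel layout (3 RGB + 1 depth + 1 mask + 3 normals) are assumptions for illustration.

```python
import numpy as np

H, W = 64, 64  # illustrative image resolution

# The four onboard sensing modalities named in the paper (dummy data here)
rgb = np.random.rand(H, W, 3)                                  # RGB image
depth = np.random.rand(H, W, 1)                                # depth map
mask = np.random.randint(0, 2, (H, W, 1)).astype(np.float64)   # segmentation mask
normals = np.random.rand(H, W, 3)                              # surface normal map

# Stack into one multi-channel input tensor for the U-Net-like CNN
x = np.concatenate([rgb, depth, mask, normals], axis=-1)
print(x.shape)  # (64, 64, 8)
```

The network would then map this 8-channel input to a single-channel grasp-quality heatmap of the same spatial resolution.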

Abstract

This paper presents a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, with a focus on improving precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras: RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap to identify the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work demonstrates that leveraging simulated training with advanced sensing offers a scalable and effective solution for object handling.
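The step from a grasp-quality heatmap to a grasp target can be sketched as follows: pick the highest-scoring pixel and back-project it to a 3D point in the camera frame using the depth map and a standard pinhole model. This is a hedged sketch, not the paper's implementation; the function name, camera intrinsics (`fx`, `fy`, `cx`, `cy`), and image size are all hypothetical.

```python
import numpy as np

def grasp_point_from_heatmap(heatmap, depth, fx, fy, cx, cy):
    """Pick the highest-scoring heatmap pixel and back-project it
    to a 3D point in the camera frame (pinhole model)."""
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # row, col
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Illustrative example: a single peak at the principal point,
# with a constant 0.5 m depth everywhere
heatmap = np.zeros((480, 640))
heatmap[240, 320] = 1.0
depth = np.full((480, 640), 0.5)
p = grasp_point_from_heatmap(heatmap, depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(p)  # [0.  0.  0.5]
```

In the full system, this camera-frame point would still need to be transformed into the arm's base frame and combined with an approach orientation to form the final grasp pose.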