Pickalo: Leveraging 6D Pose Estimation for Low-Cost Industrial Bin Picking

arXiv cs.RO / 4/7/2026

Key Points

  • Pickalo is a modular bin-picking pipeline that targets real industrial environments with heavy clutter and occlusions using only low-cost sensing hardware (a wrist-mounted RGB-D camera and stereo depth processing).
  • The system refines raw stereo streams with BridgeDepth, segments objects with a Mask-RCNN model trained solely on photorealistic synthetic data, and estimates 6D pose via the zero-shot SAM-6D approach.
  • A pose buffer module fuses multi-view observations over time to reduce pose noise while accounting for object symmetries, improving stability during continuous operation.
  • For grasping, Pickalo precomputes large antipodal grasp candidate sets offline and selects grasps online using utility-based ranking with fast collision checking.
  • Experiments on a UR5e with a parallel-jaw gripper and Intel RealSense D435i report up to 600 mean picks per hour with 96–99% grasp success, including robust performance over 30-minute runs, with ablations confirming the value of improved depth estimation and the pose buffer.
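The pose buffer idea in the key points can be illustrated with a small sketch. This is not Pickalo's implementation: the class `PoseBuffer`, its methods, the sliding-window length, and the sign-aligned quaternion mean are all assumptions chosen for brevity. The sketch assumes each object has a finite list of symmetry rotations, given as unit quaternions in the object frame, and that successive observations of the same object are already associated with one another and expressed in a common frame.

```python
import math
from collections import deque

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def normalize(q):
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class PoseBuffer:
    """Hypothetical sliding-window pose filter for one object instance."""

    def __init__(self, symmetries, maxlen=10):
        # symmetries: unit quaternions in the object frame, identity included.
        self.symmetries = [normalize(s) for s in symmetries]
        self.obs = deque(maxlen=maxlen)

    def add(self, translation, quat):
        quat = normalize(quat)
        if self.obs:
            ref = self.obs[0][1]
            # A pose R and R*S are equivalent for any symmetry S, so keep
            # the equivalent orientation closest to the buffered reference.
            candidates = [normalize(quat_mul(quat, s)) for s in self.symmetries]
            quat = max(candidates, key=lambda q: abs(dot(q, ref)))
            # q and -q encode the same rotation; align signs before averaging.
            if dot(quat, ref) < 0:
                quat = tuple(-c for c in quat)
        self.obs.append((tuple(translation), quat))

    def fused(self):
        if not self.obs:
            raise ValueError("no observations in buffer")
        n = len(self.obs)
        t = tuple(sum(o[0][i] for o in self.obs) / n for i in range(3))
        # Normalized component-wise mean: a reasonable approximation only
        # when the buffered orientations are close to each other.
        q = normalize(tuple(sum(o[1][i] for o in self.obs) for i in range(4)))
        return t, q

# Example: an object with a 180-degree symmetry about its z axis.
buf = PoseBuffer(symmetries=[(1, 0, 0, 0), (0, 0, 0, 1)])
buf.add((0.00, 0.0, 0.0), (1, 0, 0, 0))
buf.add((0.02, 0.0, 0.0), (0, 0, 0, 1))  # flipped estimate, same physical pose
fused_t, fused_q = buf.fused()
```

Without the symmetry step, the two observations in the example would average to a meaningless orientation. With it, the flipped estimate is mapped back onto the reference before fusion, which is the kind of noise reduction the key points attribute to the pose buffer.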

Abstract

Bin picking in real industrial environments remains challenging due to severe clutter, occlusions, and the high cost of traditional 3D sensing setups. We present Pickalo, a modular 6D pose-based bin-picking pipeline built entirely on low-cost hardware. A wrist-mounted RGB-D camera actively explores the scene from multiple viewpoints, while raw stereo streams are processed with BridgeDepth to obtain refined depth maps suitable for accurate collision reasoning. Object instances are segmented with a Mask-RCNN model trained purely on photorealistic synthetic data and localized using the zero-shot SAM-6D pose estimator. A pose buffer module fuses multi-view observations over time, handling object symmetries and significantly reducing pose noise. Offline, we generate and curate large sets of antipodal grasp candidates per object; online, grasp planning selects among them using utility-based ranking and fast collision checking. Deployed on a UR5e with a parallel-jaw gripper and an Intel RealSense D435i, Pickalo achieves up to 600 mean picks per hour with 96–99% grasp success and robust performance over 30-minute runs on densely filled euroboxes. Ablation studies demonstrate the benefits of enhanced depth estimation and of the pose buffer for long-term stability and throughput in realistic industrial conditions. Videos are available at https://mesh-iit.github.io/project-jl2-camozzi/
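
The online grasp selection step described in the abstract can be sketched as follows. The abstract does not specify the utility terms or the collision checker, so everything here is an assumption made for illustration: the `Grasp` fields, the weights in `utility`, the preference for top-down approaches, and the point-to-point clearance test in `in_collision` are hypothetical stand-ins rather than Pickalo's actual design.

```python
import math
from dataclasses import dataclass

@dataclass
class Grasp:
    name: str
    quality: float        # hypothetical offline score in [0, 1]
    approach: tuple       # unit approach direction in the world frame
    finger_points: list   # sample points on the gripper, world frame, metres

def utility(g, w_quality=0.6, w_vertical=0.4):
    """Assumed utility: offline quality plus a bonus for top-down approaches."""
    vertical = max(0.0, -g.approach[2])  # 1.0 when approaching along world -z
    return w_quality * g.quality + w_vertical * vertical

def in_collision(g, scene_points, clearance=0.01):
    """Brute-force check: any scene point closer than `clearance` to the gripper."""
    return any(
        math.dist(p, s) < clearance
        for p in g.finger_points
        for s in scene_points
    )

def select_grasp(candidates, scene_points):
    """Return the highest-utility collision-free grasp, or None if all collide."""
    for g in sorted(candidates, key=utility, reverse=True):
        # Checking in ranked order means the collision test runs only until
        # the first feasible grasp is found.
        if not in_collision(g, scene_points):
            return g
    return None

# Example scene: one obstacle point next to the best-ranked grasp.
scene = [(0.0, 0.0, 0.105)]
candidates = [
    Grasp("A", 0.9, (0.0, 0.0, -1.0), [(0.0, 0.0, 0.1)]),
    Grasp("B", 0.5, (0.0, 0.0, -1.0), [(0.2, 0.0, 0.1)]),
]
chosen = select_grasp(candidates, scene)
```

In the example, grasp A has the higher utility but its finger point lies within the clearance of a scene point, so the selector falls through to B. A real system would replace the brute-force distance test with a spatial data structure or a depth-map lookup; that substitution is where the refined depth maps mentioned in the abstract would matter.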