Scalable Unseen Objects 6-DoF Absolute Pose Estimation with Robotic Integration

arXiv cs.RO / 4/20/2026


Key Points

  • The paper tackles the scalability problem in 6-DoF absolute pose estimation for unseen objects, which existing methods struggle with when CAD models or dense reference views are unavailable.
  • It introduces SinRef-6D, a setup that estimates an unseen object’s 6-DoF pose using only a single pose-labeled reference RGB-D image obtained during robotic manipulation.
  • To cope with large pose discrepancies and limited information from a single view, the method iteratively aligns points in a shared coordinate system and uses state space model (SSM) backbones, including Point and RGB SSMs, for long-range spatial dependency modeling with linear complexity.
  • After pretraining on synthetic data, SinRef-6D achieves pose estimation from a single reference view and is further integrated into a hardware-software robotic system for real-world experiments.
  • Experiments across six benchmarks and multiple real-world scenarios show improved scalability, and additional robotic grasping tests validate the practical effectiveness of both the pose estimation and the robotic integration.
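The paper's core alignment idea can be illustrated with a generic iterative point-wise alignment loop. The sketch below is not the authors' method; it is a minimal ICP-style baseline (nearest-neighbour correspondences plus a Kabsch/Umeyama rigid fit) that shows what "iteratively establishing point-wise alignment in a common coordinate system" means in practice. All function names and parameters are illustrative assumptions.

```python
import numpy as np

def best_fit_transform(src, dst):
    """Kabsch/Umeyama: rigid (R, t) minimizing ||src @ R.T + t - dst||."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                 # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s
    return R, t

def iterative_align(src, dst, iters=20):
    """ICP-style loop: re-match points, fit a rigid transform, repeat."""
    R_total, t_total = np.eye(3), np.zeros(3)
    cur = src.copy()
    for _ in range(iters):
        # nearest-neighbour correspondences in the shared coordinate frame
        d = np.linalg.norm(cur[:, None] - dst[None], axis=-1)
        matched = dst[d.argmin(axis=1)]
        R, t = best_fit_transform(cur, matched)
        cur = cur @ R.T + t                  # move points into alignment
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

A single-view setting makes the correspondence step much harder than in this toy loop, which is why the paper pairs iteration with learned SSM features rather than raw nearest neighbours.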

Abstract

Pose estimation-guided unseen object 6-DoF robotic manipulation is a key task in robotics. However, the scalability of current pose estimation methods to unseen objects remains a fundamental challenge, as they generally rely on CAD models or dense reference views of unseen objects, which are difficult to acquire and ultimately limit their scalability. In this paper, we introduce a novel task setup, referred to as SinRef-6D, which addresses 6-DoF absolute pose estimation for unseen objects using only a single pose-labeled reference RGB-D image captured during robotic manipulation. This setup is more scalable yet technically nontrivial due to large pose discrepancies and the limited geometric and spatial information contained in a single view. To address these issues, our key idea is to iteratively establish point-wise alignment in a common coordinate system with state space models (SSMs) as backbones. Specifically, to handle large pose discrepancies, we introduce an iterative object-space point-wise alignment strategy. Then, Point and RGB SSMs are proposed to capture long-range spatial dependencies from a single view, offering superior spatial modeling capability with linear complexity. Once pre-trained on synthetic data, SinRef-6D can estimate the 6-DoF absolute pose of an unseen object using only a single reference view. Building on the estimated pose, we further develop a hardware-software robotic system and integrate SinRef-6D into it for real-world deployment. Extensive experiments on six benchmarks and in diverse real-world scenarios demonstrate that our SinRef-6D offers superior scalability. Additional robotic grasping experiments further validate the effectiveness of the developed robotic system. The code and robotic demos are available at https://paperreview99.github.io/SinRef-6DoF-Robotic.
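The linear-complexity claim for the SSM backbones comes from the structure of a state space model: the output at each step is computed by a single recurrence over the sequence, so a length-L input costs O(L), versus the O(L^2) pairwise interactions of self-attention. The toy scan below is a generic discrete linear SSM, not the paper's Point/RGB SSM architecture; the matrices A, B, C are illustrative placeholders.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discrete linear SSM: h_t = A h_{t-1} + B x_t,  y_t = C h_t.

    One sequential pass over the L inputs -> O(L) time and O(1) extra
    memory per step, which is the source of the linear complexity.
    """
    h = np.zeros(A.shape[0])
    y = np.empty(len(x))
    for t, x_t in enumerate(x):
        h = A @ h + B * x_t      # state update carries long-range context
        y[t] = C @ h             # read out the current state
    return y

# With A = 0, the state has no memory and each output depends only on
# the current input: y_t = (C . B) * x_t.
A = np.zeros((2, 2))
B = np.ones(2)
C = np.ones(2)
print(ssm_scan(np.array([1.0, 2.0, 3.0]), A, B, C))  # [2. 4. 6.]
```

A nonzero A lets information from early tokens persist in the hidden state, which is how SSMs model long-range spatial dependencies without attending over all pairs of positions.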