OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks

arXiv cs.RO / 4/6/2026


Key Points

  • The paper introduces OMNI-PoseX, a fast “vision foundation model” aimed at accurate 6D object pose estimation for embodied agents in open-world settings, where existing methods struggle with generalization and stability.
  • OMNI-PoseX uses a novel architecture that combines open-vocabulary perception with an SO(3)-aware reflected flow matching pose predictor, separating object understanding from geometry-consistent rotation inference.
  • A lightweight multimodal fusion approach conditions rotation-sensitive geometric features on compact semantic embeddings to support real-time, stable pose estimation (a minimal sketch of this idea follows the list).
  • The model is trained on large-scale 6D pose datasets to improve robustness across diverse objects, viewpoints, and scenes, and the paper reports strong results on benchmarks including zero-shot generalization.
  • System-level experiments integrate OMNI-PoseX into robotic grasping, showing reliable, geometrically consistent predictions for previously unseen objects while achieving state-of-the-art accuracy and real-time efficiency.
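
To make the fusion point above concrete, here is a minimal PyTorch-style sketch of one plausible way to condition rotation-sensitive geometric features on a compact semantic embedding, using FiLM-style per-channel scale and shift. The module name, feature dimensions, and the FiLM mechanism itself are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SemanticConditionedFusion(nn.Module):
    """Hypothetical FiLM-style fusion: modulate rotation-sensitive geometric
    tokens with a compact semantic embedding. Names and dimensions are
    assumptions for illustration, not OMNI-PoseX's actual module."""

    def __init__(self, geom_dim: int = 256, sem_dim: int = 64):
        super().__init__()
        # Map the compact semantic embedding to per-channel scale and shift.
        self.to_scale_shift = nn.Sequential(
            nn.Linear(sem_dim, 2 * geom_dim),
            nn.SiLU(),
            nn.Linear(2 * geom_dim, 2 * geom_dim),
        )

    def forward(self, geom_feat: torch.Tensor, sem_emb: torch.Tensor) -> torch.Tensor:
        # geom_feat: (B, N, geom_dim) rotation-sensitive geometric tokens
        # sem_emb:   (B, sem_dim) compact semantic embedding of the object
        scale, shift = self.to_scale_shift(sem_emb).chunk(2, dim=-1)
        return geom_feat * (1.0 + scale.unsqueeze(1)) + shift.unsqueeze(1)

if __name__ == "__main__":
    fusion = SemanticConditionedFusion()
    out = fusion(torch.randn(2, 128, 256), torch.randn(2, 64))
    print(out.shape)  # torch.Size([2, 128, 256])
```

A conditioning scheme of this kind keeps the fusion lightweight: the semantic branch contributes only a small per-channel modulation, so the geometric pathway that drives rotation inference stays cheap enough for real-time use.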

Abstract

Accurate 6D object pose estimation is a fundamental capability for embodied agents, yet it remains highly challenging in open-world environments. Existing methods often rely on closed-set assumptions or geometry-agnostic regression schemes, limiting their generalization, stability, and real-time applicability in robotic systems. We present OMNI-PoseX, a vision foundation model that introduces a novel network architecture unifying open-vocabulary perception with an SO(3)-aware reflected flow matching pose predictor. The architecture decouples object-level understanding from geometry-consistent rotation inference, and employs a lightweight multimodal fusion strategy that conditions rotation-sensitive geometric features on compact semantic embeddings, enabling efficient and stable 6D pose estimation. To enhance robustness and generalization, the model is trained on large-scale 6D pose datasets, leveraging broad object diversity, viewpoint variation, and scene complexity to build a scalable open-world pose backbone. Comprehensive evaluations across benchmark pose estimation, ablation studies, zero-shot generalization, and system-level robotic grasping integration demonstrate the effectiveness of OMNI-PoseX. OMNI-PoseX achieves state-of-the-art pose accuracy and real-time efficiency, while delivering geometrically consistent predictions that enable reliable grasping of diverse, previously unseen objects.
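As a rough illustration of how an SO(3)-aware flow matching rotation predictor can be sampled at inference time, the sketch below integrates a learned tangent-space velocity field on the rotation manifold, starting from a random rotation. The `velocity_net` interface, the Euler integration scheme, and the step count are assumptions for illustration; the paper's reflected flow matching formulation is not reproduced here.

```python
import torch

def hat(omega: torch.Tensor) -> torch.Tensor:
    """Map axis-angle vectors (B, 3) to skew-symmetric matrices (B, 3, 3)."""
    zeros = torch.zeros_like(omega[:, 0])
    return torch.stack([
        torch.stack([zeros, -omega[:, 2], omega[:, 1]], dim=-1),
        torch.stack([omega[:, 2], zeros, -omega[:, 0]], dim=-1),
        torch.stack([-omega[:, 1], omega[:, 0], zeros], dim=-1),
    ], dim=-2)

def so3_exp(omega: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: exponential map from so(3) to SO(3)."""
    theta = omega.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    K = hat(omega / theta)
    theta = theta[..., None]
    eye = torch.eye(3, device=omega.device).expand_as(K)
    return eye + theta.sin() * K + (1.0 - theta.cos()) * (K @ K)

@torch.no_grad()
def sample_rotation(velocity_net, cond_feat: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Euler integration of a learned velocity field on SO(3), starting from a
    random rotation. `velocity_net` (predicting a 3-vector in the tangent space
    from the current rotation, timestep, and fused features) is a hypothetical
    interface, not the paper's exact predictor."""
    B = cond_feat.shape[0]
    R = so3_exp(torch.randn(B, 3, device=cond_feat.device))  # random initial rotation
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((B, 1), i * dt, device=cond_feat.device)
        v = velocity_net(R, t, cond_feat)   # (B, 3) tangent-space velocity
        R = R @ so3_exp(v * dt)             # right-multiplicative geodesic step
    return R
```

Because every update stays on SO(3) via the exponential map, the sampled rotations remain valid rotation matrices by construction, which is one way such a predictor can deliver the geometrically consistent outputs the abstract emphasizes.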