Relational Epipolar Graphs for Robust Relative Camera Pose Estimation

arXiv cs.RO / 4/7/2026


Key Points

  • The paper proposes a new formulation for relative camera pose estimation by treating it as relational inference on epipolar correspondence graphs rather than relying on sampling/iteration or purely learned geometry.
  • Matched keypoints become nodes in a graph, with edges connecting nearby correspondences, and graph operations (pruning, message passing, pooling) are used to estimate rotation (quaternion), translation, and the Essential Matrix.
  • Training minimizes a five-term loss against ground truth: direct L2 pose error, the Frobenius norm between estimated and ground-truth Essential Matrices, singular value differences, heading angle differences, and scale differences.
  • Correspondences come from LoFTR, a dense detector-free matcher. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation over classical and learning-guided baselines.
  • Overall, the work highlights that enforcing explicit geometric structure through global relational consensus can improve VSLAM-critical pose estimation under challenging correspondence conditions.
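
The graph construction described in the key points can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the summary only says that "nearby" correspondences are connected, so the k-nearest-neighbour rule, the 4-D (x1, y1, x2, y2) distance, the value of `k`, and the function name `build_knn_graph` are all assumptions made here.

```python
import numpy as np

def build_knn_graph(matches, k=3):
    """Return a (2, N*k) edge index for a k-nearest-neighbour graph.

    matches: (N, 4) array of correspondences (x1, y1, x2, y2); each row is a node.
    Nearness is measured in the 4-D correspondence space (an assumption).
    """
    matches = np.asarray(matches, dtype=float)
    n = matches.shape[0]
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of matches")
    # Pairwise squared Euclidean distances between all nodes.
    diff = matches[:, None, :] - matches[None, :, :]
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist2, np.inf)  # exclude self-loops
    # Indices of the k closest neighbours of every node.
    neighbours = np.argsort(dist2, axis=1)[:, :k]
    sources = np.repeat(np.arange(n), k)
    targets = neighbours.reshape(-1)
    return np.stack([sources, targets])

# Two well-separated clusters of two matches each.
demo = np.array([[0, 0, 1, 0], [1, 0, 2, 0], [10, 10, 11, 10], [11, 10, 12, 10]])
edges = build_knn_graph(demo, k=1)
```

Here each node is linked to its single nearest neighbour, so the edges stay inside each cluster. The pruning, message passing, and pooling stages would then operate on an edge index of this kind; they are not sketched here because the summary does not specify them.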

Abstract

A key component of Visual Simultaneous Localization and Mapping (VSLAM) is estimating relative camera poses using matched keypoints. Accurate estimation is challenged by noisy correspondences. Classical methods rely on stochastic hypothesis sampling and iterative estimation, while learning-based methods often lack explicit geometric structure. In this work, we reformulate relative pose estimation as a relational inference problem over epipolar correspondence graphs, where matched keypoints are nodes and nearby ones are connected by edges. Graph operations such as pruning, message passing, and pooling estimate a quaternion rotation, translation vector, and the Essential Matrix (EM). Minimizing a loss comprising (i) $\mathcal{L}_2$ differences with ground truth (GT), (ii) Frobenius norm between estimated and GT EMs, (iii) singular value differences, (iv) heading angle differences, and (v) scale differences, yields the relative pose between image pairs. The dense detector-free method LoFTR is used for matching. Experiments on indoor and outdoor benchmarks show improved robustness to dense noise and large baseline variation compared to classical and learning-guided approaches, highlighting the effectiveness of global relational consensus.
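
To make the five loss terms concrete, here is a hedged sketch that computes one plausible version of each. Several choices are assumptions not stated in the abstract: the Essential Matrix is composed from the pose as E = [t]x R (the paper's network estimates it as an output), "heading" is taken as the angle between translation directions, "scale" as the difference of translation norms, and the terms are returned unweighted. The function names are illustrative only.

```python
import numpy as np

def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z), normalised first."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def essential_from_pose(q, t):
    """Compose E = [t]x R from a quaternion and a translation vector."""
    tx, ty, tz = np.asarray(t, dtype=float)
    skew = np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])
    return skew @ quat_to_rot(q)

def pose_loss_terms(q, t, q_gt, t_gt):
    """Return the five terms separately; weighting them is left open."""
    q, t, q_gt, t_gt = (np.asarray(v, dtype=float) for v in (q, t, q_gt, t_gt))
    E, E_gt = essential_from_pose(q, t), essential_from_pose(q_gt, t_gt)
    # Singular values come back in descending order, so they align term by term.
    sv = np.linalg.svd(E, compute_uv=False)
    sv_gt = np.linalg.svd(E_gt, compute_uv=False)
    cos_heading = (t @ t_gt) / (np.linalg.norm(t) * np.linalg.norm(t_gt))
    return {
        "l2": np.linalg.norm(q - q_gt) + np.linalg.norm(t - t_gt),
        "frobenius": np.linalg.norm(E - E_gt, ord="fro"),
        "singular": np.linalg.norm(sv - sv_gt),
        "heading": np.arccos(np.clip(cos_heading, -1.0, 1.0)),  # assumed definition
        "scale": abs(np.linalg.norm(t) - np.linalg.norm(t_gt)),  # assumed definition
    }

# Same rotation and direction, but the ground-truth baseline is twice as long.
terms = pose_loss_terms([1, 0, 0, 0], [1, 0, 0], [1, 0, 0, 0], [2, 0, 0])
```

In this example the heading term is zero while the scale, Frobenius, and singular value terms are not, which shows why separate heading and scale terms can be informative. The sketch compares quaternions and Essential Matrices directly, so it ignores the q versus -q sign ambiguity and the fact that an Essential Matrix is defined only up to scale; a real training loss would need to handle both.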