ViBA: Implicit Bundle Adjustment with Geometric and Temporal Consistency for Robust Visual Matching

arXiv cs.CV / 4/7/2026


Key Points

  • ViBA is a research framework for keypoint feature learning that enables scalable training on unconstrained video streams without relying on datasets with accurate pose/depth annotations.
  • It couples an initial tracking network with depth-based outlier filtering and an implicitly differentiable global bundle adjustment module that jointly refines camera poses and feature positions via reprojection error minimization.
  • By combining geometric consistency from bundle adjustment with long-term temporal consistency across frames, ViBA aims to produce more stable and accurate visual feature representations for localization.
  • Experiments on EuRoC and UMA show improved navigation performance over methods like SuperPoint+SuperGlue, ALIKED, and LightGlue, with 12–18% lower mean absolute translation error and 5–10% lower absolute rotation error while maintaining real-time inference speeds (36–91 FPS).
  • On unseen sequences, ViBA sustains over 90% localization accuracy, indicating strong generalization and suitability for continuous online training in real-world scenarios.
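
The reprojection error that bundle adjustment minimizes can be illustrated with a small sketch. This is not ViBA's implementation: it assumes a plain pinhole camera without distortion, and the names `project`, `reprojection_cost`, and the `(fx, fy, cx, cy)` intrinsics tuple are hypothetical, chosen only for illustration. It evaluates the cost for given poses and points; the optimizer and the implicit differentiation described above are not shown.

```python
def project(point_w, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (assumed pinhole model)."""
    # Move the point into the camera frame: p_c = R * p_w + t
    pc = [
        sum(rotation[i][j] * point_w[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    if pc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic scaling and offset
    return (fx * pc[0] / pc[2] + cx, fy * pc[1] / pc[2] + cy)


def reprojection_cost(points_w, observations, poses, intrinsics):
    """Sum of squared pixel residuals over all observations.

    observations: list of (camera_index, point_index, u, v)
    poses:        list of (rotation 3x3 as nested lists, translation 3-vector)
    """
    fx, fy, cx, cy = intrinsics
    total = 0.0
    for cam, pt, u, v in observations:
        rotation, translation = poses[cam]
        pu, pv = project(points_w[pt], rotation, translation, fx, fy, cx, cy)
        total += (pu - u) ** 2 + (pv - v) ** 2
    return total


# Usage: one camera at the origin, one point, one observation 2 px off in u.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(identity, [0.0, 0.0, 0.0])]
points = [[1.0, 0.0, 2.0]]
observations = [(0, 0, 102.0, 40.0)]
cost = reprojection_cost(points, observations, poses, (100.0, 100.0, 50.0, 40.0))
```

A bundle adjustment solver would adjust both `poses` and `points` to drive this cost down; in the example the point projects to (100, 40), so the cost is 4.0.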

Abstract

Most existing image keypoint detection and description methods rely on datasets with accurate pose and depth annotations, which limits scalability and generalization and often degrades navigation and localization performance. We propose ViBA, a sustainable learning framework that integrates geometric optimization with feature learning for continuous online training on unconstrained video streams. Embedded in a standard visual odometry pipeline, it forms an implicitly differentiable geometric residual framework with three components: (i) an initial tracking network for inter-frame correspondences, (ii) depth-based outlier filtering, and (iii) differentiable global bundle adjustment (BA) that jointly refines camera poses and feature positions by minimizing reprojection errors. By combining geometric consistency from BA with long-term temporal consistency across frames, ViBA enforces stable and accurate feature representations. We evaluate ViBA on the EuRoC and UMA datasets. Compared with state-of-the-art methods such as SuperPoint+SuperGlue, ALIKED, and LightGlue, ViBA reduces mean absolute translation error (ATE) by 12-18% and absolute rotation error (ARE) by 5-10% across sequences, while maintaining real-time inference speeds (36-91 FPS). When evaluated on unseen sequences, it retains over 90% localization accuracy, demonstrating robust generalization. These results show that ViBA supports continuous online learning with geometric and temporal consistency, consistently improving navigation and localization in real-world scenarios.
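
To make the reported translation-error figures concrete, here is a minimal sketch of how a mean absolute translation error and a relative reduction could be computed. The paper's exact evaluation protocol is not given in this summary, so the sketch assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame (no alignment step is performed). The function names `mean_translation_error` and `relative_reduction` are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the paper.

```python
import math


def mean_translation_error(estimated, ground_truth):
    """Mean Euclidean distance between matched 3D positions.

    Assumes both trajectories are already aligned and time-associated.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(e, g) for e, g in zip(estimated, ground_truth)]
    return sum(distances) / len(distances)


def relative_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# Usage with made-up positions (metres): per-pose errors are 0 and 5.
estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0)]
ate = mean_translation_error(estimated, ground_truth)

# A hypothetical baseline error of 0.100 m against 0.085 m is a 15% reduction.
reduction = relative_reduction(0.100, 0.085)
```

Under this definition, a 12-18% reduction means the new mean error is 0.82 to 0.88 times the baseline's.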