Fisheye3R: Adapting Unified 3D Feed-Forward Foundation Models to Fisheye Lenses

arXiv cs.CV / 4/1/2026


Key Points

  • The paper argues that feed-forward foundation models for multi-view 3D reconstruction degrade on fisheye (wide FOV) images because non-linear fisheye projection changes pixel spatial positions in ways the perspective-trained models were not exposed to.
  • It proposes Fisheye3R, an adaptation framework designed to extend existing multi-view 3D reconstruction foundation models to natively handle fisheye inputs while avoiding regression on perspective images.
  • To overcome limited fisheye data and scarce ground-truth supervision, the authors introduce flexible learning strategies that enable self-supervised adaptation using only unlabeled perspective images.
  • They also present a supervised adaptation mode that can improve fisheye performance without requiring any fisheye training data.
  • Experiments on three foundation models (VGGT, π^3, and MapAnything) show consistent gains in camera pose, depth, point maps, and field-of-view estimation for fisheye imagery.
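The first bullet's point about shifted pixel positions can be made concrete with a toy comparison of pinhole (perspective) projection against an equidistant fisheye model, one common fisheye projection. This is an illustrative sketch, not the paper's formulation; the focal length and the choice of the equidistant model are assumptions.

```python
import math

def project_perspective(X, Y, Z, f=500.0):
    """Pinhole projection: radial image distance grows as f * tan(theta)."""
    return (f * X / Z, f * Y / Z)

def project_equidistant_fisheye(X, Y, Z, f=500.0):
    """Equidistant fisheye model: radial image distance grows as f * theta."""
    theta = math.atan2(math.hypot(X, Y), Z)  # angle from the optical axis
    phi = math.atan2(Y, X)                   # azimuth around the axis
    r = f * theta
    return (r * math.cos(phi), r * math.sin(phi))

# A 3D point 60 degrees off the optical axis lands at very different
# pixel radii under the two models -- the mismatch a perspective-trained
# network never sees.
theta = math.radians(60)
X, Z = math.sin(theta), math.cos(theta)
u_persp, _ = project_perspective(X, 0.0, Z)          # f * tan(60 deg) ~ 866 px
u_fish, _ = project_equidistant_fisheye(X, 0.0, Z)   # f * theta ~ 524 px
```

Near the image center the two models nearly agree (tan(theta) ~ theta for small angles), but the gap grows rapidly toward the wide-angle periphery, which is where fisheye lenses add their extra field of view.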

Abstract

Feed-forward foundation models for multi-view 3-dimensional (3D) reconstruction have been trained on large-scale datasets of perspective images; when tested on wide field-of-view images, e.g., from a fisheye camera, their performance degrades. The error arises from changes in the spatial positions of pixels due to a non-linear projection model that maps 3D points onto the 2D image plane. While one may surmise that training on fisheye images would resolve this problem, there are far fewer fisheye images with ground truth than perspective images, which limits generalization. To enable inference on imagery exhibiting high radial distortion, we propose Fisheye3R, a novel adaptation framework that extends these multi-view 3D reconstruction foundation models to natively accommodate fisheye inputs without performance regression on perspective images. To address the scarcity of fisheye images and ground truth, we introduce flexible learning schemes that support self-supervised adaptation using only unlabeled perspective images and supervised adaptation without any fisheye training data. Extensive experiments across three foundation models, including VGGT, π^3, and MapAnything, demonstrate that our approach consistently improves camera pose, depth, point map, and field-of-view estimation on fisheye images.