Learning 3D Reconstruction with Priors in Test Time

arXiv cs.CV / 4/7/2026


Key Points

  • The paper proposes a test-time optimization framework for multiview Transformers (MVTs) that improves 3D reconstruction tasks by using available priors such as camera poses, intrinsics, and depth without retraining or changing the underlying image-only model.
  • Instead of injecting priors into the network architecture, the method treats priors as constraints by adding penalty terms to the inference-time optimization objective.
  • The optimization loss combines a self-supervised multi-view consistency objective (photometric or geometric losses computed by rendering each view from the other views) with the prior-based penalty terms on the corresponding predicted outputs.
  • Experiments on benchmarks including point map estimation and camera pose estimation show large improvements over base MVTs, with point-map distance error reduced by more than half on ETH3D, 7-Scenes, and NRGBD.
  • The approach also outperforms retrained, prior-aware feed-forward baselines, highlighting test-time constrained optimization (TCO) as an effective way to incorporate priors for 3D vision.
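The objective described above can be sketched on a toy problem. In this minimal illustration (not the paper's actual losses or model), two views predict a shared scene depth: a consistency term pulls the per-view predictions toward agreement, while a penalty term anchors one view to an available depth prior, and plain gradient descent plays the role of the inference-time optimization. All names and the scalar setup are hypothetical.

```python
import numpy as np

def tco_loss(d, depth_prior, lam=10.0):
    """Toy TCO objective: multi-view consistency + prior penalty."""
    consistency = (d[0] - d[1]) ** 2        # views should agree
    penalty = (d[0] - depth_prior) ** 2     # depth prior on view 1
    return consistency + lam * penalty

def grad(d, depth_prior, lam=10.0):
    """Analytic gradient of tco_loss w.r.t. the two predictions."""
    g0 = 2 * (d[0] - d[1]) + 2 * lam * (d[0] - depth_prior)
    g1 = -2 * (d[0] - d[1])
    return np.array([g0, g1])

# Base model's (mutually inconsistent) predictions; the prior says 2.0.
d = np.array([1.0, 3.0])
prior = 2.0
for _ in range(500):            # test-time optimization loop
    d = d - 0.01 * grad(d, prior)

print(np.round(d, 3))           # both views converge toward the prior
```

The point mirrored here is that the prior never enters a network architecture; it only reshapes the inference-time loss landscape, so the same machinery applies unchanged when no prior is available (drop the penalty term).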

Abstract

We introduce a test-time framework for multiview Transformers (MVTs) that incorporates priors (e.g., camera poses, intrinsics, and depth) to improve 3D tasks without retraining or modifying pre-trained image-only networks. Rather than feeding priors into the architecture, we cast them as constraints on the predictions and optimize the network at inference time. The optimization loss consists of a self-supervised objective and prior penalty terms. The self-supervised objective captures the compatibility among multi-view predictions and is implemented using photometric or geometric loss between renderings from other views and each view itself. Any available priors are converted into penalty terms on the corresponding output modalities. Across a series of 3D vision benchmarks, including point map estimation and camera pose estimation, our method consistently improves performance over base MVTs by a large margin. On the ETH3D, 7-Scenes, and NRGBD datasets, our method reduces the point-map distance error by more than half compared with the base image-only models. Our method also outperforms retrained prior-aware feed-forward methods, demonstrating the effectiveness of our test-time constrained optimization (TCO) framework for incorporating priors into 3D vision tasks.