Long-tail Internet photo reconstruction

arXiv cs.CV / April 27, 2026


Key Points

  • The paper highlights a “long-tail” challenge in Internet photo-to-3D reconstruction: well-known landmarks have abundant, clean imagery and are easy to reconstruct, while most sites have sparse, noisy, uneven photos that break both classical and learned 3D methods.
  • It argues that solving this regime is a key next frontier for 3D foundation models, where obtaining reliable ground-truth supervision from sparse scenes is difficult.
  • The authors propose simulating ground-truth supervision by sampling sparse subsets from well-reconstructed Internet landmarks, creating training conditions that resemble long-tail camera distributions.
  • They introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, along with a sampling strategy to form training image sets for extreme sparsity.
  • Fine-tuning 3D foundation models using MegaDepth-X and the sampling approach improves robustness under extreme sparsity and helps in symmetric/repetitive scenes without losing performance on standard dense 3D benchmarks.
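The core idea in the key points above is to simulate long-tail conditions by drawing small, unevenly distributed image subsets from densely reconstructed landmark scenes. A minimal sketch of what such a sampler might look like is below; the function name and the distance-decay heuristic are illustrative assumptions, not the paper's actual sampling strategy.

```python
import math
import random

def sample_sparse_subset(camera_positions, k, anchor_idx=None, seed=0):
    """Sample k camera indices from a dense scene to mimic long-tail
    sparsity: pick one anchor viewpoint, then draw the remaining views
    with probability decaying in distance from the anchor, producing a
    small, uneven cluster rather than uniform coverage.
    (Hypothetical sketch; the paper's sampler may differ.)"""
    rng = random.Random(seed)
    n = len(camera_positions)
    if anchor_idx is None:
        anchor_idx = rng.randrange(n)
    anchor = camera_positions[anchor_idx]
    # Distance-based weights: nearby views are more likely, imitating
    # photographers clustered around a single popular vantage point.
    weights = [
        0.0 if i == anchor_idx else math.exp(-math.dist(p, anchor))
        for i, p in enumerate(camera_positions)
    ]
    chosen = {anchor_idx}
    while len(chosen) < min(k, n):
        chosen.add(rng.choices(range(n), weights=weights)[0])
    return sorted(chosen)
```

Subsets drawn this way from a well-reconstructed scene still carry reliable depth and pose supervision, while presenting the model with the sparse, biased camera distributions typical of long-tail sites.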

Abstract

Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Fine-tuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.