Lifting Unlabeled Internet-level Data for 3D Scene Understanding

arXiv cs.CV / 4/3/2026

Key Points

  • The paper argues that 3D scene understanding can be improved by using abundant unlabeled internet videos, since high-quality annotated 3D data is scarce and expensive to obtain.
  • It proposes “carefully designed data engines” that automatically generate training data from web-curated unlabeled videos; this generated data is then used alongside human-annotated datasets to train end-to-end 3D scene understanding models (see the sketch after this list).
  • The authors analyze key bottlenecks in automated data generation and identify factors that govern how efficiently and effectively models learn from unlabeled sources.
  • Experiments across multiple perception granularities—ranging from 3D object detection/instance segmentation to higher-level tasks like 3D spatial VQA and vision-language navigation—validate the approach.
  • Models trained on the generated data reportedly achieve strong zero-shot performance and can further improve after fine-tuning, supporting the viability of leveraging web data for more capable scene understanding systems.
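
To make the workflow concrete, here is a minimal sketch of what such a data engine might look like, assuming a pseudo-labeling pipeline: reconstruct 3D geometry from each video, lift 2D predictions into 3D pseudo-labels, and filter out low-quality scenes. The paper's actual implementation is not reproduced here; every function and type below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PseudoLabeledScene:
    """A reconstructed scene paired with automatically generated 3D labels."""
    point_cloud: List = field(default_factory=list)  # stand-in for 3D geometry
    boxes_3d: List = field(default_factory=list)     # pseudo 3D bounding boxes
    masks_3d: List = field(default_factory=list)     # pseudo instance masks


def reconstruct_scene(frames: List) -> List:
    """Hypothetical: lift a monocular video to 3D geometry,
    e.g., with an off-the-shelf SfM/MVS or depth-estimation model."""
    return []


def generate_pseudo_labels(point_cloud: List, frames: List) -> PseudoLabeledScene:
    """Hypothetical: run 2D detectors/segmenters on the frames and project
    their predictions into the reconstruction to form 3D pseudo-labels."""
    return PseudoLabeledScene(point_cloud=point_cloud)


def passes_quality_filter(scene: PseudoLabeledScene) -> bool:
    """Hypothetical: drop scenes with noisy geometry or inconsistent labels;
    filtering of this kind is one of the bottlenecks the paper analyzes."""
    return True


def build_training_set(unlabeled_videos: List[List]) -> List[PseudoLabeledScene]:
    """Data engine: unlabeled web videos in, pseudo-labeled scenes out."""
    dataset = []
    for frames in unlabeled_videos:
        cloud = reconstruct_scene(frames)
        scene = generate_pseudo_labels(cloud, frames)
        if passes_quality_filter(scene):
            dataset.append(scene)
    return dataset
```

The scenes produced this way would then be mixed with human-annotated datasets to train the end-to-end model, matching the workflow described in the key points above.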

Abstract

Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data that, alongside human-annotated datasets, supports training end-to-end 3D scene understanding models. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning from low-level perception, i.e., 3D object detection and instance segmentation, to high-level reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Language Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after fine-tuning. These results support the viability of leveraging readily available web data as a path toward more capable scene understanding systems.