SS3D: End2End Self-Supervised 3D from Web Videos

arXiv cs.CV / April 27, 2026


Key Points

  • The paper introduces SS3D, an end-to-end self-supervised 3D pretraining pipeline that learns feed-forward 3D estimation from monocular web videos using SfM-based supervision.
  • SS3D predicts depth, ego-motion, and camera intrinsics jointly in a single forward pass, with an intrinsics-first two-stage training schedule to stabilize learning.
  • To make SfM self-supervision work on unconstrained, heterogeneous web video, the authors use a multi-view signal proxy (MVS) for filtering and curriculum sampling.
  • They also distill the expert-trained models into a single student model, and report that pretraining on YouTube-8M (about 100M frames after filtering) improves zero-shot transfer and fine-tuning over prior self-supervised baselines.
  • The authors release the pretrained checkpoint and code to support replication and further research.
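The single-forward-pass interface described in the key points can be sketched as follows. This is a minimal illustration of the output structure only (per-frame depth, inter-frame ego-motion, shared intrinsics); all function names, shapes, and stub values are hypothetical and not taken from the paper or its released code.

```python
import numpy as np

def predict_3d(frames: np.ndarray) -> dict:
    """Hypothetical feed-forward 3D head: one pass over a clip of frames
    yields per-frame dense depth, 6-DoF ego-motion between consecutive
    frames, and a single pinhole intrinsics matrix for the clip.
    A real model would replace these stubs with network predictions."""
    t, h, w, _ = frames.shape
    depth = np.ones((t, h, w))            # per-frame dense depth (stub)
    ego_motion = np.zeros((t - 1, 6))     # rotation + translation per frame pair
    intrinsics = np.array([[w, 0.0, w / 2],
                           [0.0, h, h / 2],
                           [0.0, 0.0, 1.0]])  # shared pinhole K (stub guess)
    return {"depth": depth, "ego_motion": ego_motion, "intrinsics": intrinsics}
```

Predicting all three quantities in one pass is what makes the intrinsics-first schedule matter: intrinsics errors propagate into both depth and pose, so stabilizing them first constrains the rest.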

Abstract

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
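The MVS-based filtering and curriculum sampling in the abstract can be sketched as below: score each clip by its multi-view observability, discard clips below a threshold, and order the survivors from strongest to weakest signal so training begins on well-constrained clips. The scoring, threshold, and ordering policy here are illustrative assumptions, not the paper's actual procedure or values.

```python
import numpy as np

def filter_and_order(clip_scores, threshold=0.2):
    """Hypothetical MVS-proxy curriculum: drop clips with weak multi-view
    signal, then sort the rest from highest score (easiest) to lowest
    (hardest). Returns clip indices in curriculum order."""
    scores = np.asarray(clip_scores, dtype=float)
    keep = np.nonzero(scores >= threshold)[0]   # filtering step
    order = keep[np.argsort(-scores[keep])]     # curriculum: easy -> hard
    return order.tolist()

# e.g. filter_and_order([0.9, 0.1, 0.5, 0.3]) keeps clips 0, 2, 3 and
# schedules them in that order; clip 1 is filtered out.
```

On heterogeneous web video, a filter like this addresses corpus quality while the ordering addresses training stability; the two uses of the same proxy score are why the paper treats filtering and curriculum sampling together.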