SS3D: End2End Self-Supervised 3D from Web Videos
arXiv cs.CV / April 27, 2026
Key Points
- The paper introduces SS3D, an end-to-end self-supervised 3D pretraining pipeline that learns feed-forward 3D estimation from monocular web videos using SfM-based supervision.
- SS3D predicts depth, ego-motion, and camera intrinsics jointly in a single forward pass, with an intrinsics-first two-stage training schedule to stabilize learning.
- To make SfM self-supervision work on unconstrained, heterogeneous web video, the authors use a multi-view stereo (MVS) signal as a proxy for quality, driving both clip filtering and curriculum sampling.
- They also distill the expert models into a single student model, and report that pretraining on YouTube-8M (about 100M frames after filtering) improves both zero-shot transfer and fine-tuning over prior self-supervised baselines.
- The authors release the pretrained checkpoint and code to support replication and further research.
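The filtering-plus-curriculum idea above can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: `curriculum_batches`, the threshold, and the stage schedule are all assumed names and values; the only ideas taken from the summary are (a) dropping clips whose multi-view proxy score is too low for reliable SfM supervision and (b) sampling easy-to-hard as training progresses.

```python
import random

def curriculum_batches(clips, scores, threshold=0.3, stages=3, batch_size=2, seed=0):
    """Hypothetical sketch of proxy-based filtering and curriculum sampling.

    clips:  list of clip identifiers
    scores: one proxy score per clip (higher = more reliable multi-view signal)
    Returns one sampled batch of clip ids per training stage.
    """
    rng = random.Random(seed)
    # 1) Filtering: drop clips whose proxy score is too low to trust SfM supervision.
    kept = [(c, s) for c, s in zip(clips, scores) if s >= threshold]
    # 2) Curriculum ordering: most reliable (easiest) clips first.
    kept.sort(key=lambda cs: -cs[1])
    batches = []
    for stage in range(1, stages + 1):
        # Each stage widens the sampling pool to include harder clips.
        pool_size = max(batch_size, stage * len(kept) // stages)
        pool = [c for c, _ in kept[:pool_size]]
        batches.append(rng.sample(pool, min(batch_size, len(pool))))
    return batches
```

Early stages draw only from high-scoring clips, so the model sees clean geometry first; later stages mix in noisier web footage once training has stabilized.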