SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations

arXiv cs.CV / 4/10/2026


Key Points

  • SceneScribe-1M is introduced as a new large-scale, multi-modal video dataset with one million in-the-wild videos combining rich semantic and spatio-temporal information.
  • Each video is annotated with detailed text descriptions plus precise camera parameters, dense depth maps, and consistent 3D point tracks to support unified 3D perception and video understanding.
  • The dataset is benchmarked on both perception/reconstruction tasks (e.g., monocular depth estimation, scene reconstruction, dynamic point tracking) and generative tasks such as text-to-video synthesis with optional camera control.
  • The authors plan to open-source SceneScribe-1M to accelerate research into models that can jointly perceive dynamic 3D scenes and generate controllable, realistic video.
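The per-video annotation bundle described above (caption, camera parameters, depth maps, point tracks) could be represented roughly as follows. This is a minimal sketch under stated assumptions: the field names, shapes, and the `track_length` helper are illustrative inventions, not SceneScribe-1M's released schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical schema for one SceneScribe-1M video record;
# field names and shapes are assumptions, not the dataset's actual format.

@dataclass
class CameraParams:
    # Pinhole intrinsics (fx, fy, cx, cy) and a 3x4 world-to-camera extrinsic.
    intrinsics: Tuple[float, float, float, float]
    extrinsics: List[List[float]]  # 3 rows x 4 columns

@dataclass
class VideoAnnotation:
    video_id: str
    caption: str                           # detailed textual description
    cameras: List[CameraParams]            # one entry per frame
    depth_maps: List[List[List[float]]]    # per-frame H x W depth values
    point_tracks: List[List[Tuple[float, float]]]  # track -> (x, y) per frame

def track_length(ann: VideoAnnotation, track_idx: int) -> int:
    """Number of frames in which a given point track is observed."""
    return len(ann.point_tracks[track_idx])

# Toy example: a two-frame clip with one tracked point and 1x1 depth maps.
ann = VideoAnnotation(
    video_id="vid_000001",
    caption="A cyclist rides past a fountain in a city square.",
    cameras=[CameraParams((500.0, 500.0, 320.0, 240.0),
                          [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]])] * 2,
    depth_maps=[[[2.5]]] * 2,
    point_tracks=[[(100.0, 120.0), (102.0, 119.5)]],
)
print(track_length(ann, 0))  # 2
```

A record like this makes the paper's dual use case concrete: the caption feeds generative (text-to-video) training, while the cameras, depth maps, and point tracks supply supervision for the perception and reconstruction benchmarks.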

Abstract

The convergence of 3D geometric perception and video synthesis has created an unprecedented demand for large-scale video data that is rich in both semantic and spatio-temporal information. While existing datasets have advanced either 3D understanding or video generation, a significant gap remains in providing a unified resource that supports both domains at scale. To bridge this gap, we introduce SceneScribe-1M, a new large-scale, multi-modal video dataset. It comprises one million in-the-wild videos, each meticulously annotated with detailed textual descriptions, precise camera parameters, dense depth maps, and consistent 3D point tracks. We demonstrate the versatility and value of SceneScribe-1M by establishing benchmarks across a wide array of downstream tasks, including monocular depth estimation, scene reconstruction, and dynamic point tracking, as well as generative tasks such as text-to-video synthesis, with or without camera control. By open-sourcing SceneScribe-1M, we aim to provide a comprehensive benchmark and a catalyst for research, fostering the development of models that can both perceive the dynamic 3D world and generate controllable, realistic video content.