SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
arXiv cs.CV / 4/10/2026
Key Points
- SceneScribe-1M is introduced as a new multi-modal dataset of one million in-the-wild videos that pairs rich semantic annotations with spatio-temporal geometry.
- Each video is annotated with detailed text descriptions plus precise camera parameters, dense depth maps, and consistent 3D point tracks to support unified 3D perception and video understanding.
- The dataset is benchmarked on both perception and reconstruction tasks (e.g., monocular depth estimation, scene reconstruction, and dynamic point tracking) and on generative tasks such as text-to-video synthesis with optional camera control.
- The authors plan to open-source SceneScribe-1M to accelerate research into models that can jointly perceive dynamic 3D scenes and generate controllable, realistic video.
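To make the annotation structure above concrete, here is a minimal sketch of what one per-video record might look like. All field names and shapes are assumptions for illustration; the paper's actual schema is not specified in this summary.

```python
# Hypothetical sketch of a SceneScribe-1M-style annotation record.
# Field names, shapes, and units are assumptions, not the dataset's real schema.
from dataclasses import dataclass


@dataclass
class CameraParams:
    intrinsics: list   # assumed 3x3 matrix (fx, fy, cx, cy on the usual positions)
    extrinsics: list   # assumed 4x4 world-to-camera transform, row-major


@dataclass
class FrameAnnotation:
    camera: CameraParams
    depth_map: list    # assumed H x W dense depth, one value per pixel
    point_tracks: dict # assumed track_id -> (x, y) pixel location in this frame


@dataclass
class VideoRecord:
    video_id: str
    caption: str       # detailed text description of the clip
    frames: list       # one FrameAnnotation per frame


def track_length(record: VideoRecord, track_id: int) -> int:
    """Count how many frames a given point track is visible in."""
    return sum(track_id in f.point_tracks for f in record.frames)


# Toy two-frame clip with one persistent track (id 7).
cam = CameraParams(
    intrinsics=[[500, 0, 320], [0, 500, 240], [0, 0, 1]],
    extrinsics=[[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]],
)
frames = [
    FrameAnnotation(camera=cam, depth_map=[[1.0]], point_tracks={7: (10, 12)}),
    FrameAnnotation(camera=cam, depth_map=[[1.1]], point_tracks={7: (11, 12)}),
]
rec = VideoRecord(video_id="clip_0001",
                  caption="a person walks through a park",
                  frames=frames)
print(track_length(rec, 7))  # → 2
```

Keeping camera parameters, depth, and tracks per frame (rather than per video) is what lets a single record serve both reconstruction benchmarks and camera-conditioned generation.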