Stereo World Model: Camera-Guided Stereo Video Generation
arXiv cs.CV / 3/19/2026
📰 News · Models & Research
Key Points
- StereoWorld is a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation in the RGB modality, grounding its geometry in disparity.
- It introduces two key designs: a unified camera-frame RoPE for camera-aware positional encoding and a stereo-aware attention decomposition that uses 3D intra-view attention plus horizontal row attention guided by epipolar priors.
- Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity, achieving more than 3x faster generation and about a 5% gain in viewpoint consistency over monocular-then-convert pipelines.
- Beyond benchmarks, it enables end-to-end binocular VR rendering without depth estimation or inpainting and supports metric-scale depth grounding to aid embodied policy learning.
- It is compatible with long-video distillation for extended interactive stereo synthesis.
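The summary does not give implementation details, but the epipolar prior behind the horizontal row attention is easy to illustrate: for rectified stereo, corresponding points lie on the same scanline, so cross-view attention can be restricted to rows. The sketch below is a minimal, hypothetical NumPy rendition of that idea (function and variable names are ours, not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_cross_attention(q_feats, kv_feats):
    """Horizontal row attention between two views.

    Each query pixel attends only to pixels in the SAME row of the
    other view. For rectified stereo this matches the epipolar
    constraint (matches lie on the same scanline), so it is both
    cheaper than full cross-attention and geometry-aware.

    q_feats, kv_feats: (H, W, C) feature maps from the two views.
    Returns an (H, W, C) attended feature map.
    """
    H, W, C = q_feats.shape
    out = np.empty_like(q_feats)
    scale = 1.0 / np.sqrt(C)
    for y in range(H):                            # one attention problem per scanline
        q = q_feats[y]                            # (W, C) queries from this view
        k = v = kv_feats[y]                       # (W, C) keys/values from the other view
        attn = softmax(q @ k.T * scale, axis=-1)  # (W, W) row-local attention weights
        out[y] = attn @ v
    return out
```

Per-row attention costs O(H·W²) instead of O((H·W)²) for full cross-view attention, which is one plausible source of the reported speedup over monocular-then-convert pipelines.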
Related Articles
When AI Grows Up: Identity, Memory, and What Persists Across Versions
Dev.to
OpenAI is throwing everything into building a fully automated researcher
MIT Technology Review
Kimi just published a paper replacing residual connections in transformers. Results look legit
Reddit r/LocalLLaMA
A summary of optimization targets in machine learning (also useful for E-certification exam prep)
Qiita
14 Best Self-Hosted Claude Alternatives for AI and Coding in 2026
Dev.to