Stereo World Model: Camera-Guided Stereo Video Generation

arXiv cs.CV / 3/19/2026

Key Points

StereoWorld is a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation within the RGB modality, grounding geometry from disparity.
It introduces two key designs: a unified camera-frame RoPE for camera-aware positional encoding and a stereo-aware attention decomposition that uses 3D intra-view attention plus horizontal row attention guided by epipolar priors.
Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity, achieving more than 3x faster generation and about a 5% gain in viewpoint consistency over monocular-then-convert pipelines.
Beyond benchmarks, it enables end-to-end binocular VR rendering without depth estimation or inpainting and supports metric-scale depth grounding to aid embodied policy learning.
It is compatible with long-video distillation for extended interactive stereo synthesis.
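The link between disparity and metric-scale depth mentioned above can be illustrated with the standard rectified-stereo relation Z = f * B / d. This is a minimal sketch, not the paper's method: the function name and the focal length, baseline, and disparity values are assumptions chosen for illustration only.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair, using Z = f * B / d.

    disparity_px: horizontal pixel offset between left and right views
    focal_px:     focal length in pixels (hypothetical value below)
    baseline_m:   distance between the two cameras in metres (hypothetical)
    """
    if disparity_px <= 0:
        # Zero or negative disparity has no finite positive depth.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Illustrative numbers only: a 64 mm baseline and an 800 px focal length.
depth_m = disparity_to_depth(disparity_px=32.0, focal_px=800.0, baseline_m=0.064)
print(f"depth: {depth_m:.2f} m")
```

Because the baseline is a physical length, a model that produces consistent disparity between the two views implicitly fixes depth at metric scale, which is what a single monocular view cannot do on its own.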

Abstract

We present StereoWorld, a camera-conditioned stereo world model that jointly learns appearance and binocular geometry for end-to-end stereo video generation. Unlike monocular RGB or RGBD approaches, StereoWorld operates exclusively within the RGB modality, while simultaneously grounding geometry directly from disparity. To efficiently achieve consistent stereo generation, our approach introduces two key designs: (1) a unified camera-frame RoPE that augments latent tokens with camera-aware rotary positional encoding, enabling relative, view- and time-consistent conditioning while preserving pretrained video priors via a stable attention initialization; and (2) a stereo-aware attention decomposition that factors full 4D attention into 3D intra-view attention plus horizontal row attention, leveraging the epipolar prior to capture disparity-aligned correspondences with substantially lower compute. Across benchmarks, StereoWorld improves stereo consistency, disparity accuracy, and camera-motion fidelity over strong monocular-then-convert pipelines, achieving more than 3x faster generation with an additional 5% gain in viewpoint consistency. Beyond benchmarks, StereoWorld enables end-to-end binocular VR rendering without depth estimation or inpainting, enhances embodied policy learning through metric-scale depth grounding, and is compatible with long-video distillation for extended interactive stereo synthesis.
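To give a rough sense of why factoring the attention reduces compute, the sketch below counts query-key pairs for full attention over both views versus an intra-view plus row decomposition. It rests on assumptions not stated in the abstract: the row attention is taken to connect each token to the tokens in the same row and frame across both views, the latent grid sizes are made up, and heads and channel dimensions are ignored. The function name `attention_pairs` is hypothetical.

```python
def attention_pairs(T, H, W, views=2):
    """Count query-key pairs for two attention layouts on a stereo latent grid.

    T, H, W: frames, rows, and columns of latent tokens per view (illustrative).
    Assumed layouts (not confirmed by the source):
      - full 4D:     every token attends to every token in both views
      - intra-view:  every token attends to all tokens of its own view
      - row:         every token attends to the same row and frame in both views
    """
    tokens_per_view = T * H * W
    full = (views * tokens_per_view) ** 2
    intra_view = views * tokens_per_view ** 2
    row = T * H * (views * W) ** 2
    return {
        "full_4d": full,
        "intra_view_3d": intra_view,
        "row": row,
        "decomposed": intra_view + row,
    }


costs = attention_pairs(T=4, H=8, W=16)
for name, pairs in costs.items():
    print(f"{name:>14}: {pairs}")
```

Under these assumptions the cross-view term grows with the square of the row width rather than the square of the whole token count, so the cross-view correspondence search stays cheap as resolution and clip length increase. For a rectified pair, the epipolar constraint places corresponding points on the same image row, which is why restricting cross-view attention to rows loses little.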