HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
arXiv cs.CV · April 17, 2026
Key Points
- HY-World 2.0 is a multi-modal world model that takes text, single-view images, multi-view images, and videos as inputs to produce 3D world representations.
- Using text or single-view inputs, it generates high-fidelity, navigable 3D Gaussian Splatting (3DGS) scenes via a four-stage pipeline: Panorama Generation (HY-Pano 2.0), Trajectory Planning (WorldNav), World Expansion (WorldStereo 2.0), and World Composition (WorldMirror 2.0).
- The framework upgrades panorama fidelity and, through refinements to WorldStereo and WorldMirror, improves both 3D scene understanding/planning and multi-view/video-based reconstruction.
- It also provides WorldLens, a high-performance, engine-agnostic 3DGS rendering platform with features like automatic IBL lighting, efficient collision detection, and training-rendering co-design to support interactive exploration with characters.
- Experiments on multiple benchmarks show state-of-the-art results among open-source methods, with performance comparable to the closed-source model Marble, and the authors release model weights, code, and technical details for reproducibility.
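The four-stage pipeline in the second point can be sketched as a simple data flow. Note that every function name and data structure below is a hypothetical stand-in to illustrate how the stages chain together; the released code's actual API may differ.

```python
# Illustrative sketch of HY-World 2.0's text/image-to-3DGS pipeline.
# All names, signatures, and return values are hypothetical stubs
# showing data flow only, not the released implementation.

def generate_panorama(prompt):
    # Stage 1: Panorama Generation (HY-Pano 2.0) expands a text
    # prompt or single-view image into a 360-degree panorama.
    return {"type": "panorama", "source": prompt}

def plan_trajectory(panorama):
    # Stage 2: Trajectory Planning (WorldNav) derives a navigable
    # camera path through the panoramic scene (toy 4-pose path here).
    return [{"pose": i} for i in range(4)]

def expand_world(panorama, trajectory):
    # Stage 3: World Expansion (WorldStereo 2.0) synthesizes novel
    # views along the planned trajectory to extend scene coverage.
    return [{"view_at": pose["pose"]} for pose in trajectory]

def compose_world(views):
    # Stage 4: World Composition (WorldMirror 2.0) fuses the views
    # into a navigable 3D Gaussian Splatting scene representation.
    return {"type": "3dgs_scene", "n_views": len(views)}

def build_world(prompt):
    # End-to-end: text or single-image input -> navigable 3DGS scene.
    panorama = generate_panorama(prompt)
    trajectory = plan_trajectory(panorama)
    views = expand_world(panorama, trajectory)
    return compose_world(views)

scene = build_world("a sunlit alley at dusk")
print(scene["type"], scene["n_views"])  # 3dgs_scene 4
```

The key design point the bullets describe is that each stage consumes the previous stage's output, so a single text prompt or image is progressively lifted into a full 3D scene.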


