Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception
arXiv cs.RO, April 21, 2026
Key Points
- The paper proposes Infrastructure-centric World Models (I-WM) to complement existing ego-vehicle-focused world models by leveraging a persistent, bird’s-eye roadside viewpoint.
- It highlights spatio-temporal complementarity: fixed roadside sensors provide long-term temporal depth and rare safety-critical event coverage, while vehicle sensors provide wider spatial sampling across road networks.
- The authors outline a three-phase roadmap: (1) generative scene understanding with quality-aware uncertainty propagation, (2) physics-informed predictive dynamics using multi-agent counterfactual reasoning, and (3) collaborative world models for V2X through latent space alignment.
- The architecture is described as a dual-layer system in which annotation-free perception acts as a multimodal data engine feeding end-to-end generative world models, with a phased sensor strategy that progresses from LiDAR to 4D radar, signal-phase data, and event cameras.
- The work also introduces “Infrastructure VLA (I-VLA)” to unify roadside perception, language commands, and traffic control actions, and compares the vision to related paradigms such as JEPA, spatial intelligence, and VLA.
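The third roadmap phase, collaborative world models via latent space alignment, can be illustrated with a minimal sketch. All names here (`encode`, `align_loss`, the projection matrices) are illustrative assumptions, not the paper's actual API: roadside and vehicle observations of the same scene are projected into a shared latent space, and an alignment loss pulls the paired latents together.

```python
import numpy as np

# Hypothetical sketch of V2X latent-space alignment (roadmap phase 3).
# Feature and latent sizes are assumptions for illustration only.
rng = np.random.default_rng(0)
FEAT_DIM = 16   # per-view raw feature size (assumed)
LATENT_DIM = 8  # shared latent size (assumed)

# Each side learns its own projection into the shared latent space.
W_infra = rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.1
W_vehicle = rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.1

def encode(features, W):
    """Project raw sensor features into the shared latent space."""
    z = features @ W
    # L2-normalize so alignment reduces to cosine similarity.
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def align_loss(z_infra, z_vehicle):
    """1 - mean cosine similarity of paired latents (lower = better aligned)."""
    return 1.0 - float(np.mean(np.sum(z_infra * z_vehicle, axis=-1)))

# Paired observations of the same scene: the vehicle view is modeled
# as a noisy version of the roadside view (a toy assumption).
infra_obs = rng.standard_normal((4, FEAT_DIM))
vehicle_obs = infra_obs + 0.05 * rng.standard_normal((4, FEAT_DIM))

z_i = encode(infra_obs, W_infra)
z_v = encode(vehicle_obs, W_vehicle)
loss = align_loss(z_i, z_v)
```

In a trained system the projections would be optimized jointly so that `loss` approaches zero for co-observed scenes, giving vehicles and infrastructure a common latent vocabulary for prediction.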



