Infrastructure-Centric World Models: Bridging Temporal Depth and Spatial Breadth for Roadside Perception

arXiv cs.RO / April 21, 2026


Key Points

  • The paper proposes Infrastructure-centric World Models (I-WM) to complement existing ego-vehicle-focused world models by leveraging a persistent, bird’s-eye roadside viewpoint.
  • It highlights spatio-temporal complementarity: fixed roadside sensors provide long-term temporal depth and rare safety-critical event coverage, while vehicle sensors provide wider spatial sampling across road networks.
  • The authors outline a three-phase roadmap: (1) generative scene understanding with quality-aware uncertainty propagation, (2) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (3) collaborative world models for V2X through latent space alignment (a code sketch of the alignment idea follows this list).
  • The architecture is described as a dual-layer system with annotation-free perception acting as a multimodal data engine that feeds end-to-end generative world models, using a phased sensor strategy from LiDAR to 4D radar, signal phase data, and event cameras.
  • The work also introduces “Infrastructure VLA (I-VLA)” to unify roadside perception, language commands, and traffic control actions (an interface sketch follows this list), and compares the vision to related paradigms such as JEPA, spatial intelligence, and VLA.
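A rough sketch of what phase III's latent space alignment could look like in code: paired vehicle and roadside latents are projected into a shared embedding space and pulled together with a symmetric InfoNCE loss. The module and function names (LatentAligner, infonce_alignment_loss) and the contrastive formulation are illustrative assumptions; the paper names only the alignment goal, not a specific objective.

```python
# A minimal sketch of cross-view latent alignment for V2X world models.
# All names here (proj_veh, proj_infra, etc.) are illustrative; the paper
# specifies "latent space alignment", not this contrastive formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAligner(nn.Module):
    """Projects vehicle and infrastructure latents into a shared space."""
    def __init__(self, dim_veh: int, dim_infra: int, dim_shared: int = 256):
        super().__init__()
        self.proj_veh = nn.Linear(dim_veh, dim_shared)
        self.proj_infra = nn.Linear(dim_infra, dim_shared)

    def forward(self, z_veh: torch.Tensor, z_infra: torch.Tensor):
        # Normalize so dot products behave like cosine similarities.
        u = F.normalize(self.proj_veh(z_veh), dim=-1)
        v = F.normalize(self.proj_infra(z_infra), dim=-1)
        return u, v

def infonce_alignment_loss(u, v, temperature: float = 0.07):
    """Symmetric InfoNCE: matched (vehicle, roadside) latents of the same
    scene attract; mismatched pairs within the batch repel."""
    logits = u @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(u.size(0), device=u.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# Usage: align a batch of 32 paired latents from the two viewpoints.
aligner = LatentAligner(dim_veh=512, dim_infra=768)
z_v, z_i = torch.randn(32, 512), torch.randn(32, 768)
loss = infonce_alignment_loss(*aligner(z_v, z_i))
loss.backward()
```

One design note: normalizing before the dot product keeps the temperature-scaled logits stable even when the two encoders produce latents at very different scales.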
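To make the I-VLA idea concrete, here is a minimal, entirely hypothetical interface sketch: every type and field name below is an assumption, since the paper describes the unification of perception, language, and control only at the conceptual level.

```python
# A hypothetical I-VLA interface: roadside observations, a language
# command, and a traffic control action meeting in one API. None of
# these type or field names come from the paper.
from dataclasses import dataclass
from typing import List, Protocol

@dataclass
class RoadsideObservation:
    """Fused multi-sensor snapshot from one intersection."""
    timestamp: float
    agent_tracks: List[dict]      # per-agent position/velocity/class
    signal_phase: str             # current SPaT state, e.g., "NS_green"

@dataclass
class TrafficControlAction:
    """Discrete action space: signal phase changes and advisories."""
    next_phase: str               # e.g., "NS_yellow"
    hold_seconds: float           # how long to hold before re-planning
    advisory: str = ""            # optional V2X message to broadcast

class InfrastructureVLA(Protocol):
    """Maps (observation, language command) -> control action, mirroring
    vision-language-action models but from a fixed roadside viewpoint."""
    def act(self, obs: RoadsideObservation, command: str) -> TrafficControlAction:
        ...

# Usage: a trivial rule-based stand-in for the learned policy.
class GiveWayToEmergency:
    def act(self, obs, command):
        if "emergency" in command.lower():
            return TrafficControlAction(next_phase="ALL_red", hold_seconds=10.0,
                                        advisory="Emergency vehicle approaching")
        return TrafficControlAction(next_phase=obs.signal_phase, hold_seconds=1.0)

policy = GiveWayToEmergency()
obs = RoadsideObservation(timestamp=0.0, agent_tracks=[], signal_phase="NS_green")
print(policy.act(obs, "Emergency vehicle inbound from the north"))
```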

Abstract

World models, generative AI systems that simulate how environments evolve, are transforming autonomous driving, yet all existing approaches adopt an ego-vehicle perspective, leaving the infrastructure viewpoint unexplored. We argue that infrastructure-centric world models offer a fundamentally complementary capability: the bird's-eye, multi-sensor, persistent viewpoint that roadside systems uniquely possess. Central to our thesis is a spatio-temporal complementarity: fixed roadside sensors excel at temporal depth, accumulating long-term behavioral distributions including rare safety-critical events, while vehicle-borne sensors excel at spatial breadth, sampling diverse scenes across large road networks. This paper presents a vision for Infrastructure-centric World Models (I-WM) in three phases: (I) generative scene understanding with quality-aware uncertainty propagation, (II) physics-informed predictive dynamics with multi-agent counterfactual reasoning, and (III) collaborative world models for V2X communication via latent space alignment. We propose a dual-layer architecture in which annotation-free perception serves as a multimodal data engine feeding end-to-end generative world models, with a phased sensor strategy that progresses from LiDAR through 4D radar and signal phase data to event cameras. We establish a taxonomy of driving world model paradigms, position I-WM relative to LeCun's JEPA, Li Fei-Fei's spatial intelligence, and VLA architectures, and introduce Infrastructure VLA (I-VLA) as a novel unification of roadside perception, language commands, and traffic control actions. Our vision builds upon existing multi-LiDAR pipelines and identifies open-source foundations for each phase, providing a path toward infrastructure that understands and anticipates traffic.
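As one concrete reading of the abstract's "quality-aware uncertainty propagation", a simple instantiation is inverse-variance fusion in which each roadside sensor's variance is inflated by a quality score before fusing. This is a sketch of a standard technique under assumed names, not the paper's actual mechanism.

```python
# A minimal numpy sketch of quality-aware uncertainty propagation:
# each sensor's positional variance is inflated by a quality score
# (e.g., degraded by rain or occlusion) before inverse-variance fusion.
# The weighting scheme and names are assumptions, not the paper's method.
import numpy as np

def fuse_with_quality(positions, variances, quality):
    """Inverse-variance fusion of per-sensor position estimates.

    positions: (S, 2) per-sensor (x, y) estimates of one agent
    variances: (S,)   nominal per-sensor variance
    quality:   (S,)   quality in (0, 1]; lower quality inflates variance
    Returns the fused position and its propagated variance.
    """
    eff_var = variances / quality                 # quality-aware inflation
    w = 1.0 / eff_var
    w = w / w.sum()                               # normalized fusion weights
    fused = (w[:, None] * positions).sum(axis=0)
    fused_var = 1.0 / (1.0 / eff_var).sum()       # standard IVW variance
    return fused, fused_var

# Usage: LiDAR (sharp), camera (blurred by rain), radar (coarse but robust).
pos = np.array([[10.1, 4.9], [10.6, 5.3], [9.8, 5.1]])
var = np.array([0.05, 0.20, 0.40])
q   = np.array([1.0, 0.3, 0.9])   # rain hurts the camera most
fused, fused_var = fuse_with_quality(pos, var, q)
print(fused, fused_var)           # camera's vote is heavily down-weighted
```

In this toy example the rain-degraded camera contributes little to the fused estimate, and the fused variance is both explicit and smaller than any single sensor's effective variance, which is the kind of calibrated uncertainty an infrastructure perception layer would need to expose downstream.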