AI Navigate

Enactor: From Traffic Simulators to Surrogate World Models

arXiv cs.LG / 3/20/2026


Key Points

  • The paper introduces an actor-centric generative model based on transformers that captures both actor–actor interactions and traffic-intersection geometry to generate physically grounded, long-horizon trajectories.
  • It uses the World Model paradigm to learn behavior and geometry, achieving realistic trajectories with fewer training samples than traditional agent-centric approaches.
  • In a live simulation-in-the-loop setup, initial actor conditions are generated in SUMO and then controlled by the model for 40,000 timesteps (about 4,000 seconds).
  • The evaluation shows the approach outperforms the baseline on traffic-related and aggregate metrics, including more than 10× improvement in KL-divergence.
  • By combining physics-aware dynamics with learned behavior, the framework addresses limitations of existing microsimulators and deep-learning models for urban traffic analysis.
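The simulation-in-the-loop rollout described in the key points can be sketched as a simple control loop. Everything below is a hypothetical stand-in: the `ActorState` fields, the constant-velocity stub dynamics, and the step size are assumptions (the paper's model and state representation are not given in this summary), and the SUMO-generated initial conditions are replaced by hard-coded states.

```python
from dataclasses import dataclass, replace

# Hypothetical actor state; the paper's actual representation is not specified here.
@dataclass(frozen=True)
class ActorState:
    x: float
    y: float
    vx: float
    vy: float

def stub_model(states, dt):
    """Stand-in for the learned world model: advance each actor with
    constant velocity. The real model would predict dynamics from
    actor-actor interactions and intersection geometry."""
    return [replace(s, x=s.x + s.vx * dt, y=s.y + s.vy * dt) for s in states]

def simulation_in_the_loop(initial_states, model, n_steps=40_000, dt=0.1):
    """Seed from the simulator (here stubbed; the paper seeds from SUMO),
    then let the model control the actors autonomously for n_steps."""
    states = list(initial_states)
    trajectory = [states]
    for _ in range(n_steps):
        states = model(states, dt)
        trajectory.append(states)
    return trajectory

# Two stubbed actors, rolled out for 40,000 steps (~4,000 s at an assumed dt of 0.1 s).
init = [ActorState(0.0, 0.0, 1.0, 0.0), ActorState(5.0, 0.0, 0.0, 1.0)]
traj = simulation_in_the_loop(init, stub_model)
```

The long-horizon aspect is what makes this setup demanding: the model receives no ground-truth corrections after initialization, so prediction errors can compound over all 40,000 steps.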

Abstract

Traffic microsimulators are widely used to evaluate road network performance under various "what-if" conditions. However, the behavior models controlling the actions of the actors are overly simplistic and fail to capture realistic actor-actor interactions. Deep learning-based methods have been applied to model vehicles and pedestrians as "agents" responding to their surrounding "environment" (including lanes, signals, and neighboring agents). Although effective at learning actor-actor interactions, these approaches fail to generate physically consistent trajectories over long time periods, and they do not explicitly address the complex dynamics that arise at traffic intersections, which are critical locations in urban networks. Inspired by the World Model paradigm, we have developed an actor-centric generative model with a transformer-based architecture that captures actor-actor interactions while also understanding the geometry of the traffic intersection, generating physically grounded trajectories based on learned behavior. Moreover, we test the model in a live "simulation-in-the-loop" setting, where we generate the initial conditions of the actors using SUMO and then let the model control the dynamics of the actors. We let the simulation run for 40,000 timesteps (4,000 seconds), testing the performance of the model over a long time horizon and evaluating the trajectories on traffic-engineering metrics. Experimental results demonstrate that the proposed framework effectively captures complex actor-actor interactions and generates long-horizon, physically consistent trajectories, while requiring significantly fewer training samples than traditional agent-centric generative approaches. Our model outperforms the baseline on traffic-related as well as aggregate metrics, beating it by more than 10× on KL-divergence.
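As one way to see what the KL-divergence comparison in the evaluation involves, the sketch below compares binned distributions of a traffic metric between reference and model-generated data. The choice of speed as the metric, the sample values, and the binning are all illustrative assumptions; the paper's exact metrics and aggregation are not specified in this summary.

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(P || Q) over two aligned discrete probability distributions.
    eps guards against bins that are empty under Q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def histogram(samples, bins, lo, hi):
    """Normalized histogram of samples over [lo, hi)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in samples:
        i = min(int((s - lo) / width), bins - 1)
        counts[i] += 1
    total = len(samples)
    return [c / total for c in counts]

# Hypothetical speed samples (m/s): reference data vs. model-generated rollouts.
real_speeds  = [2.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 13.0]
model_speeds = [5.0, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0]
p = histogram(real_speeds, bins=5, lo=0.0, hi=15.0)
q = histogram(model_speeds, bins=5, lo=0.0, hi=15.0)
dkl = kl_divergence(p, q)  # lower means the model's distribution matches better
```

A lower KL-divergence means the model's aggregate behavior distribution is closer to the reference, which is the sense in which a >10× improvement over the baseline is reported.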