Toward Efficient and Robust Behavior Models for Multi-Agent Driving Simulation

arXiv cs.RO / 3/30/2026

Key Points

  • The paper targets scalable multi-agent driving simulation by optimizing the behavior model that controls each traffic participant, aiming for both realism and computational efficiency.
  • It introduces an instance-centric scene representation using local coordinate frames for each participant and map element, enabling viewpoint-invariant encoding and reuse of static map tokens across simulation steps.
  • For interaction modeling, it uses a query-centric symmetric context encoder with relative positional encodings to capture relationships between local frames.
  • The behavior model is learned via Adversarial Inverse Reinforcement Learning, with a proposed adaptive reward transformation that automatically balances robustness and realism during training.
  • Experiments indicate that the approach scales efficiently with the number of tokens, reduces training and inference time, and outperforms several agent-centric baselines in positional accuracy and robustness.
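
To make the instance-centric idea in the points above concrete, here is a minimal sketch of a relative pose between two local frames. It is not the paper's code: the function names `relative_pose` and `transform_pose`, the 2D `(x, y, heading)` pose layout, and the example values are all assumptions made for illustration. The point it shows is that the relative pose does not change when the whole scene is rotated and shifted, which is the kind of quantity a viewpoint-invariant encoding, and a reusable static map token, could be built on.

```python
import math


def relative_pose(src, dst):
    """Pose of dst expressed in src's local frame.

    Poses are hypothetical (x, y, heading) tuples in a shared world frame,
    with heading in radians.
    """
    sx, sy, sh = src
    dx, dy, dh = dst
    ex, ey = dx - sx, dy - sy
    c, s = math.cos(sh), math.sin(sh)
    # Rotate the world-frame offset by -sh so it is measured along src's axes.
    rx = c * ex + s * ey
    ry = -s * ex + c * ey
    # Wrap the heading difference to [-pi, pi).
    rh = (dh - sh + math.pi) % (2.0 * math.pi) - math.pi
    return rx, ry, rh


def transform_pose(pose, tx, ty, theta):
    """Apply a global rigid transform: rotate by theta, then translate by (tx, ty)."""
    x, y, h = pose
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty, h + theta)


# Illustrative values only: an agent heading "north" and a lane point 5 m ahead.
agent = (0.0, 0.0, math.pi / 2)
lane_point = (0.0, 5.0, math.pi / 2)
rel_before = relative_pose(agent, lane_point)

# Move the whole scene to a different global viewpoint.
moved_agent = transform_pose(agent, 10.0, -3.0, 0.7)
moved_lane = transform_pose(lane_point, 10.0, -3.0, 0.7)
rel_after = relative_pose(moved_agent, moved_lane)

print(rel_before)
print(rel_after)
```

Both printed tuples agree up to floating-point error, so anything computed only from an element's own local-frame features and such relative poses is independent of the global viewpoint.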

Abstract

Scalable multi-agent driving simulation requires behavior models that are both realistic and computationally efficient. We address this by optimizing the behavior model that controls individual traffic participants. To improve efficiency, we adopt an instance-centric scene representation, where each traffic participant and map element is modeled in its own local coordinate frame. This design enables efficient, viewpoint-invariant scene encoding and allows static map tokens to be reused across simulation steps. To model interactions, we employ a query-centric symmetric context encoder with relative positional encodings between local frames. We use Adversarial Inverse Reinforcement Learning to learn the behavior model and propose an adaptive reward transformation that automatically balances robustness and realism during training. Experiments demonstrate that our approach scales efficiently with the number of tokens, significantly reducing training and inference times, while outperforming several agent-centric baselines in terms of positional accuracy and robustness.
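
For readers unfamiliar with Adversarial Inverse Reinforcement Learning, the sketch below shows the standard AIRL discriminator and the reward derived from it, as commonly formulated in the AIRL literature. It does not reproduce the paper's adaptive reward transformation, which this summary does not specify. The function names and the scalar inputs `f_value` (the learned reward estimate for a state-action pair) and `log_pi` (the policy's log-probability of that action) are assumptions for illustration.

```python
import math


def airl_discriminator(f_value, log_pi):
    """Standard AIRL discriminator D = exp(f) / (exp(f) + pi).

    This equals sigmoid(f - log_pi), evaluated here in a numerically stable way.
    """
    z = f_value - log_pi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def airl_reward(f_value, log_pi):
    """Policy reward log D - log(1 - D), which simplifies to f - log_pi."""
    return f_value - log_pi


# Illustrative values only.
f_value = 0.5
log_pi = math.log(0.2)
d = airl_discriminator(f_value, log_pi)

# The two expressions for the reward agree up to floating-point error.
print(math.log(d) - math.log(1.0 - d))
print(airl_reward(f_value, log_pi))
```

An adaptive transformation such as the one the abstract mentions would presumably act on this reward signal before it reaches the policy update, but its exact form is not given here.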