Efficient Equivariant Transformer for Self-Driving Agent Modeling

arXiv cs.LG / 4/3/2026


Key Points

  • The paper introduces DriveGATr, a transformer-based architecture for self-driving agent behavior modeling that targets key symmetries in driving scenes, including permutation and SE(2) (roto-translation) equivariance.
  • Unlike common approaches that use explicit pairwise relative positional encodings to achieve SE(2)-equivariance (often incurring quadratic cost with the number of agents), DriveGATr avoids this added computational burden.
  • DriveGATr represents scene elements using multivectors from 2D projective geometric algebra (\mathbb{R}^*_{2,0,1}) and applies a stack of equivariant transformer blocks to process these representations.
  • The method achieves geometric relationship modeling through standard attention operating on multivectors, removing the need for costly explicit pairwise encodings.
  • Experiments on the Waymo Open Motion Dataset show DriveGATr matches state-of-the-art traffic simulation performance while offering a better performance-vs-compute Pareto tradeoff.
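To make the geometric-algebra idea concrete, here is a small illustrative sketch (not the paper's code) of how a 2D scene element can be embedded as a multivector in the projective geometric algebra \mathbb{R}^*_{2,0,1}, and why pairwise geometry then needs no explicit encoding: the join of two point multivectors yields the line through them, whose Euclidean norm is the inter-point distance, an SE(2)-invariant quantity. The blade ordering and function names below are assumptions for illustration.

```python
import numpy as np

# Assumed blade order for R*_{2,0,1}: [1, e0, e1, e2, e01, e02, e12, e012]
# with e0^2 = 0 and e1^2 = e2^2 = 1 (2D projective geometric algebra).

def embed_point(x, y):
    """Embed the Euclidean point (x, y) as the normalized PGA bivector
    P = e12 + x*e20 + y*e01 (note e20 = -e02)."""
    mv = np.zeros(8)
    mv[6] = 1.0   # e12  (homogeneous weight)
    mv[5] = -x    # e02 coefficient, since e20 = -e02
    mv[4] = y     # e01
    return mv

def join(P, Q):
    """Regressive product (join) of two point bivectors: the line through
    them, a*e1 + b*e2 + c*e0, computed via the homogeneous cross product."""
    x1, y1, w1 = -P[5], P[4], P[6]
    x2, y2, w2 = -Q[5], Q[4], Q[6]
    line = np.zeros(8)
    line[2] = y1 * w2 - w1 * y2   # a (e1)
    line[3] = w1 * x2 - x1 * w2   # b (e2)
    line[1] = x1 * y2 - y1 * x2   # c (e0)
    return line

def euclidean_norm(line):
    """Norm of a line's Euclidean part, sqrt(a^2 + b^2): for the join of
    two normalized points this equals their distance."""
    return np.hypot(line[2], line[3])

def se2(x, y, theta, tx, ty):
    """Apply a roto-translation to a Euclidean point."""
    c, s = np.cos(theta), np.sin(theta)
    return c * x - s * y + tx, s * x + c * y + ty
```

Because `euclidean_norm(join(P, Q))` is unchanged when both points undergo the same roto-translation, a network operating on such multivectors can recover relative geometry without a quadratic table of hand-built pairwise features.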

Abstract

Accurately modeling agent behaviors is an important task in self-driving. It is also a task with many symmetries, such as equivariance to the order of agents and objects in the scene or equivariance to arbitrary roto-translations of the entire scene as a whole; i.e., SE(2)-equivariance. The transformer architecture is a ubiquitous tool for modeling these symmetries. While standard self-attention is inherently permutation equivariant, explicit pairwise relative positional encodings have been the standard for introducing SE(2)-equivariance. However, this approach introduces an additional cost that is quadratic in the number of agents, limiting its scalability to larger scenes and batch sizes. In this work, we propose DriveGATr, a novel transformer-based architecture for agent modeling that achieves SE(2)-equivariance without the computational cost of existing methods. Inspired by recent advances in geometric deep learning, DriveGATr encodes scene elements as multivectors in the 2D projective geometric algebra \mathbb{R}^*_{2,0,1} and processes them with a stack of equivariant transformer blocks. Crucially, DriveGATr models geometric relationships using standard attention between multivectors, eliminating the need for costly explicit pairwise relative positional encodings. Experiments on the Waymo Open Motion Dataset demonstrate that DriveGATr is comparable to the state-of-the-art in traffic simulation and establishes a superior Pareto front for performance vs computational cost.
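The abstract's central claim is that standard attention over multivectors suffices: because e0 squares to zero in \mathbb{R}^*_{2,0,1}, dropping e0-containing blade components from the dot product yields an equivariance-compatible invariant similarity, so attention logits cost no more than ordinary QK^T. The sketch below is an assumed, simplified illustration of that idea (blade ordering, masking, and scaling are our choices, not the paper's implementation).

```python
import numpy as np

# Assumed blade order: [1, e0, e1, e2, e01, e02, e12, e012].
# Blades containing e0 have zero metric (e0^2 = 0), so masking them out
# makes the channel-wise dot product an invariant of the algebra.
EUCLIDEAN_MASK = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)

def multivector_attention(q, k, v):
    """Softmax attention whose logits are invariant multivector inner
    products. q, k, v have shape (n_tokens, n_channels, 8)."""
    logits = np.einsum('ncb,mcb->nm', q * EUCLIDEAN_MASK, k)
    logits /= np.sqrt(q.shape[1] * 4)  # 4 blades survive the mask
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return np.einsum('nm,mcb->ncb', w, v)
```

The key cost property: this is plain O(N^2) attention over flat token arrays, with no additional N x N x d tensor of explicit pairwise relative positional encodings to materialize.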