Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model

arXiv cs.RO / 3/27/2026



Key Points

The paper introduces TTSG, a modular framework that generates realistic and controllable autonomous-driving traffic scenes from natural language while enforcing spatial validity and semantic coherence.
It addresses core challenges such as grounding free-form text into feasible layouts, composing scenarios without predefined locations, and coordinating multi-agent behaviors with road selection.
TTSG uses LLMs as general planners but integrates them into a tightly constrained pipeline with a plan-aware road ranking algorithm to keep agent actions consistent with road geometry.
Experiments on SafeBench report the lowest average collision rate (3.5%) across three critical scenarios.
Driving captioning models trained on TTSG-generated scenes improve action reasoning by over 30 CIDEr points.

Abstract

Generating realistic and controllable traffic scenes from natural language can greatly enhance the development and evaluation of autonomous driving systems. However, this task poses unique challenges: (1) grounding free-form text into spatially valid and semantically coherent layouts, (2) composing scenarios without predefined locations, and (3) planning multi-agent behaviors and selecting roads that respect agents' configurations. To address these, we propose a modular framework, TTSG, comprising prompt analysis, road retrieval, agent planning, and a novel plan-aware road ranking algorithm. While large language models (LLMs) are used as general planners, our design integrates them into a tightly controlled pipeline that enforces structure, feasibility, and scene diversity. Notably, our ranking strategy ensures consistency between agent actions and road geometry, enabling scene generation without predefined routes or spawn points. The framework supports both routine and safety-critical scenarios, as well as multi-stage event composition. Experiments on SafeBench demonstrate that our method achieves the lowest average collision rate (3.5%) across three critical scenarios. Moreover, driving captioning models trained on our generated scenes improve action reasoning by over 30 CIDEr points. These results underscore the value of our proposed framework for flexible, interpretable, and safety-oriented simulation.
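
To make the idea of ranking roads against agent plans more concrete, here is a minimal sketch in Python. It is not the paper's algorithm: the abstract does not give the scoring rule, so the `Road` and `AgentPlan` types, the `REQUIREMENTS` table, and the "count of supported actions" score are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of checking each planned action against a candidate road's geometry and preferring roads that support every agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Road:
    """Hypothetical candidate road returned by a retrieval step."""
    road_id: str
    lanes: int
    has_junction: bool


@dataclass(frozen=True)
class AgentPlan:
    """Hypothetical planned action for one agent."""
    name: str
    action: str


# Assumed geometric requirements per action (illustrative only).
REQUIREMENTS = {
    "go_straight": {"min_lanes": 1, "needs_junction": False},
    "change_lane": {"min_lanes": 2, "needs_junction": False},
    "turn_left": {"min_lanes": 1, "needs_junction": True},
}


def road_supports(road: Road, plan: AgentPlan) -> bool:
    """Return True if the road geometry allows the planned action."""
    req = REQUIREMENTS[plan.action]
    if road.lanes < req["min_lanes"]:
        return False
    if req["needs_junction"] and not road.has_junction:
        return False
    return True


def rank_roads(roads: list[Road], plans: list[AgentPlan]) -> list[tuple[Road, int]]:
    """Score each road by how many plans it supports, best first.

    Ties are broken by road_id so the ordering is deterministic.
    """
    scored = [(road, sum(road_supports(road, p) for p in plans)) for road in roads]
    return sorted(scored, key=lambda item: (-item[1], item[0].road_id))


def feasible_roads(roads: list[Road], plans: list[AgentPlan]) -> list[Road]:
    """Keep only roads on which every agent's action is possible."""
    return [road for road, score in rank_roads(roads, plans) if score == len(plans)]
```

In this sketch, a scene with one agent changing lanes and another turning left would only be placed on a multi-lane road with a junction, which is the kind of consistency between actions and road geometry the abstract describes. A real system would score against actual map data and richer plan descriptions.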