C²T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic–Vehicle Coordination

arXiv cs.RO / 4/16/2026


Key Points

  • The paper argues that multi-agent reinforcement learning (MARL) for urban traffic control is limited by hand-crafted, short-sighted rewards that do not reflect human-centric objectives such as safety, flow stability, and comfort.
  • It introduces C2T, a framework that distills common-sense knowledge from a large language model (LLM) into a learned intrinsic reward function for traffic–vehicle coordination.
  • The learned LLM-aligned reward is used to train a cooperative multi-intersection traffic-light controller in a CityFlow-based benchmark setting.
  • Experiments show C2T improves performance over strong MARL baselines on traffic efficiency, safety, and an energy-related proxy.
  • The method is presented as flexible, enabling different coordination behaviors (e.g., efficiency-focused vs. safety-focused) by changing the LLM prompt used for reward distillation.
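
The reward-distillation idea in the points above can be illustrated with a small preference-learning sketch. The summary does not describe C2T's actual training procedure, so everything here is an assumption: the Bradley-Terry pairwise objective, the two hand-picked features (mean queue length and hard-braking count), the linear reward, and the hypothetical `label_preference` function, which stands in for an LLM judge whose prompt selects an "efficiency" or "safety" emphasis.

```python
import math
import random

# Hypothetical stand-in for an LLM judge: the prompt "mode" decides how much
# queueing versus hard braking matters when two traffic segments are compared.
JUDGE_WEIGHTS = {"efficiency": (1.0, 0.1), "safety": (0.1, 1.0)}


def label_preference(seg_a, seg_b, mode):
    """Return 1 if seg_a is preferred over seg_b (lower weighted cost), else 0."""
    wq, wb = JUDGE_WEIGHTS[mode]
    cost_a = wq * seg_a[0] + wb * seg_a[1]
    cost_b = wq * seg_b[0] + wb * seg_b[1]
    return 1 if cost_a < cost_b else 0


def reward(w, seg):
    """Linear reward over the (queue, hard-brake) features of a segment."""
    return w[0] * seg[0] + w[1] * seg[1]


def fit_reward(pairs, labels, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on the Bradley-Terry log loss."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for (a, b), y in zip(pairs, labels):
            # Bradley-Terry: P(a preferred over b) = sigmoid(r(a) - r(b))
            p = 1.0 / (1.0 + math.exp(-(reward(w, a) - reward(w, b))))
            for i in range(2):
                grad[i] += (p - y) * (a[i] - b[i])
        for i in range(2):
            w[i] -= lr * grad[i] / len(pairs)
    return w


def make_pairs(n, seed=0):
    """Sample n pairs of synthetic segments with features in [0, 1)."""
    rng = random.Random(seed)
    return [((rng.random(), rng.random()), (rng.random(), rng.random()))
            for _ in range(n)]


pairs = make_pairs(200)
learned = {}
for mode in JUDGE_WEIGHTS:
    labels = [label_preference(a, b, mode) for a, b in pairs]
    learned[mode] = fit_reward(pairs, labels)
```

Under these assumptions, switching the judge mode changes which feature the learned reward penalizes most, which mirrors the prompt-driven flexibility described above. A real pipeline would replace `label_preference` with LLM queries over captioned traffic scenes and the linear model with a learned network.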

Abstract

State-of-the-art (SOTA) urban traffic control increasingly employs Multi-Agent Reinforcement Learning (MARL) to coordinate Traffic Light Controllers (TLCs) and Connected Autonomous Vehicles (CAVs). However, the performance of these systems is fundamentally capped by their hand-crafted, myopic rewards (e.g., intersection pressure), which fail to capture high-level, human-centric goals like safety, flow stability, and comfort. To overcome this limitation, we introduce C2T, a novel framework that learns a common-sense coordination model from traffic-vehicle dynamics. C2T distills "common-sense" knowledge from a Large Language Model (LLM) into a learned intrinsic reward function. This new reward is then used to guide the coordination policy of a cooperative multi-intersection TLC MARL system on CityFlow-based multi-intersection benchmarks. Our framework significantly outperforms strong MARL baselines in traffic efficiency, safety, and an energy-related proxy. We further highlight C2T's flexibility in principle, allowing distinct "efficiency-focused" versus "safety-focused" policies by modifying the LLM prompt.
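
To make the contrast between a hand-crafted pressure reward and a learned intrinsic term concrete, here is a minimal sketch. It assumes one common simplified definition of intersection pressure (the absolute difference between vehicle counts on incoming and outgoing lanes) and an additive mix with a hypothetical weight `beta`; the abstract does not say how C2T combines the two signals, and the `intrinsic` values below are made up for illustration.

```python
def intersection_pressure(incoming_counts, outgoing_counts):
    """Absolute imbalance between vehicles on incoming and outgoing lanes."""
    return abs(sum(incoming_counts) - sum(outgoing_counts))


def combined_reward(incoming_counts, outgoing_counts, intrinsic, beta=0.5):
    """Negative pressure plus a weighted learned term (beta is hypothetical)."""
    return -intersection_pressure(incoming_counts, outgoing_counts) + beta * intrinsic


# Two states with identical lane counts: pressure alone cannot tell them apart,
# but an intrinsic score (illustrative values) can separate smooth flow from
# a state with harsh braking.
smooth = combined_reward([4, 3], [2, 1], intrinsic=0.8)
harsh = combined_reward([4, 3], [2, 1], intrinsic=-0.6)
```

Both states have the same pressure, so a pressure-only controller would treat them as equally good; the added term is what lets the policy prefer the smoother one.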