Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset
arXiv cs.AI / 4/27/2026
Key Points
- The paper argues that smart mobility and urban transportation safety require scalable intelligence beyond microscopic autonomous driving, pointing to a shortage of research on city-scale traffic analysis.
- It introduces the Land Transportation Dataset (LTD), an open-source vision-language dataset with 11.6K safety-oriented VQA pairs collected from heterogeneous roadside cameras across varied road layouts, lighting, participants, and adverse weather.
- LTD is designed to support open-ended reasoning via three integrated tasks—fine-grained grounding, multi-image camera selection, and multi-image risk analysis—so models must infer hazardous objects, causes, and risky directions from minimally correlated views.
- To improve label quality, the authors generate annotations with multiple vision-language models, then apply cross-validation and human-in-the-loop refinement; on this data they train UniVLT, a transportation foundation model that unifies microscopic autonomous-driving reasoning with macroscopic traffic analysis.
- Experiments on LTD and multiple autonomous-driving benchmarks show UniVLT reaches state-of-the-art performance for open-ended reasoning, while also revealing limitations of existing foundation models under complex multi-view traffic conditions.