Towards Safe Mobility: A Unified Transportation Foundation Model enabled by Open-Ended Vision-Language Dataset

arXiv cs.AI / 4/27/2026



Key Points

The paper argues that smart mobility and urban transportation safety require scalable intelligence beyond microscopic autonomous driving, and notes that city-scale traffic analysis has received comparatively little research attention.
It introduces the Land Transportation Dataset (LTD), an open-source vision-language dataset with 11.6K safety-oriented VQA pairs collected from heterogeneous roadside cameras across varied road layouts, lighting, participants, and adverse weather.
LTD is designed to support open-ended reasoning through three integrated tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis. Models must therefore infer hazardous objects, contributing factors, and risky road directions from minimally correlated views.
To improve label quality, the authors use multi-model vision-language generation plus cross-validation and human-in-the-loop refinement, then train UniVLT, a transportation foundation model that unifies microscopic AD reasoning and macroscopic traffic analysis.
Experiments on LTD and multiple autonomous-driving benchmarks show UniVLT reaches state-of-the-art performance for open-ended reasoning, while also revealing limitations of existing foundation models under complex multi-view traffic conditions.

Abstract

Urban transportation systems face growing safety challenges that require scalable intelligence for emerging smart mobility infrastructures. While recent advances in foundation models and large-scale multimodal datasets have strengthened perception and reasoning in intelligent transportation systems (ITS), existing research remains largely centered on microscopic autonomous driving (AD), with limited attention to city-scale traffic analysis. In particular, open-ended safety-oriented visual question answering (VQA) and corresponding foundation models for reasoning over heterogeneous roadside camera observations remain underexplored. To address this gap, we introduce the Land Transportation Dataset (LTD), a large-scale open-source vision-language dataset for open-ended reasoning in urban traffic environments. LTD contains 11.6K high-quality VQA pairs collected from heterogeneous roadside cameras, spanning diverse road geometries, traffic participants, illumination conditions, and adverse weather. The dataset integrates three complementary tasks: fine-grained multi-object grounding, multi-image camera selection, and multi-image risk analysis, requiring joint reasoning over minimally correlated views to infer hazardous objects, contributing factors, and risky road directions. To ensure annotation fidelity, we combine multi-model vision-language generation with cross-validation and human-in-the-loop refinement. Building upon LTD, we further propose UniVLT, a transportation foundation model trained via curriculum-based knowledge transfer to unify microscopic AD reasoning and macroscopic traffic analysis within a single architecture. Extensive experiments on LTD and multiple AD benchmarks demonstrate that UniVLT achieves SOTA performance on open-ended reasoning tasks across diverse domains, while exposing limitations of existing foundation models in complex multi-view traffic scenarios.
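
The abstract describes the annotation pipeline only at a high level: multi-model generation, cross-validation, and human-in-the-loop refinement. As a rough illustration of how such a cross-validation step could work, the sketch below takes candidate answers from several models, accepts the majority answer when agreement is high enough, and otherwise flags the item for human review. Everything here is an assumption made for illustration and is not taken from the paper: the names `Annotation`, `normalize`, and `cross_validate`, the exact-match comparison after whitespace and case normalization, and the 0.6 agreement threshold.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Annotation:
    """Outcome of cross-validating one VQA item (hypothetical structure)."""
    answer: Optional[str]        # accepted answer, or None if unresolved
    agreement: float             # share of candidates that match the top answer
    needs_human_review: bool     # True when the models disagree too much


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def cross_validate(candidates: Sequence[str],
                   min_agreement: float = 0.6) -> Annotation:
    """Majority-vote over candidate answers from several models.

    Assumption: answers are compared by exact match after normalization.
    A real pipeline for open-ended answers would likely need a softer
    similarity measure.
    """
    if not candidates:
        raise ValueError("at least one candidate answer is required")

    counts = Counter(normalize(c) for c in candidates)
    top_answer, votes = counts.most_common(1)[0]
    agreement = votes / len(candidates)

    if agreement >= min_agreement:
        return Annotation(top_answer, agreement, needs_human_review=False)
    # Too little consensus: leave the answer open and route to a human.
    return Annotation(None, agreement, needs_human_review=True)


if __name__ == "__main__":
    agreed = cross_validate(["Left lane", "left  lane", "right lane"])
    print(agreed.answer, agreed.needs_human_review)

    disputed = cross_validate(["left lane", "right lane", "crosswalk"])
    print(disputed.answer, disputed.needs_human_review)
```

In this sketch, items that fail the agreement check keep `answer=None`, so unresolved labels cannot slip into the dataset without a reviewer's decision.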