A global dataset of continuous urban dashcam driving

arXiv cs.CV / 4/2/2026

Key Points

  • The paper introduces CROWD, a manually curated, cross-domain dataset of continuous, front-facing urban dashcam driving segments extracted from publicly available YouTube videos.
  • CROWD contains 51,753 segment records totaling 20,275.56 hours across 7,103 inhabited places in 238 countries/territories on all six inhabited continents, with labels for time of day (day/night) and vehicle type.
  • The dataset is designed for robustness and interaction analysis by focusing on routine driving while explicitly excluding crashes, crash aftermath, and incident-focused or edited content.
  • To support benchmarking, the release provides per-segment CSVs containing machine-generated detections for all 80 MS-COCO classes using YOLOv11x and segment-local multi-object tracks using BoT-SORT.
  • CROWD is distributed via video identifiers and segment boundaries with derived annotations, aiming to enable reproducible research without redistributing the underlying source videos.
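
As a rough illustration of how the per-segment detection and tracking CSVs might be consumed, the sketch below counts distinct tracked objects per class in one segment. The column names (`frame`, `track_id`, `class_name`, `confidence`, `x1`..`y2`), the sample rows, and the confidence threshold are assumptions made for this example only; the summary above does not specify the actual CSV schema, so check the release before relying on any of them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column layout and made-up rows; the real CROWD schema may differ.
SAMPLE_CSV = """frame,track_id,class_name,confidence,x1,y1,x2,y2
0,1,car,0.91,100,200,180,260
0,2,person,0.78,300,210,320,270
1,1,car,0.93,102,201,182,261
1,3,car,0.55,400,220,460,270
2,2,person,0.31,302,211,322,271
"""


def count_tracks_per_class(csv_text, min_confidence=0.5):
    """Count distinct track IDs per class, skipping low-confidence rows."""
    tracks = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row["confidence"]) >= min_confidence:
            tracks[row["class_name"]].add(row["track_id"])
    return {name: len(ids) for name, ids in tracks.items()}


counts = count_tracks_per_class(SAMPLE_CSV)
print(counts)  # {'car': 2, 'person': 1}
```

Because the tracks are described as segment-local, a track ID is only meaningful within its own segment file. Counts like these would be computed per segment and then summed, rather than merging track IDs across files.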

Abstract

We introduce CROWD (City Road Observations With Dashcams), a manually curated dataset of ordinary, minute-scale, temporally contiguous, unedited, front-facing urban dashcam segments screened and segmented from publicly available YouTube videos. CROWD is designed to support cross-domain robustness and interaction analysis by prioritising routine driving and explicitly excluding crashes, crash aftermath, and other edited or incident-focused content. The release contains 51,753 segment records spanning 20,275.56 hours (42,032 videos), covering 7,103 named inhabited places in 238 countries and territories across all six inhabited continents (Africa, Asia, Europe, North America, South America and Oceania), with segment-level manual labels for time of day (day or night) and vehicle type. To lower the barrier for benchmarking, we provide per-segment CSV files of machine-generated detections for all 80 MS-COCO classes (e.g. person, bicycle, motorcycle, car, bus, truck, traffic light, stop sign) produced with YOLOv11x, together with segment-local multi-object tracks (BoT-SORT). CROWD is distributed as video identifiers with segment boundaries and derived annotations, enabling reproducible research without redistributing the underlying videos.
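
To make the distribution format and the headline figures concrete, the sketch below models one segment record and derives two averages from the totals quoted in the abstract. The `SegmentRecord` class, its field names, the placeholder video identifier, and the example label values are hypothetical and chosen only for illustration. The only numbers taken from the source are the totals (51,753 segments, 20,275.56 hours, 42,032 videos).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentRecord:
    """Hypothetical shape of one record: a video ID plus segment boundaries."""
    video_id: str      # YouTube video identifier (placeholder value below)
    start_s: float     # segment start, seconds from the start of the video
    end_s: float       # segment end, seconds from the start of the video
    time_of_day: str   # manual label: "day" or "night"
    vehicle_type: str  # manual label; the label vocabulary is assumed here

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


# Made-up example record, not taken from the dataset.
example = SegmentRecord("VIDEO_ID_PLACEHOLDER", 120.0, 1530.0, "day", "car")
print(example.duration_s)  # 1410.0

# Totals as reported in the abstract.
TOTAL_HOURS = 20_275.56
TOTAL_SEGMENTS = 51_753
TOTAL_VIDEOS = 42_032

mean_minutes_per_segment = TOTAL_HOURS / TOTAL_SEGMENTS * 60
segments_per_video = TOTAL_SEGMENTS / TOTAL_VIDEOS
print(f"{mean_minutes_per_segment:.1f} min per segment on average")
print(f"{segments_per_video:.2f} segments per video on average")
```

Under these totals the mean segment length comes out at roughly 23.5 minutes, with a little over one segment per source video. These are averages only and say nothing about the spread of segment lengths.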