GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes

arXiv cs.CV / April 22, 2026


Key Points

  • The paper introduces GOLD-BEV, a framework for learning dense, scene-centric semantic BEV maps that include dynamic agents using ego-centric sensors.
  • It uses time-synchronized aerial imagery as training supervision by aligning BEV with aerial crops, which provides an intuitive target and reduces ambiguity compared with ego-only BEV labeling.
  • By enforcing strict aerial-ground synchronization, the method more reliably supervises moving traffic participants and reduces temporal inconsistencies seen in non-synchronized overhead sources.
  • For scalable dense targets, the authors generate BEV pseudo-labels with domain-adapted aerial “teachers” and jointly train BEV segmentation, optionally adding pseudo-aerial BEV reconstruction for interpretability.
  • The approach further synthesizes pseudo-aerial BEV images from ego sensors to enable lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled driving data.
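The first key point hinges on aligning an ego-centric BEV grid with a time-synchronized aerial image. One common way to realize such an alignment is to map each metric BEV cell through the ego pose into aerial pixel coordinates and resample. The sketch below is a minimal nearest-neighbour version under assumed conventions (north-up aerial imagery, x-forward/y-left ego frame, the function name and resolution parameters are illustrative, not from the paper):

```python
import numpy as np

def bev_aligned_crop(aerial, ego_uv, yaw, bev_hw=(64, 64),
                     m_per_cell=0.5, m_per_px=0.25):
    """Sample a BEV-aligned crop from a north-up aerial image (sketch).

    aerial     : (H, W) or (H, W, C) array, top-down, georeferenced.
    ego_uv     : (u, v) ego position in aerial pixel coordinates (col, row).
    yaw        : ego heading in radians (0 = +u axis, counter-clockwise positive).
    """
    h, w = bev_hw
    # Metric offset of each BEV cell centre from the ego: x forward, y left.
    xs = (np.arange(h)[::-1] + 0.5) * m_per_cell        # forward distance per row
    ys = (w / 2 - np.arange(w) - 0.5) * m_per_cell      # lateral offset per column
    fwd, lat = np.meshgrid(xs, ys, indexing="ij")
    # Rotate ego-frame offsets into the aerial image frame.
    du = fwd * np.cos(yaw) - lat * np.sin(yaw)
    dv = -(fwd * np.sin(yaw) + lat * np.cos(yaw))       # image v grows downward
    u = np.round(ego_uv[0] + du / m_per_px).astype(int)
    v = np.round(ego_uv[1] + dv / m_per_px).astype(int)
    # Clip to image bounds; cells outside aerial coverage repeat border pixels here.
    u = np.clip(u, 0, aerial.shape[1] - 1)
    v = np.clip(v, 0, aerial.shape[0] - 1)
    return aerial[v, u]
```

With strict time synchronization, the crop produced this way also captures moving agents at their true positions, which is what lets overhead pixels serve as supervision for dynamic classes.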

Abstract

Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird's-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.
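The "uncertainty-aware pseudo-labeling" step on unlabeled drives is not spelled out in the summary; a common form of it keeps only high-confidence teacher predictions and masks the rest out of the loss. A minimal sketch under that assumption (the confidence threshold and the 255 ignore-label convention are illustrative choices, not details from the paper):

```python
import numpy as np

IGNORE = 255  # label id excluded from the training loss (assumed convention)

def uncertainty_masked_pseudo_labels(logits, conf_thresh=0.9):
    """Convert per-cell class logits into pseudo-labels, ignoring uncertain cells.

    logits : (C, H, W) raw scores from a teacher model for C semantic classes.
    Returns an (H, W) integer label map with low-confidence cells set to IGNORE.
    """
    # Numerically stable softmax over the class dimension.
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    conf = probs.max(axis=0)                      # per-cell peak confidence
    labels = probs.argmax(axis=0).astype(np.int64)
    labels[conf < conf_thresh] = IGNORE           # drop uncertain cells from the loss
    return labels
```

A student BEV network can then be trained on these maps with a cross-entropy loss that skips the ignore label, so only confident teacher cells contribute gradients.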