Multimodal Urban Tree Detection from Satellite and Street-Level Imagery via Annotation-Efficient Deep Learning Strategies

arXiv cs.CV / 4/7/2026


Key Points

  • The paper proposes an annotation-efficient multimodal framework that combines high-resolution satellite imagery with Google Street View to detect urban trees more scalably than labor-intensive field surveys.
  • It uses satellite data to localize likely tree candidates and then retrieves targeted street-level views, reducing wasteful full-area street sampling while improving detection detail.
  • To overcome limited annotations when moving to new regions, it applies domain adaptation to transfer learned knowledge from an existing annotated dataset to a new region of interest.
  • The study evaluates semi-supervised learning, active learning, and a hybrid strategy with a transformer-based detector, finding the hybrid approach yields the best results with an F1-score of 0.90 (about a 12% gain over the baseline).
  • Error analysis indicates that active and hybrid strategies reduce both false positives and false negatives, and semi-supervised learning can degrade over time due to confirmation bias from pseudo-labeling.

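The satellite-guided retrieval step in the second key point can be sketched as a simple geometric filter: keep only the street-level sample points that fall near a satellite-detected tree candidate, instead of sampling every street densely. The sketch below is illustrative, not the paper's implementation; the function names, the 30 m radius, and the toy coordinates are assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def select_panoramas(candidates, pano_points, max_dist_m=30.0):
    """Keep only street-view points within max_dist_m of a tree candidate
    localized from satellite imagery (hypothetical helper)."""
    return [p for p in pano_points
            if any(haversine_m(p[0], p[1], c[0], c[1]) <= max_dist_m
                   for c in candidates)]

# Toy example: two satellite tree candidates, four street-view sample points.
candidates = [(35.6895, 139.6917), (35.6900, 139.6920)]
panos = [(35.6896, 139.6918),   # ~14 m from first candidate -> kept
         (35.7000, 139.7000),   # >1 km away -> skipped
         (35.6899, 139.6921),   # ~14 m from second candidate -> kept
         (35.6800, 139.6800)]   # >1 km away -> skipped
kept = select_panoramas(candidates, panos)  # 2 of 4 points retained
```

In practice the retained points would then be passed to the street-level detector, so the expensive ground-level inference runs only where the satellite stage suggests a tree is likely.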
Abstract

Beyond the immediate biophysical benefits, urban trees play a foundational role in environmental sustainability and disaster mitigation. Precise mapping of urban trees is essential for environmental monitoring, post-disaster assessment, and policy development. However, the transition from traditional, labor-intensive field surveys to scalable automated systems remains limited by high annotation costs and poor generalization across diverse urban scenarios. This study introduces a multimodal framework that integrates high-resolution satellite imagery with ground-level Google Street View to enable scalable and detailed urban tree detection under limited-annotation conditions. The framework first leverages satellite imagery to localize tree candidates and then retrieves targeted ground-level views for detailed detection, significantly reducing inefficient street-level sampling. To address the annotation bottleneck, domain adaptation is used to transfer knowledge from an existing annotated dataset to a new region of interest. To further minimize human effort, we evaluated three learning strategies: semi-supervised learning, active learning, and a hybrid approach combining both, using a transformer-based detection model. The hybrid strategy achieved the best performance with an F1-score of 0.90, representing a 12% improvement over the baseline model. In contrast, semi-supervised learning exhibited progressive performance degradation due to confirmation bias in pseudo-labeling, while active learning steadily improved results through targeted human intervention to label uncertain or incorrect predictions. Error analysis further showed that active and hybrid strategies reduced both false positives and false negatives. Our findings highlight the importance of a multimodal approach and guided annotation for scalable, annotation-efficient urban tree mapping to strengthen sustainable city planning.
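The hybrid strategy described in the abstract combines two routing decisions per model prediction: confident predictions become pseudo-labels (the semi-supervised branch), while uncertain ones are queued for human annotation (the active-learning branch). The routing rule below is a minimal sketch of that idea under assumed thresholds; the confidence cutoffs, function name, and sample data are illustrative and not taken from the paper.

```python
def split_predictions(preds, pseudo_thresh=0.9, query_band=(0.4, 0.7)):
    """Route detector predictions for one round of hybrid training.

    preds: list of (sample_id, confidence) pairs from the detection model.
    - confidence >= pseudo_thresh      -> accepted as a pseudo-label
    - confidence inside query_band     -> queued for human annotation
    - everything else                  -> discarded this round
    Restricting pseudo-labels to high confidence is one common way to limit
    the confirmation bias that degrades pure semi-supervised training.
    """
    lo, hi = query_band
    pseudo, human_queue = [], []
    for sample_id, conf in preds:
        if conf >= pseudo_thresh:
            pseudo.append(sample_id)
        elif lo <= conf <= hi:
            human_queue.append(sample_id)
    return pseudo, human_queue

preds = [("img_01", 0.97), ("img_02", 0.55),
         ("img_03", 0.12), ("img_04", 0.93)]
pseudo, queue = split_predictions(preds)
# pseudo -> ["img_01", "img_04"]; queue -> ["img_02"]
```

After each round, the human-labeled queue and the pseudo-labels would be merged into the training set and the detector retrained, which mirrors how the abstract's targeted human intervention counteracts pseudo-label drift.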