UAV as Urban Construction Change Monitor: A New Benchmark and Change Captioning Model

arXiv cs.CV / 5/7/2026


Key Points

  • The paper addresses Remote Sensing Image Change Captioning by moving from binary change masks to spatially grounded, semantic natural-language descriptions of scene evolution.
  • It proposes PTNet, a prototype-guided, task-adaptive framework that explicitly models structured change semantics and incorporates change-detection priors to improve coherence between detected changes and generated captions.
  • PTNet uses a learnable prototype bank for cross-temporal interaction, multi-head gating to separate task-specific representations, and detection-derived spatial priors during caption generation to retain fine-grained spatial sensitivity.
  • The authors introduce UCCD, a UAV-based large-scale benchmark with 9,000 high-resolution bi-temporal image pairs and 45,000 annotated sentences focused on urban construction monitoring.
  • Experiments on UCCD and WHU-CDC show PTNet consistently outperforms prior methods, and the dataset and code are released publicly.

Abstract

Remote Sensing Image Change Captioning (RSICC) aims to generate spatially grounded natural language descriptions of scene evolution from bi-temporal imagery, moving beyond binary change masks toward semantic-level understanding. However, existing methods rely on implicit feature differencing without explicitly modeling structured change semantics, and struggle to reconcile the conflicting representation demands of change detection and caption generation. In addition, current benchmarks provide limited coverage of high-resolution urban construction scenarios. To address these challenges, we propose PTNet, a prototype-guided task-adaptive framework for joint change captioning and detection. PTNet explicitly models structured change semantics through a learnable prototype bank that guides cross-temporal interaction, disentangles task-specific representations via multi-head gating, and injects detection-derived spatial priors into caption generation, enabling coherent semantic correspondence while preserving fine-grained spatial sensitivity. Furthermore, we construct UCCD, a large-scale UAV-based benchmark comprising 9,000 high-resolution image pairs and 45,000 annotated sentences for urban construction monitoring. Extensive experiments on UCCD and WHU-CDC demonstrate that PTNet consistently outperforms existing methods. The dataset and source code are publicly available at https://github.com/G124556/ptnet.
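The abstract names three mechanisms: a learnable prototype bank guiding cross-temporal interaction, multi-head gating that disentangles task-specific representations, and detection-derived spatial priors injected into caption generation. The sketch below illustrates how those three pieces could fit together in PyTorch; all module names, shapes, and wiring are assumptions for illustration, not the authors' released implementation (which is available at the linked repository).

```python
import torch
import torch.nn as nn

class PrototypeGuidedInteraction(nn.Module):
    """Illustrative sketch (not the official PTNet code) of:
    (1) a learnable prototype bank for cross-temporal interaction,
    (2) multi-head gating to separate task-specific features, and
    (3) injection of a detection-derived spatial prior into the
    caption branch. Shapes and names are assumed for this demo."""

    def __init__(self, dim=64, num_prototypes=16, num_tasks=2):
        super().__init__()
        # Learnable prototype bank encoding structured change semantics.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # One gating head per task (e.g. detection vs. captioning).
        self.gates = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
            for _ in range(num_tasks)
        )

    def forward(self, feat_t1, feat_t2, change_prior):
        # feat_t1 / feat_t2: (B, N, D) bi-temporal patch features.
        # change_prior: (B, N, 1) detection-derived change probability map.
        diff = feat_t2 - feat_t1  # implicit difference features
        proto = self.prototypes.unsqueeze(0).expand(diff.size(0), -1, -1)
        # Cross-temporal interaction: difference queries attend to prototypes.
        fused, _ = self.attn(diff, proto, proto)
        # Multi-head gating disentangles task-specific representations.
        task_feats = [gate(fused) * fused for gate in self.gates]
        # Spatial prior injection: weight caption features by detected change,
        # preserving fine-grained spatial sensitivity in generation.
        detect_feat = task_feats[0]
        caption_feat = task_feats[1] * change_prior
        return detect_feat, caption_feat
```

A quick sanity check with random tensors shows both branches keep the `(B, N, D)` feature layout, so a detection head and a caption decoder can each consume their own gated stream:

```python
m = PrototypeGuidedInteraction(dim=32, num_prototypes=8)
t1, t2 = torch.randn(2, 10, 32), torch.randn(2, 10, 32)
prior = torch.rand(2, 10, 1)
det, cap = m(t1, t2, prior)  # both (2, 10, 32)
```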
