Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping

arXiv cs.CV / 4/22/2026

Key Points

  • The paper addresses a key bottleneck in high-resolution remote sensing mapping: directly fusing global geospatial foundation-model embeddings with local high-resolution features can cause feature interference and degrade spatial structure due to a semantic–spatial gap.
  • It proposes a Structure-Semantic Decoupled Modulation (SSDM) framework that splits global representations into two cross-modal injection pathways: a structural prior modulation branch and a global semantic injection branch.
  • The structural prior branch injects macroscopic receptive-field priors into self-attention layers of the high-resolution encoder to reduce fragmentation and stabilize local feature extraction under high-frequency noise and intra-class variance.
  • The semantic injection branch aligns holistic context with the deep high-resolution feature space and supplements global semantics through cross-modal integration to improve semantic consistency and category discrimination.
  • Experiments on remote sensing tasks show SSDM achieves state-of-the-art results over existing cross-modal fusion approaches and improves mapping accuracy across multiple scenarios.
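
The paper does not spell out its equations here, but the structural prior branch as described (macroscopic priors injected into the encoder's self-attention) admits a simple reading: the global embedding supplies an additive bias on the attention logits, so that local tokens are pulled toward the global structure before softmax. Below is a minimal NumPy sketch of that reading; the function name `modulated_self_attention`, the single-head setup, and the `structural_bias` matrix are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_self_attention(x, w_q, w_k, w_v, structural_bias):
    """Single-head self-attention over local tokens with an additive
    structural bias on the attention logits (hypothetical sketch).

    x:               (n, d) local high-resolution tokens
    w_q, w_k, w_v:   (d, d) projection weights
    structural_bias: (n, n) prior derived from the global embedding
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Inject the macroscopic prior before normalization, so holistic
    # structure constrains which local tokens attend to each other.
    logits = q @ k.T / np.sqrt(q.shape[-1]) + structural_bias
    attn = softmax(logits, axis=-1)  # each row sums to 1
    return attn @ v, attn

rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.standard_normal((n, d))
w_q, w_k, w_v = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]
bias = 0.5 * rng.standard_normal((n, n))  # stand-in for a real global prior
out, attn = modulated_self_attention(x, w_q, w_k, w_v, bias)
```

Because the bias enters before the softmax, a strong prior can smooth attention across tokens of the same land-cover region, which is one way such a mechanism could suppress fragmented predictions.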

Abstract

Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.
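
The semantic injection branch described in the abstract (explicit alignment of holistic context with the deep feature space, followed by cross-modal integration) can likewise be sketched as cross-attention: deep local features query aligned global tokens, and the result is added back residually. This is a hedged illustration of that pattern, not the paper's architecture; `w_align` and all dimensions below are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_injection(local_feats, global_tokens, w_align, w_q, w_k, w_v):
    """Cross-attention that injects global semantics into deep local
    features (hypothetical sketch).

    local_feats:   (n, d) deep high-resolution features (queries)
    global_tokens: (m, g) global geospatial embedding tokens
    w_align:       (g, d) aligns the global embedding to the local space
    w_q, w_k, w_v: (d, d) attention projections
    """
    g = global_tokens @ w_align                 # explicit alignment step
    q = local_feats @ w_q
    k, v = g @ w_k, g @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual add keeps spatial detail while supplementing semantics.
    return local_feats + attn @ v

rng = np.random.default_rng(1)
n, m, d, g = 16, 4, 8, 32
local_feats = rng.standard_normal((n, d))
global_tokens = rng.standard_normal((m, g))
w_align = 0.1 * rng.standard_normal((g, d))
w_q, w_k, w_v = [0.1 * rng.standard_normal((d, d)) for _ in range(3)]
fused = semantic_injection(local_feats, global_tokens, w_align, w_q, w_k, w_v)
```

The residual formulation reflects the decoupling claim: the local pathway retains its spatial structure, while the global pathway only supplements category-level semantics.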