Lean Learning Beyond Clouds: Efficient Discrepancy-Conditioned Optical-SAR Fusion for Semantic Segmentation

arXiv cs.CV / 3/24/2026

Key Points

  • The paper addresses how cloud occlusion in optical remote sensing imagery harms semantic segmentation, and argues that optical–SAR fusion is needed for robustness but remains hard to model efficiently under cloud interference.
  • It proposes EDC, an efficiency-oriented, discrepancy-conditioned fusion framework that uses a tri-stream encoder with Carrier Tokens to model global context compactly and at reduced complexity.
  • EDC introduces Discrepancy-Conditioned Hybrid Fusion (DCHF) to selectively suppress unreliable regions so cloud-induced noise is not propagated during global aggregation.
  • To improve semantic consistency under occlusion, the method adds an auxiliary cloud-removal branch trained with teacher-guided distillation.
  • Experiments report better accuracy and efficiency, including mIoU gains of 0.56% (M3M-CR) and 0.88% (WHU-OPT-SAR), a 46.7% parameter reduction, and ~1.98× faster inference, with code released on GitHub.
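
To make the idea of discrepancy-conditioned suppression concrete, here is a minimal sketch in plain Python. It is not the paper's DCHF module: the cosine-based discrepancy score, the exponential reliability weight, the `temperature` parameter, and the function names are all assumptions chosen for illustration. It also assumes optical and SAR features have already been projected into a shared embedding space, so that a large disagreement at a location can be read as a hint that the optical signal there is unreliable (for example, under cloud).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as maximally uninformative
    return dot / (norm_a * norm_b)


def discrepancy_gated_fusion(optical, sar, temperature=0.5):
    """Fuse per-location features, down-weighting optical where it disagrees with SAR.

    optical, sar: lists of feature vectors, one vector per spatial location.
    temperature:  hypothetical knob controlling how fast trust in the
                  optical features decays as the discrepancy grows.
    Returns (fused_features, optical_weights).
    """
    fused, weights = [], []
    for opt_vec, sar_vec in zip(optical, sar):
        # Assumed discrepancy score: 0 when aligned, up to 2 when opposed.
        discrepancy = 1.0 - cosine_similarity(opt_vec, sar_vec)
        # Assumed reliability weight in (0, 1]; 1 means "trust optical fully".
        w = math.exp(-discrepancy / temperature)
        fused.append([w * o + (1.0 - w) * s for o, s in zip(opt_vec, sar_vec)])
        weights.append(w)
    return fused, weights


# Location 0: modalities agree. Location 1: they disagree (e.g. cloud cover).
optical = [[1.0, 0.0], [0.0, 1.0]]
sar = [[1.0, 0.0], [1.0, 0.0]]
fused, weights = discrepancy_gated_fusion(optical, sar)
```

In this toy input the first location keeps its optical features unchanged, while the second is pulled strongly toward the SAR features. A learned gate inside an attention layer would replace the fixed formula in practice; the sketch only shows the conditioning principle.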

Abstract

Cloud occlusion severely degrades the semantic integrity of optical remote sensing imagery. While incorporating Synthetic Aperture Radar (SAR) provides complementary observations, achieving efficient global modeling and reliable cross-modal fusion under cloud interference remains challenging. Existing methods rely on dense global attention to capture long-range dependencies, yet such aggregation indiscriminately propagates cloud-induced noise. Improving robustness typically entails enlarging model capacity, which further increases computational overhead. Given the large-scale and high-resolution nature of remote sensing applications, such computational demands hinder practical deployment, leading to an efficiency-reliability trade-off. To address this dilemma, we propose EDC, an efficiency-oriented and discrepancy-conditioned optical-SAR semantic segmentation framework. A tri-stream encoder with Carrier Tokens enables compact global context modeling with reduced complexity. To prevent noise contamination, we introduce a Discrepancy-Conditioned Hybrid Fusion (DCHF) mechanism that selectively suppresses unreliable regions during global aggregation. In addition, an auxiliary cloud removal branch with teacher-guided distillation enhances semantic consistency under occlusion. Extensive experiments demonstrate that EDC achieves superior accuracy and efficiency, improving mIoU by 0.56% and 0.88% on M3M-CR and WHU-OPT-SAR, respectively, while reducing the number of parameters by 46.7% and accelerating inference by 1.98×. Our implementation is available at https://github.com/mengcx0209/EDC.
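
The efficiency figures are easier to interpret as ratios against the comparison model. The short calculation below converts the reported 46.7% parameter reduction and 1.98× speedup into relative cost. The abstract does not name the baseline or give absolute parameter counts or latencies, so only ratios are computed; the function name is a placeholder.

```python
def relative_cost(param_reduction_pct, speedup):
    """Convert a percentage reduction and a speedup factor into cost ratios.

    Returns (params_ratio, latency_ratio), both relative to a baseline of 1.0.
    """
    params_ratio = 1.0 - param_reduction_pct / 100.0
    latency_ratio = 1.0 / speedup
    return params_ratio, latency_ratio


params_ratio, latency_ratio = relative_cost(46.7, 1.98)
print(f"parameters: {params_ratio:.3f}x baseline")    # 0.533x baseline
print(f"inference time: {latency_ratio:.3f}x baseline")  # 0.505x baseline
```

In other words, the reported numbers correspond to a model with roughly half the parameters that takes roughly half the inference time of the model it is compared against.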