A generalised pre-training strategy for deep learning networks in semantic segmentation of remotely sensed images

arXiv cs.CV / 5/1/2026


Key Points

  • The paper targets a key bottleneck in remote-sensing semantic segmentation: models pre-trained on ImageNet often underperform after fine-tuning due to large domain gaps between natural images and remote-sensing data.
  • It proposes a novel but simple generalized pre-training strategy that discourages the model from overfitting to domain-specific features found in the pre-training dataset, aiming to improve transfer generalization.
  • Experiments pre-train on ImageNet and then fine-tune on four diverse remote-sensing segmentation datasets (iSAID, MFNet, PST900, Potsdam) to test robustness across scenes and modalities.
  • The approach achieves state-of-the-art results across all evaluated datasets, reaching 67.4% mIoU (iSAID), 56.9% mIoU (MFNet), 84.22% mIoU (PST900), and 91.88% mF1 (Potsdam).
  • The authors position the work as groundwork toward a unified foundation model spanning both general computer vision and remote-sensing applications.

Abstract

In the segmentation of remotely sensed images, deep learning models are typically pre-trained on large image databases such as ImageNet before being fine-tuned on domain-specific datasets. However, the performance of these fine-tuned models is often hindered by the large domain gaps (i.e., differences in scenes and modalities) between ImageNet images and the remotely sensed images being processed. Many researchers have therefore sought to establish large-scale domain-specific image datasets for pre-training, aiming to enhance model performance. However, establishing such datasets requires significant effort, and the resulting datasets often exhibit limited generalisability to other application scenarios. To address these issues, this study introduces a novel yet simple pre-training strategy designed to guide a model away from learning domain-specific features of the pre-training dataset, thereby improving the generalisation ability of the pre-trained model. To evaluate the strategy's effectiveness, deep learning models are pre-trained on ImageNet and subsequently fine-tuned on four semantic segmentation datasets with diverse scenes and modalities: iSAID, MFNet, PST900 and Potsdam. Experimental results show that the proposed pre-training strategy leads to state-of-the-art accuracies on all four datasets, namely 67.4% mIoU on iSAID, 56.9% mIoU on MFNet, 84.22% mIoU on PST900, and 91.88% mF1 on Potsdam. This research lays the groundwork for developing a unified foundation model applicable to both computer vision and remote sensing applications.
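The abstract does not spell out how the model is "guided away" from domain-specific features, but strategies of this kind typically add a regularisation term to the pre-training objective. The sketch below is purely illustrative and not the paper's actual method: it assumes a hypothetical penalty on per-feature mean activations as a stand-in for whatever regulariser the authors use, combined with the task loss via a weighting factor `lam`.

```python
import numpy as np

def generalised_pretraining_loss(task_loss, features, lam=0.1):
    """Hypothetical combined pre-training objective.

    task_loss: scalar loss from the pre-training task (e.g. classification).
    features:  (batch, dim) array of intermediate feature activations.
    lam:       weight of the anti-overfitting regulariser (assumed).

    The penalty here is the squared norm of the batch-mean feature vector,
    which discourages features from drifting toward dataset-wide biases.
    This specific choice is an assumption for illustration only.
    """
    mean_feat = features.mean(axis=0)          # per-feature batch mean
    penalty = float(mean_feat @ mean_feat)     # squared L2 norm of the mean
    return task_loss + lam * penalty
```

Under this toy formulation, a batch whose features are already zero-mean incurs no penalty, while systematically biased features add `lam` times the squared bias to the objective.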