Dual-Foundation Models for Unsupervised Domain Adaptation

arXiv cs.CV / 5/6/2026

📰 News · Models & Research

Key Points

  • The paper tackles unsupervised domain adaptation (UDA) for semantic segmentation by addressing the persistent domain gap between labeled synthetic data and unlabeled real images.
  • It identifies two weaknesses in prior methods: dependence on high-confidence pseudo-labels that limits learning coverage, and prototype/contrastive approaches that use biased, unstable anchors from source-trained models.
  • The proposed dual-foundation framework combines SAM with superpixel-guided prompting to learn from a wider set of target pixels beyond only high-confidence predictions.
  • It also integrates DINOv3 to build stable, domain-invariant class prototypes via robust representation learning, improving alignment during adaptation.
  • Experiments on GTA→Cityscapes and SYNTHIA→Cityscapes show consistent gains of +1.3% and +1.4% mIoU over strong UDA baselines, respectively.
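The first limitation above — confidence-thresholded pseudo-labeling — is easy to see in miniature. The sketch below is illustrative only (it is not from the paper): the threshold value, the `255` ignore index, and the toy probabilities are assumptions, chosen to show how thresholding leaves uncertain target pixels entirely unsupervised.

```python
import numpy as np

def pseudo_label_coverage(probs, threshold=0.9):
    """Assign pseudo-labels only to confident pixels (illustrative sketch).

    probs: (N, C) per-pixel softmax scores.
    Returns pseudo-labels (255 = ignored) and the fraction of labeled pixels.
    """
    confidence = probs.max(axis=-1)      # per-pixel max softmax score
    keep = confidence >= threshold       # only confident pixels get a label
    labels = probs.argmax(axis=-1)
    labels[~keep] = 255                  # ignore index: no supervision here
    return labels, keep.mean()

# Toy example: 4 target pixels over 3 classes
probs = np.array([
    [0.98, 0.01, 0.01],  # confident  -> labeled
    [0.50, 0.30, 0.20],  # uncertain  -> ignored
    [0.97, 0.02, 0.01],  # confident  -> labeled
    [0.40, 0.35, 0.25],  # uncertain  -> ignored
])
labels, coverage = pseudo_label_coverage(probs)
print(labels.tolist(), coverage)  # -> [0, 255, 0, 255] 0.5
```

Half the pixels receive no learning signal at all; the paper's SAM-plus-superpixel prompting is motivated by recovering supervision for exactly this ignored set.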

Abstract

Semantic segmentation provides pixel-level scene understanding essential for autonomous driving and fine-grained perception tasks. However, training segmentation models requires costly, labor-intensive annotations on real-world datasets. Unsupervised Domain Adaptation (UDA) addresses this by training models on labeled synthetic data and adapting them to unlabeled real images. While conceptually simple, adaptation is challenging due to the domain gap, i.e., differences in visual appearance and scene structure between synthetic and real data. Prior approaches bridge this gap through pixel-level mixing or feature-level contrastive learning. Yet, these techniques suffer from two major limitations: (1) reliance on high-confidence pseudo-labels restricts learning to a subset of the target domain, and (2) prototype-based contrastive methods initialize class prototypes from source-trained models, yielding biased and unstable anchors during adaptation. To address these issues, we propose a dual-foundation UDA framework that leverages two complementary foundation models. First, we employ the Segment Anything Model (SAM) with superpixel-guided prompting to enable learning from a broader range of target pixels beyond high-confidence predictions. Second, we incorporate DINOv3 to construct stable, domain-invariant class prototypes through its robust representation learning. Our method achieves consistent improvements of +1.3% and +1.4% mIoU over strong UDA baselines on GTA-to-Cityscapes and SYNTHIA-to-Cityscapes, respectively.
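The abstract's second fix — stable, domain-invariant class prototypes built from a robust frozen encoder rather than a source-trained model — typically amounts to class-wise mean pooling of features with a slow-moving (EMA) update. The sketch below is a generic version of that idea under stated assumptions (unit-normalized prototypes, `255` ignore index, momentum value), not the paper's implementation:

```python
import numpy as np

def update_prototypes(prototypes, feats, labels, num_classes, momentum=0.99):
    """EMA update of per-class prototypes from frozen-encoder pixel features.

    prototypes: (K, D) current class anchors
    feats:      (N, D) pixel features (e.g. from a frozen backbone)
    labels:     (N,)   per-pixel class ids; 255 marks ignored pixels
    """
    for c in range(num_classes):
        sel = labels == c
        if not sel.any():
            continue  # class absent from this batch: keep old anchor
        class_mean = feats[sel].mean(axis=0)
        prototypes[c] = momentum * prototypes[c] + (1 - momentum) * class_mean
    # re-normalize so anchors stay on the unit sphere for cosine similarity
    norms = np.linalg.norm(prototypes, axis=1, keepdims=True)
    return prototypes / np.clip(norms, 1e-8, None)

# Toy usage: 3 classes, 4-dim features, two labeled pixels
protos = np.ones((3, 4)) / 2.0                       # unit-norm start
feats = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
labels = np.array([0, 1])
protos = update_prototypes(protos, feats, labels, num_classes=3)
```

The high momentum keeps anchors stable across batches, which is the property the paper attributes to its DINOv3-based prototypes in contrast to biased anchors initialized from a source-trained segmenter.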