
Local Precise Refinement: A Dual-Gated Mixture-of-Experts for Enhancing Foundation Model Generalization against Spectral Shifts

arXiv cs.CV / 3/17/2026

📰 News · Models & Research

Key Points

  • SpectralMoE is a parameter-efficient fine-tuning framework that uses a dual-gated Mixture-of-Experts to perform local, spatially adaptive refinement of foundation-model features for domain generalization in spectral remote sensing.
  • It routes visual and depth features to top-k experts through modality-specific gates, with depth estimated from selected RGB bands guiding the refinement.
  • A cross-attention mechanism then fuses the refined structural cues back into the visual stream, reducing semantic confusion caused by spectral shifts.
  • Extensive experiments show state-of-the-art results on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB imagery, highlighting robustness to unseen domains.
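
The paper summary does not give implementation details, so the routing described above can only be illustrated schematically. Below is a minimal PyTorch sketch of a dual-gated top-k Mixture-of-Experts in which visual and depth token streams each have their own gate over a shared pool of low-rank experts; all class and parameter names (`TopKGate`, `DualGatedMoE`, `rank`, etc.) are hypothetical, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKGate(nn.Module):
    """Scores each token against all experts and keeps the top-k,
    renormalizing the kept weights with a softmax."""
    def __init__(self, dim, num_experts, k=2):
        super().__init__()
        self.proj = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x):                        # x: (tokens, dim)
        logits = self.proj(x)                    # (tokens, num_experts)
        weights, idx = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the k chosen experts
        return weights, idx

class DualGatedMoE(nn.Module):
    """Two independent gates (visual / depth) route into one shared pool of
    low-rank experts, so each modality gets its own expert mixture."""
    def __init__(self, dim, num_experts=8, k=2, rank=16):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, rank), nn.GELU(), nn.Linear(rank, dim))
            for _ in range(num_experts)
        )
        self.visual_gate = TopKGate(dim, num_experts, k)
        self.depth_gate = TopKGate(dim, num_experts, k)

    def _route(self, x, gate):
        weights, idx = gate(x)                   # each (tokens, k)
        out = torch.zeros_like(x)
        for slot in range(idx.shape[-1]):        # accumulate the k expert outputs
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return x + out                           # residual local refinement

    def forward(self, visual, depth):
        return self._route(visual, self.visual_gate), self._route(depth, self.depth_gate)
```

The residual form (`x + out`) reflects the framing of the experts as a parameter-efficient *refinement* of frozen foundation-model features rather than a replacement for them.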

Abstract

Domain Generalization Semantic Segmentation (DGSS) in spectral remote sensing is severely challenged by spectral shifts across diverse acquisition conditions, which cause significant performance degradation for models deployed in unseen domains. While Parameter-Efficient Fine-Tuning (PEFT) on foundation models is a promising direction, existing methods employ global, homogeneous adjustments. This "one-size-fits-all" tuning struggles with the spatial heterogeneity of land cover, causing semantic confusion. We argue that the key to robust DGSS lies not in a single global adaptation, but in performing fine-grained, spatially-adaptive refinement of a foundation model's features. To achieve this, we propose SpectralMoE, a novel PEFT framework for DGSS. It operationalizes this principle by utilizing a Mixture-of-Experts (MoE) architecture to perform local precise refinement on the foundation model's features, incorporating depth features estimated from selected RGB bands of the spectral remote sensing imagery to guide the fine-tuning process. Specifically, SpectralMoE employs a dual-gated MoE architecture that independently routes visual and depth features to top-k selected experts for specialized refinement, enabling modality-specific adjustments. A subsequent cross-attention mechanism then judiciously fuses the refined structural cues into the visual stream, mitigating semantic ambiguities caused by spectral variations. Extensive experiments show that SpectralMoE sets a new state-of-the-art on multiple DGSS benchmarks across hyperspectral, multispectral, and RGB remote sensing imagery.
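
The cross-attention fusion step in the abstract (visual stream queries the refined depth cues) can likewise be sketched. The following is an illustrative PyTorch fragment, not the paper's implementation; the class name `DepthToVisualFusion` and the pre-norm residual layout are assumptions.

```python
import torch
import torch.nn as nn

class DepthToVisualFusion(nn.Module):
    """Cross-attention: visual tokens act as queries over refined depth tokens,
    and the attended structural cues are added back to the visual stream."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)   # pre-norm on the query (visual) stream
        self.norm_d = nn.LayerNorm(dim)   # pre-norm on the key/value (depth) stream

    def forward(self, visual, depth):     # each: (batch, tokens, dim)
        q = self.norm_v(visual)
        kv = self.norm_d(depth)
        fused, _ = self.attn(q, kv, kv)   # structural cues attended per visual token
        return visual + fused             # residual: fusion refines, never replaces
```

Making the visual stream the query side means depth only contributes where a visual token attends to it, which matches the stated goal of injecting structural cues to disambiguate spectrally confusable classes rather than overwriting the visual features.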