AI Navigate

Revisiting Cross-Attention Mechanisms: Leveraging Beneficial Noise for Domain-Adaptive Learning

arXiv cs.CV / March 19, 2026


Key Points

  • The paper introduces beneficial noise to regularize cross-attention in unsupervised domain adaptation, encouraging the model to ignore style distractions and focus on content.
  • It proposes the Domain-Adaptive Transformer (DAT) to disentangle domain-shared content from domain-specific style.
  • It also introduces the Cross-Scale Matching (CSM) module to align features across multiple resolutions while preserving semantic consistency.
  • The full framework, DACSM (Domain-Adaptive Cross-Scale Matching), achieves state-of-the-art performance across VisDA-2017, Office-Home, and DomainNet, including a +2.3% improvement over CDTrans on VisDA-2017 and a +5.9% gain on the "truck" class.
  • The work demonstrates that combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment can yield robust, content-consistent representations for cross-domain learning.
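The core idea of beneficial noise can be sketched as injecting a controlled perturbation into the cross-attention logits during training, so the attention map cannot latch onto style-specific key patterns. The sketch below is a minimal illustration of that mechanism in NumPy; the noise form (additive Gaussian on the logits) and the `noise_std` parameter are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def noisy_cross_attention(queries, keys, values, noise_std=0.1,
                          training=True, rng=None):
    """Cross-attention with a controlled perturbation on the logits.

    Illustrative sketch of "beneficial noise": the additive Gaussian on
    the attention logits is an assumed instantiation, not DACSM's exact
    design.
    """
    rng = rng or np.random.default_rng(0)
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)          # (n_q, n_k)
    if training:
        # Perturb the logits so the model cannot rely on brittle,
        # style-specific query-key alignments.
        logits = logits + rng.normal(0.0, noise_std, size=logits.shape)
    attn = softmax(logits, axis=-1)                 # rows sum to 1
    return attn @ values                            # (n_q, d_v)
```

At inference (`training=False`) the noise is disabled and the layer reduces to standard scaled dot-product cross-attention.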

Abstract

Unsupervised Domain Adaptation (UDA) seeks to transfer knowledge from a labeled source domain to an unlabeled target domain but often suffers from severe domain and scale gaps that degrade performance. Existing cross-attention-based transformers can align features across domains, yet they struggle to preserve content semantics under large appearance and scale variations. To explicitly address these challenges, we introduce the concept of beneficial noise, which regularizes cross-attention by injecting controlled perturbations, encouraging the model to ignore style distractions and focus on content. We propose the Domain-Adaptive Cross-Scale Matching (DACSM) framework, which consists of a Domain-Adaptive Transformer (DAT) for disentangling domain-shared content from domain-specific style, and a Cross-Scale Matching (CSM) module that adaptively aligns features across multiple resolutions. DAT incorporates beneficial noise into cross-attention, enabling progressive domain translation with enhanced robustness, yielding content-consistent and style-invariant representations. Meanwhile, CSM ensures semantic consistency under scale changes. Extensive experiments on VisDA-2017, Office-Home, and DomainNet demonstrate that DACSM achieves state-of-the-art performance, with up to +2.3% improvement over CDTrans on VisDA-2017. Notably, DACSM achieves a +5.9% gain on the challenging "truck" class of VisDA, evidencing the strength of beneficial noise in handling scale discrepancies. These results highlight the effectiveness of combining domain translation, beneficial-noise-enhanced attention, and scale-aware alignment for robust cross-domain representation learning.
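The scale-aware alignment that CSM performs can be pictured as comparing source and target features at several pooled resolutions rather than at a single one. The following NumPy sketch illustrates that idea under stated assumptions: the non-overlapping average-pooling pyramid, the MSE criterion, and the `scales` choice are all hypothetical stand-ins, since the abstract does not specify CSM's internals.

```python
import numpy as np

def block_pool(feat, s):
    # Average-pool a (C, H, W) feature map with non-overlapping s x s
    # windows; H and W are assumed divisible by s.
    c, h, w = feat.shape
    return feat.reshape(c, h // s, s, w // s, s).mean(axis=(2, 4))

def cross_scale_matching_loss(src_feat, tgt_feat, scales=(1, 2, 4)):
    """Mean-squared alignment loss between source and target features
    at several resolutions.

    A minimal sketch of scale-aware alignment; the real CSM module
    aligns features adaptively and is more involved than this.
    """
    losses = []
    for s in scales:
        ps, pt = block_pool(src_feat, s), block_pool(tgt_feat, s)
        losses.append(np.mean((ps - pt) ** 2))
    # Averaging across scales penalizes mismatches at every resolution,
    # encouraging semantic consistency under scale changes.
    return float(np.mean(losses))
```

Identical feature maps yield zero loss at every scale, while a mismatch at any resolution raises the average, which is the intuition behind penalizing misalignment across the whole pyramid.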