Learning Progressive Adaptation for Multi-Modal Tracking

arXiv cs.CV / 3/24/2026


Key Points

  • The paper introduces PATrack, a progressive adaptation framework for multi-modal tracking that aims to better transfer RGB pre-trained models to modalities such as Thermal, Depth, and Event data.
  • It addresses limitations of common parameter-efficient fine-tuning by adding three coordinated adapter types: modality-dependent (enhances intra-modal representations via high-/low-frequency decomposition), modality-entangled (uses cross-attention to improve the reliability of inter-modal features), and a task-level adapter that adapts the prediction head to the fused information.
  • PATrack is designed to explicitly modulate adaptation at the single-modality level, the cross-modal interaction level, and the prediction-head level within one unified architecture.
  • Extensive experiments across RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks reportedly show performance gains over state-of-the-art approaches.
  • The authors provide code via a public GitHub repository to support reproducibility and further experimentation.
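The frequency-decomposition idea behind the modality-dependent adapter can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the function names, the moving-average low-pass filter, and the near-identity adapter initialization are all illustrative assumptions; the point is simply that tokens are split into smooth (low-frequency) and residual (high-frequency) parts, each projected by its own lightweight map, and fused back residually.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_pass(x, k=3):
    """Moving-average blur along the token axis as a cheap low-pass filter
    (an illustrative stand-in for the paper's frequency decomposition)."""
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(x.shape[0])])

def modality_dependent_adapter(x, W_low, W_high):
    """Split tokens into low/high-frequency components, project each with
    its own small linear map, and add the result back residually."""
    low = low_pass(x)    # smooth, low-frequency component
    high = x - low       # residual, high-frequency component
    return x + low @ W_low + high @ W_high

tokens = rng.standard_normal((16, 32))        # 16 tokens, embedding dim 32
W_low = rng.standard_normal((32, 32)) * 0.01  # small init keeps the
W_high = rng.standard_normal((32, 32)) * 0.01 # pre-trained features dominant
out = modality_dependent_adapter(tokens, W_low, W_high)
print(out.shape)
```

By construction the low- and high-frequency parts sum back to the input, so the adapter only perturbs the pre-trained representation rather than replacing it.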

Abstract

Due to the limited availability of paired multi-modal data, multi-modal trackers are typically built by adapting pre-trained RGB models with parameter-efficient fine-tuning modules. However, these fine-tuning methods overlook advanced adaptation strategies for RGB pre-trained models and fail to modulate individual modalities, cross-modal interactions, and the prediction head. To address these issues, we propose Progressive Adaptation for Multi-Modal Tracking (PATrack). This approach incorporates modality-dependent, modality-entangled, and task-level adapters, bridging the gap in adapting RGB pre-trained networks to multi-modal data through a progressive strategy. Specifically, modality-specific information is enhanced by the modality-dependent adapter, which decomposes features into high- and low-frequency components to ensure a more robust representation within each modality. Inter-modal interactions are introduced by the modality-entangled adapter, which applies a cross-attention operation guided by inter-modal shared information, ensuring the reliability of features exchanged between modalities. Additionally, recognising that the strong inductive bias of the prediction head does not adapt to the fused information, a task-level adapter specific to the prediction head is introduced. In summary, our design integrates intra-modal, inter-modal, and task-level adapters into a unified framework. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event tracking tasks demonstrate that our method performs favorably against state-of-the-art methods. Code is available at https://github.com/ouha1998/Learning-Progressive-Adaptation-for-Multi-Modal-Tracking.
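The modality-entangled adapter's use of shared information to guide cross-attention can be sketched as follows. Again, this is a hedged illustration, not the authors' code: the shared-token queries, the concatenated key/value pool, and all shapes are assumptions chosen to show the mechanism of routing information between the RGB and auxiliary streams through a small set of inter-modal shared tokens.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def entangled_cross_attention(shared, rgb, aux):
    """Cross-attention in which queries come from inter-modal shared tokens,
    so the information exchanged is filtered through what both streams share."""
    d = shared.shape[-1]
    kv = np.concatenate([rgb, aux], axis=0)     # keys/values from both modalities
    attn = softmax(shared @ kv.T / np.sqrt(d))  # (n_shared, n_rgb + n_aux)
    return attn @ kv                            # updated shared tokens

shared = rng.standard_normal((4, 32))  # hypothetical shared prompt tokens
rgb = rng.standard_normal((16, 32))    # RGB stream tokens
aux = rng.standard_normal((16, 32))    # auxiliary stream (thermal/depth/event)
out = entangled_cross_attention(shared, rgb, aux)
print(out.shape)
```

Because each attention row is a convex combination over tokens from both modalities, unreliable single-modality features are down-weighted rather than passed through wholesale, which is the reliability property the abstract attributes to this adapter.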