Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

arXiv cs.CV / 5/6/2026


Key Points

  • The paper proposes OneTrackerV2, a unified end-to-end multimodal visual object tracking framework designed to handle any input modality without training separate models per modality.
  • It introduces Meta Merger to map multimodal information into a shared representation space, enabling flexible modality fusion and improving robustness.
  • It presents Dual Mixture-of-Experts (DMoE), where one MoE component (T-MoE) models spatio-temporal relationships for tracking and another (M-MoE) embeds multimodal knowledge to reduce feature conflicts and disentangle cross-modal dependencies.
  • OneTrackerV2 reports state-of-the-art results across five RGB and RGB+X tracking tasks and 12 benchmarks, while keeping high inference efficiency and maintaining performance after model compression.
  • The method also shows strong robustness when modalities are missing during inference, addressing a key practical limitation of many existing multimodal trackers.
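The Dual Mixture-of-Experts design described above can be illustrated with a toy sketch. The paper's actual implementation is not reproduced here; this is a minimal, hypothetical version assuming top-1 token routing and simple linear experts, with T-MoE and M-MoE applied as two residual expert layers (all names, shapes, and routing choices are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyMoE:
    """Toy top-1 mixture-of-experts: a router picks one linear expert per token."""
    def __init__(self, dim, n_experts):
        self.router = rng.standard_normal((dim, n_experts)) * 0.02
        self.experts = rng.standard_normal((n_experts, dim, dim)) * 0.02

    def __call__(self, x):                  # x: (tokens, dim)
        gates = softmax(x @ self.router)    # (tokens, n_experts)
        pick = gates.argmax(axis=-1)        # top-1 expert index per token
        out = np.empty_like(x)
        for i, e in enumerate(pick):
            out[i] = gates[i, e] * (x[i] @ self.experts[e])
        return out

class DualMoEBlock:
    """Hypothetical DMoE block: T-MoE mixes spatio-temporal token features,
    then M-MoE routes tokens to modality-knowledge experts, each residually."""
    def __init__(self, dim, n_experts=4):
        self.t_moe = ToyMoE(dim, n_experts)
        self.m_moe = ToyMoE(dim, n_experts)

    def __call__(self, x):
        x = x + self.t_moe(x)   # spatio-temporal experts (T-MoE)
        x = x + self.m_moe(x)   # multimodal-knowledge experts (M-MoE)
        return x

tokens = rng.standard_normal((6, 16))   # e.g. fused RGB+X tokens
out = DualMoEBlock(16)(tokens)          # out.shape == (6, 16)
```

Routing tokens to separate expert sets is one plausible way to reduce feature conflicts between modalities, since tokens from different modalities need not share the same expert weights.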

Abstract

Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking) based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models adapted to new modalities, which limits efficiency, scalability, and usability. We therefore introduce OneTrackerV2, a unified multimodal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multimodal information into a unified space, allowing flexible modality fusion and improved robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multimodal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training run, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks while maintaining high inference efficiency. Notably, OneTrackerV2 retains strong performance even after model compression, and it demonstrates remarkable robustness under modality-missing scenarios.
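The idea of embedding any set of modalities into a unified space, and of staying robust when a modality is missing, can be sketched as follows. This is not the paper's Meta Merger; it is a hypothetical stand-in assuming one learned projection per modality and a simple average over whichever modalities are present (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
RAW_DIM, SHARED_DIM = 8, 16

# Hypothetical per-modality projections into one shared token space
projections = {m: rng.standard_normal((RAW_DIM, SHARED_DIM)) * 0.1
               for m in ("rgb", "depth", "thermal", "event", "language")}

def merge_modalities(inputs):
    """Project whichever modalities are present into the shared space and
    average them; absent modalities are simply skipped, so the same tracker
    can run under modality-missing conditions without retraining."""
    merged = [feats @ projections[m] for m, feats in inputs.items()]
    return np.mean(merged, axis=0)

full = {"rgb": rng.standard_normal((4, RAW_DIM)),
        "depth": rng.standard_normal((4, RAW_DIM))}
rgb_only = {"rgb": full["rgb"]}          # depth stream missing at inference

fused_full = merge_modalities(full)      # shape (4, SHARED_DIM)
fused_rgb = merge_modalities(rgb_only)   # same shape despite missing depth
```

Because every modality lands in the same representation space, the downstream tracker sees tokens of a fixed shape regardless of which inputs are available, which is one way a single set of parameters can serve all five RGB and RGB+X tasks.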