Unified Multimodal Visual Tracking with Dual Mixture-of-Experts

arXiv cs.CV / 5/6/2026


Key Points

  • The paper proposes OneTrackerV2, a unified end-to-end multimodal visual object tracking framework designed to handle any input modality without training separate models per modality.
  • It introduces Meta Merger to map multimodal information into a shared representation space, enabling flexible modality fusion and improving robustness.
  • It presents Dual Mixture-of-Experts (DMoE), where one MoE component (T-MoE) models spatio-temporal relationships for tracking and another (M-MoE) embeds multimodal knowledge to reduce feature conflicts and disentangle cross-modal dependencies.
  • OneTrackerV2 reports state-of-the-art results across five RGB and RGB+X tracking tasks and 12 benchmarks, while keeping high inference efficiency and maintaining performance after model compression.
  • The method also shows strong robustness when modalities are missing during inference, addressing a key practical limitation of many existing multimodal trackers.
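The Dual Mixture-of-Experts design described above can be illustrated with a toy sketch. The paper's actual implementation is not reproduced here; this is a minimal, hypothetical version assuming top-1 token routing and simple linear experts, with T-MoE and M-MoE applied as two residual expert layers (all names, shapes, and routing choices are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ToyMoE:
    """Toy top-1 mixture-of-experts: a router picks one linear expert per token."""
    def __init__(self, dim, n_experts):
        self.router = rng.standard_normal((dim, n_experts)) * 0.02
        self.experts = rng.standard_normal((n_experts, dim, dim)) * 0.02

    def __call__(self, x):                  # x: (tokens, dim)
        gates = softmax(x @ self.router)    # (tokens, n_experts)
        pick = gates.argmax(axis=-1)        # top-1 expert index per token
        out = np.empty_like(x)
        for i, e in enumerate(pick):
            out[i] = gates[i, e] * (x[i] @ self.experts[e])
        return out

class DualMoEBlock:
    """Hypothetical DMoE block: T-MoE mixes spatio-temporal token features,
    then M-MoE routes tokens to modality-knowledge experts, each residually."""
    def __init__(self, dim, n_experts=4):
        self.t_moe = ToyMoE(dim, n_experts)
        self.m_moe = ToyMoE(dim, n_experts)

    def __call__(self, x):
        x = x + self.t_moe(x)   # spatio-temporal experts (T-MoE)
        x = x + self.m_moe(x)   # multimodal-knowledge experts (M-MoE)
        return x

tokens = rng.standard_normal((6, 16))   # e.g. fused RGB+X tokens
out = DualMoEBlock(16)(tokens)          # out.shape == (6, 16)
```

Routing tokens to separate expert sets is one plausible way to reduce feature conflicts between modalities, since tokens from different modalities need not share the same expert weights.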

Abstract

Multimodal visual object tracking can be divided into several kinds of tasks (e.g., RGB and RGB+X tracking) based on the input modality. Existing methods often train separate models for each modality or rely on pretrained models adapted to new modalities, which limits efficiency, scalability, and usability. We therefore introduce OneTrackerV2, a unified multimodal tracking framework that enables end-to-end training for any modality. We propose Meta Merger to embed multimodal information into a unified space, allowing flexible modality fusion and improved robustness. We further introduce Dual Mixture-of-Experts (DMoE): T-MoE models spatio-temporal relations for tracking, while M-MoE embeds multimodal knowledge, disentangling cross-modal dependencies and reducing feature conflicts. With a shared architecture, unified parameters, and a single end-to-end training run, OneTrackerV2 achieves state-of-the-art performance across five RGB and RGB+X tracking tasks and 12 benchmarks while maintaining high inference efficiency. Notably, OneTrackerV2 retains strong performance even after model compression, and it demonstrates remarkable robustness under modality-missing scenarios.
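The idea of embedding any set of modalities into a unified space, and of staying robust when a modality is missing, can be sketched as follows. This is not the paper's Meta Merger; it is a hypothetical stand-in assuming one learned projection per modality and a simple average over whichever modalities are present (all names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
RAW_DIM, SHARED_DIM = 8, 16

# Hypothetical per-modality projections into one shared token space
projections = {m: rng.standard_normal((RAW_DIM, SHARED_DIM)) * 0.1
               for m in ("rgb", "depth", "thermal", "event", "language")}

def merge_modalities(inputs):
    """Project whichever modalities are present into the shared space and
    average them; absent modalities are simply skipped, so the same tracker
    can run under modality-missing conditions without retraining."""
    merged = [feats @ projections[m] for m, feats in inputs.items()]
    return np.mean(merged, axis=0)

full = {"rgb": rng.standard_normal((4, RAW_DIM)),
        "depth": rng.standard_normal((4, RAW_DIM))}
rgb_only = {"rgb": full["rgb"]}          # depth stream missing at inference

fused_full = merge_modalities(full)      # shape (4, SHARED_DIM)
fused_rgb = merge_modalities(rgb_only)   # same shape despite missing depth
```

Because every modality lands in the same representation space, the downstream tracker sees tokens of a fixed shape regardless of which inputs are available, which is one way a single set of parameters can serve all five RGB and RGB+X tasks.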