Unified Multimodal Visual Tracking with Dual Mixture-of-Experts
arXiv cs.CV / 5/6/2026
Key Points
- The paper proposes OneTrackerV2, a unified end-to-end multimodal visual object tracking framework designed to handle any input modality without training separate models per modality.
- It introduces Meta Merger to map multimodal information into a shared representation space, enabling flexible modality fusion and improving robustness.
- It presents a Dual Mixture-of-Experts (DMoE) design, in which one expert pool (T-MoE) models spatio-temporal relationships for tracking while the other (M-MoE) embeds multimodal knowledge, reducing feature conflicts and disentangling cross-modal dependencies (a minimal sketch of Meta Merger and DMoE follows this list).
- OneTrackerV2 reports state-of-the-art results across five RGB and RGB+X tracking tasks spanning 12 benchmarks, while retaining high inference efficiency and preserving performance after model compression.
- The method also shows strong robustness when modalities are missing during inference, addressing a key practical limitation of many existing multimodal trackers.