Mixture-of-Modality-Experts with Holistic Token Learning for Fine-Grained Multimodal Visual Analytics in Driver Action Recognition

arXiv cs.CV / 4/8/2026


Key Points

  • The paper introduces a Mixture-of-Modality-Experts (MoME) framework for multimodal learning settings where modality reliability varies with the input and fixed fusion modules or predefined cross-modal interactions fall short (a minimal sketch of the expert-gating idea follows this list).
  • It adds a Holistic Token Learning (HTL) strategy using class tokens and spatio-temporal tokens to refine each modality expert and transfer knowledge across experts for more fine-grained understanding.
  • The approach is framed as a knowledge-centric multimodal learning method that improves expert specialization while reducing ambiguity during multimodal fusion.
  • Experiments on a driver action recognition benchmark show that MoME combined with HTL outperforms both single-modal and multimodal baselines.
  • Ablation, validation, and visualization results are reported to confirm that HTL enhances subtle multimodal cues and improves interpretability.
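To make the expert-gating idea concrete, below is a minimal PyTorch-style sketch of modality-specific experts whose outputs are combined by a per-input gate. The class names, the modalities (RGB, depth, infrared), and the feature dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of per-input expert gating in a Mixture-of-Modality-Experts style.
# Assumes one expert per modality and a learned gate over concatenated features;
# names and sizes are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class ModalityExpert(nn.Module):
    """One expert per modality: projects clip features into a shared embedding."""

    def __init__(self, in_dim: int, embed_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoMEClassifier(nn.Module):
    """Adaptive collaboration: a gate weights each modality expert per sample."""

    def __init__(self, in_dims: dict[str, int], num_classes: int, embed_dim: int = 256):
        super().__init__()
        self.experts = nn.ModuleDict(
            {name: ModalityExpert(d, embed_dim) for name, d in in_dims.items()}
        )
        # The gate sees the concatenated raw features and scores each expert.
        self.gate = nn.Linear(sum(in_dims.values()), len(in_dims))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, inputs: dict[str, torch.Tensor]) -> torch.Tensor:
        names = list(self.experts.keys())
        expert_out = torch.stack([self.experts[n](inputs[n]) for n in names], dim=1)  # (B, E, D)
        gate_logits = self.gate(torch.cat([inputs[n] for n in names], dim=-1))        # (B, E)
        weights = gate_logits.softmax(dim=-1).unsqueeze(-1)                           # (B, E, 1)
        fused = (weights * expert_out).sum(dim=1)                                     # (B, D)
        return self.head(fused)


if __name__ == "__main__":
    model = MoMEClassifier({"rgb": 512, "depth": 512, "ir": 512}, num_classes=10)
    batch = {m: torch.randn(4, 512) for m in ("rgb", "depth", "ir")}
    print(model(batch).shape)  # torch.Size([4, 10])
```

The softmax gate lets the model down-weight an unreliable modality on a per-sample basis (for example, a poorly lit RGB view at night), which is the adaptive collaboration among experts that the key points describe.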

Abstract

Robust multimodal visual analytics remains challenging when heterogeneous modalities provide complementary but input-dependent evidence for decision-making. Existing multimodal learning methods mainly rely on fixed fusion modules or predefined cross-modal interactions, which are often insufficient to adapt to changing modality reliability and to capture fine-grained action cues. To address this issue, we propose a Mixture-of-Modality-Experts (MoME) framework with a Holistic Token Learning (HTL) strategy. MoME enables adaptive collaboration among modality-specific experts, while HTL improves both intra-expert refinement and inter-expert knowledge transfer through class tokens and spatio-temporal tokens. In this way, our method forms a knowledge-centric multimodal learning framework that improves expert specialization while reducing ambiguity in multimodal fusion. We validate the proposed framework on driver action recognition as a representative multimodal understanding task. The experimental results on the public benchmark show that the proposed MoME framework and the HTL strategy jointly outperform representative single-modal and multimodal baselines. Additional ablation, validation, and visualization results further verify that the proposed HTL strategy improves subtle multimodal understanding and offers better interpretability.
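As one way to picture the class-token and spatio-temporal-token mechanics described above, the following sketch gives each expert a learnable class token that is first refined over that expert's own spatio-temporal tokens (intra-expert refinement) and then enriched by attending to another expert's tokens (inter-expert transfer). The specific token-exchange scheme, layer counts, and names (`TokenExpert`, `refine`, `transfer`) are assumptions for illustration, not the paper's exact HTL design.

```python
# Minimal sketch of class tokens over spatio-temporal tokens, with a simple
# cross-attention step standing in for inter-expert knowledge transfer.
# The exact design in the paper may differ; this only illustrates the idea.
import torch
import torch.nn as nn


class TokenExpert(nn.Module):
    """One modality expert over spatio-temporal tokens plus a learnable class token."""

    def __init__(self, dim: int = 256, depth: int = 2, heads: int = 4):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def refine(self, tokens: torch.Tensor) -> torch.Tensor:
        """Intra-expert refinement: the class token attends over this expert's tokens."""
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        return self.encoder(torch.cat([cls, tokens], dim=1))  # (B, 1+N, D)

    def transfer(self, own: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        """Inter-expert transfer: this expert's class token queries another expert's tokens."""
        cls = own[:, :1]                         # refined class token of this expert
        borrowed, _ = self.cross(cls, other, other)
        return cls + borrowed                    # (B, 1, D) enriched class token


if __name__ == "__main__":
    rgb_expert, depth_expert = TokenExpert(), TokenExpert()
    rgb_tokens = torch.randn(2, 49 * 8, 256)     # e.g. 7x7 patches over 8 frames
    depth_tokens = torch.randn(2, 49 * 8, 256)
    rgb_ref = rgb_expert.refine(rgb_tokens)
    depth_ref = depth_expert.refine(depth_tokens)
    rgb_cls = rgb_expert.transfer(rgb_ref, depth_ref)  # knowledge from depth into rgb
    print(rgb_cls.shape)  # torch.Size([2, 1, 256])
```

In this reading, the class token summarizes a modality for recognition while the spatio-temporal tokens preserve the fine-grained cues (hand position over time, gaze shifts) that subtle driver actions depend on; the cross-attention step is a plausible stand-in for the knowledge transfer between experts that the abstract describes.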