AI Navigate

Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

arXiv cs.CL / 3/20/2026


Key Points

  • Introduces a benchmark for task interference in multimodal LLMs, covering six tasks with systematic history–target variation along three axes: modality mismatch, reasoning mismatch, and answer format mismatch.
  • Finds that interference is directionally biased: switching from a text-only history to image-based targets causes severe degradation, while the reverse transition (image history to text targets) causes far less.
  • Demonstrates that co-occurring mismatches amplify interference and that modality differences are the strongest driver, followed by answer format, with reasoning requirement shifts having minimal impact.
  • Includes experiments on both open-weight and proprietary models, highlighting practical implications for multimodal dialogue system design.
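As a rough sketch of how a condition grid over the three mismatch axes might be enumerated, the snippet below builds every combination of match/mismatch settings. All names here are illustrative assumptions, not from the paper:

```python
from itertools import product

# Hypothetical illustration: the benchmark varies the history-target
# relationship along three binary axes (match vs. mismatch).
AXES = {
    "modality": ["match", "mismatch"],       # e.g. text-only history -> image target
    "reasoning": ["match", "mismatch"],      # shift in reasoning requirement
    "answer_format": ["match", "mismatch"],  # e.g. free-form vs. multiple choice
}

def enumerate_conditions():
    """Enumerate all history-target conditions across the three axes."""
    keys = list(AXES)
    return [dict(zip(keys, vals)) for vals in product(*(AXES[k] for k in keys))]

conditions = enumerate_conditions()
print(len(conditions))  # 8 conditions: 1 fully matched, 7 with at least one mismatch
```

Evaluating a model on each cell of this grid, per task pair, would let one attribute degradation to individual axes or to their co-occurrence, which is the kind of analysis the paper reports.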

Abstract

Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of the history–target relationship along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weight and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.