AI Navigate

Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

arXiv cs.CL / 3/20/2026


Key Points

  • Introduces a benchmark for task interference in multimodal LLMs, covering six tasks with systematic history–target variation along three axes: modality mismatch, reasoning mismatch, and answer format mismatch.
  • Finds that interference is directionally biased: switching from a text-only history to image-based targets causes severe degradation, while the reverse transition (image history to text targets) causes far less.
  • Demonstrates that co-occurring mismatches amplify interference and that modality differences are the strongest driver, followed by answer format, with reasoning requirement shifts having minimal impact.
  • Includes experiments on both open-weight and proprietary models, highlighting practical implications for multimodal dialogue system design.
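As a rough sketch of how a condition grid over the three mismatch axes might be enumerated, the snippet below builds every combination of match/mismatch settings. All names here are illustrative assumptions, not from the paper:

```python
from itertools import product

# Hypothetical illustration: the benchmark varies the history-target
# relationship along three binary axes (match vs. mismatch).
AXES = {
    "modality": ["match", "mismatch"],       # e.g. text-only history -> image target
    "reasoning": ["match", "mismatch"],      # shift in reasoning requirement
    "answer_format": ["match", "mismatch"],  # e.g. free-form vs. multiple choice
}

def enumerate_conditions():
    """Enumerate all history-target conditions across the three axes."""
    keys = list(AXES)
    return [dict(zip(keys, vals)) for vals in product(*(AXES[k] for k in keys))]

conditions = enumerate_conditions()
print(len(conditions))  # 8 conditions: 1 fully matched, 7 with at least one mismatch
```

Evaluating a model on each cell of this grid, per task pair, would let one attribute degradation to individual axes or to their co-occurrence, which is the kind of analysis the paper reports.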

Abstract

Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of the history–target relationship along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weight and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.