UniICL: Systematizing Unified Multimodal In-context Learning through a Capability-Oriented Taxonomy

arXiv cs.CV / 3/27/2026


Key Points

  • The paper argues that multimodal in-context learning is especially sensitive to demonstration selection and formatting because of cross-modal interference and differing cognitive demands across tasks, leading to non-monotonic, task-dependent performance.
  • It introduces a six-level, capability-oriented taxonomy that diagnoses and systematically categorizes the functional role demonstrations serve, from basic perception to higher-order discernment.
  • The authors build UniICL-760K (curated 8-shot episodes across 15 subtasks) and UniICL-Bench for controlled evaluation of unified multimodal in-context learning.
  • To stabilize few-shot adaptation, they propose the Context-Adaptive Prototype Modulator, a lightweight plug-and-play architectural module.
  • Experiments on UniICL-Bench show competitive unified multimodal results, outperforming larger-parameter multimodal LLM baselines on most understanding-focused in-context learning tasks; data and code are planned for release.
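The summary does not describe how the Context-Adaptive Prototype Modulator works internally. As a purely illustrative sketch (not the paper's actual module), a "prototype modulator" in the few-shot setting could mean: pool the demonstration embeddings into per-class prototypes, then blend the query embedding toward its nearest prototype to stabilize adaptation. The function name, blending scheme, and `alpha` parameter below are all assumptions for illustration:

```python
import numpy as np

def context_adaptive_modulation(query, demos, labels, alpha=0.5):
    """Illustrative sketch only -- NOT the paper's module.

    Builds a mean-embedding prototype per class from the in-context
    demonstrations, finds the prototype most similar to the query
    (cosine similarity), and returns a convex blend of the query and
    that prototype. The blend acts as a lightweight, plug-and-play
    residual modulation of the query feature.
    """
    classes = sorted(set(labels))
    # one prototype per class: mean of that class's demo embeddings
    protos = np.stack([
        demos[[i for i, l in enumerate(labels) if l == c]].mean(axis=0)
        for c in classes
    ])
    # cosine similarity of each prototype to the query
    sims = protos @ query / (
        np.linalg.norm(protos, axis=1) * np.linalg.norm(query) + 1e-8
    )
    nearest = protos[np.argmax(sims)]
    # convex blend toward the nearest prototype
    return (1 - alpha) * query + alpha * nearest
```

With `alpha=0` the query passes through untouched, which is what makes a module like this "plug-and-play": it can be inserted without changing the base model's behavior until the blend weight is turned up.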

Abstract

In-context Learning enables training-free adaptation via demonstrations but remains highly sensitive to example selection and formatting. In unified multimodal models spanning understanding and generation, this sensitivity is exacerbated by cross-modal interference and varying cognitive demands. Consequently, In-context Learning efficacy is often non-monotonic and highly task-dependent. To diagnose these behaviors, we introduce a six-level capability-oriented taxonomy that categorizes the functional role of demonstrations from basic perception to high-order discernment. Guided by this cognitive framework, we construct UniICL-760K, a large-scale corpus featuring curated 8-shot In-context Learning episodes across 15 subtasks, alongside UniICL-Bench for rigorous, controlled evaluation. As an architectural intervention to stabilize few-shot adaptation, we propose the Context-Adaptive Prototype Modulator, a lightweight, plug-and-play module. Evaluations on UniICL-Bench show that our approach yields highly competitive unified results, outperforming larger-parameter multimodal large language model baselines on most understanding In-context Learning tasks. Data and code will be available soon at https://github.com/xuyicheng-zju/UniICL.