Learning to Select Visual In-Context Demonstrations

arXiv cs.LG / 3/31/2026


Key Points

  • The paper analyzes how multimodal LLMs use in-context learning for visual tasks and argues that the common unsupervised kNN-based demonstration selection can be sub-optimal for factual regression tasks, because its similarity-first search over-selects redundant examples that fail to cover the task's output range.
  • It reframes demonstration selection as a sequential decision-making problem and proposes Learning to Select Demonstrations (LSD), which trains a reinforcement learning agent to build demonstration sets that maximize the MLLM’s downstream performance.
  • The proposed LSD system uses a Dueling DQN with a query-centric Transformer decoder to learn a policy balancing visual relevance and diversity.
  • Experiments across five visual regression benchmarks show a key outcome: kNN is still best for subjective preference tasks, but LSD substantially outperforms baselines on objective, factual regression tasks.
  • The authors conclude that learned demonstration selection is strictly necessary for certain visual ICL settings, especially where regression boundaries must be well defined.
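To make the selection mechanism concrete, here is a minimal NumPy sketch of the Dueling DQN value decomposition the agent relies on, Q(s, a) = V(s) + A(s, a) − mean_a A(s, a), where each action corresponds to adding one candidate demonstration. The linear heads and the `state` embedding are illustrative stand-ins; the paper's actual architecture uses a query-centric Transformer decoder, which is not reproduced here.

```python
import numpy as np

def dueling_q_values(state_feat, W_v, W_a):
    """Dueling decomposition: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a).

    W_v / W_a are hypothetical linear value/advantage heads standing in
    for the paper's query-centric Transformer decoder.
    """
    V = state_feat @ W_v          # scalar state value, shape (1,)
    A = state_feat @ W_a          # per-action advantages, shape (n_candidates,)
    return V + A - A.mean()       # broadcasting yields one Q-value per candidate

# Toy step: greedily pick the next demonstration (action) with the highest Q.
rng = np.random.default_rng(0)
d, n_candidates = 8, 5
state = rng.normal(size=d)               # query-centric state embedding (assumed)
W_v = rng.normal(size=(d, 1))
W_a = rng.normal(size=(d, n_candidates))
q = dueling_q_values(state, W_v, W_a)
best = int(np.argmax(q))                 # index of the demonstration to add next
```

Subtracting the mean advantage makes the V/A split identifiable, so the agent can learn the value of the current demonstration set separately from the relative merit of each remaining candidate; the full method would roll this step out sequentially until the demonstration set is complete.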

Abstract

Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD significantly outperforms baselines on objective, factual regression tasks. By balancing visual relevance with diversity, LSD better defines regression boundaries, illuminating when learned selection is strictly necessary for visual ICL.