Learning to Select Visual In-Context Demonstrations
arXiv cs.LG / 3/31/2026
Key Points
- The paper analyzes how multimodal LLMs use in-context learning for visual tasks and argues that the common unsupervised kNN-based demonstration selection can be sub-optimal for factual regression because it over-selects redundant examples.
- It reframes demonstration selection as a sequential decision-making problem and proposes Learning to Select Demonstrations (LSD), which trains a reinforcement learning agent to build demonstration sets that maximize the MLLM’s downstream performance.
- The proposed LSD system uses a Dueling DQN with a query-centric Transformer decoder to learn a policy balancing visual relevance and diversity.
- Experiments across five visual regression benchmarks show that kNN selection remains strongest on subjective preference tasks, while LSD substantially outperforms the baselines on objective, factual regression tasks.
- The authors conclude that learned demonstration selection is necessary in certain visual ICL settings, particularly objective tasks where regression boundaries must be well defined.
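To make the kNN baseline the paper critiques concrete, here is a minimal sketch of unsupervised kNN demonstration selection: pick the k demonstrations whose embeddings are most cosine-similar to the query. All names and the embedding setup are illustrative assumptions, not the paper's code; note how a cluster of near-duplicate embeddings would dominate the top-k, which is the redundancy problem the paper identifies.

```python
# Hypothetical sketch of unsupervised kNN demonstration selection:
# rank pool items by cosine similarity to the query embedding, keep top-k.
import numpy as np

def knn_select(query_emb: np.ndarray, pool_embs: np.ndarray, k: int) -> list[int]:
    """Return indices of the k pool embeddings most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q                           # cosine similarity of each pool item
    return np.argsort(-sims)[:k].tolist()  # top-k indices, most similar first

# Toy usage: 2-D stand-ins for image embeddings; query matches pool item 2.
pool = np.array([[0.8, 0.6], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]])
query = np.array([0.7, 0.7])
print(knn_select(query, pool, k=2))  # → [2, 0]
```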
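The Dueling DQN at the core of LSD splits its estimate into a state value and per-action advantages. A minimal sketch of that aggregation, with each candidate demonstration treated as an action and previously selected demonstrations masked out, follows; the function names and the greedy masking step are assumptions for illustration, not details from the paper.

```python
# Sketch of the dueling aggregation Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a'),
# applied to scoring candidate demonstrations. Names are hypothetical.
import numpy as np

def dueling_q(value: float, advantages: np.ndarray) -> np.ndarray:
    """Combine a state value and per-action advantages into Q-values."""
    return value + advantages - advantages.mean()

def greedy_pick(value: float, advantages: np.ndarray, already_selected: set[int]) -> int:
    """Pick the highest-Q candidate demonstration not yet in the set."""
    q = dueling_q(value, advantages)
    q[list(already_selected)] = -np.inf  # mask demonstrations chosen earlier
    return int(np.argmax(q))

adv = np.array([0.2, 0.9, 0.4, 0.9])
print(greedy_pick(0.5, adv, {1}))  # → 3 (index 1 is masked, next-best is 3)
```

Subtracting the mean advantage keeps V and A identifiable; the sequential masking is one simple way to realize the paper's "build the demonstration set step by step" framing.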