From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

arXiv cs.CV / 4/9/2026


Key Points

  • The paper addresses a key weakness of visual in-context learning models: they adapt to new tasks using examples but cannot directly incorporate user guidance such as scribbles, clicks, or bounding boxes to steer predictions.
  • It proposes a method to convert static visual in-context learners (notably DeLVM) into an interactive, user-controlled system called Interactive DeLVM by encoding user interactions directly into the example input-output pairs (see the sketch after this list).
  • The approach preserves the core idea of visual in-context learning—supporting unseen interaction patterns without task-specific fine-tuning—while enabling users to dynamically refine outputs.
  • Experiments show that state-of-the-art visual in-context learning models often ignore interaction cues, whereas Interactive DeLVM improves interactive segmentation (+7.95% IoU), directed super-resolution (+2.46 PSNR), and interactive object removal (-3.14% LPIPS).
  • Overall, the work aims to bridge the gap between rigid static task adaptation and fluid, user-centric visual interactivity for real-world applications.
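
How interactions can be folded into the in-context format is easiest to see in code. The sketch below is a minimal illustration, not the authors' implementation: it assumes a DeLVM-style model that consumes a flat sequence of input-output image pairs, and it encodes each interaction (scribble, click, or box rasterized as a binary mask) as a colored overlay on the corresponding input image. The function names, overlay scheme, and alpha value are assumptions made for illustration.

```python
import numpy as np

def overlay_interaction(image: np.ndarray, interaction_mask: np.ndarray,
                        color=(255, 0, 0), alpha=0.6) -> np.ndarray:
    """Render a user interaction (scribble/click/box mask) onto an RGB image.

    image:            H x W x 3 uint8 array.
    interaction_mask: H x W boolean array marking the interacted pixels.
    The colored overlay turns the interaction into a purely visual cue, so the
    in-context model can treat it like any other image content.
    """
    out = image.astype(np.float32)
    tint = np.array(color, dtype=np.float32)
    out[interaction_mask] = (1.0 - alpha) * out[interaction_mask] + alpha * tint
    return out.astype(np.uint8)

def build_interactive_prompt(examples, query_image, query_interaction):
    """Assemble an in-context prompt in which every input carries its interaction.

    examples: list of (input_image, interaction_mask, target_image) triples that
              demonstrate how the interaction should steer the output.
    Returns a flat [in_1, out_1, ..., in_k, out_k, query_in] image sequence that
    a static visual in-context learner can consume without architectural changes.
    """
    prompt = []
    for inp, mask, target in examples:
        prompt.append(overlay_interaction(inp, mask))  # interaction baked into the example input
        prompt.append(target)                          # desired output for that interaction
    prompt.append(overlay_interaction(query_image, query_interaction))
    return prompt
```

Because the guidance lives entirely in the pixels of the prompt, unseen interaction types can in principle be demonstrated at inference time with a few example pairs, which is what keeps the fine-tuning-free spirit of visual in-context learning intact.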

Abstract

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
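
The dynamic refinement described in the abstract can be pictured as a simple loop: the user adds guidance, the prompt is rebuilt, and the model predicts again. The sketch below assumes the `build_interactive_prompt` helper from the earlier sketch plus two hypothetical callables, `run_in_context` (the model's prediction step) and `get_user_feedback` (a UI hook that returns extra clicks or `None` when the user accepts the result); none of these names come from the paper.

```python
import numpy as np

def refine_interactively(model, examples, query_image, run_in_context, get_user_feedback):
    """Hypothetical interaction loop: accumulate user guidance until the result is accepted."""
    guidance = np.zeros(query_image.shape[:2], dtype=bool)  # start with no interaction
    while True:
        prompt = build_interactive_prompt(examples, query_image, guidance)
        prediction = run_in_context(model, prompt)   # model-specific decoding, assumed
        new_clicks = get_user_feedback(prediction)   # boolean H x W mask, or None to accept
        if new_clicks is None:
            return prediction
        guidance |= new_clicks                       # fold new guidance in and re-prompt
```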