From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

arXiv cs.CV / 4/9/2026

Key Points

The paper addresses a key weakness of visual in-context learning models: they adapt to new tasks using examples but cannot directly incorporate user guidance such as scribbles, clicks, or bounding boxes to steer predictions.
It proposes a method to convert static visual in-context learners (notably DeLVM) into an interactive, user-controlled system called Interactive DeLVM by encoding user interactions into the example input-output pairs.
The approach preserves the core idea of visual in-context learning (supporting unseen interaction patterns without task-specific fine-tuning) while enabling users to dynamically refine outputs.
Experiments show that state-of-the-art visual in-context learning models often ignore interaction cues, whereas Interactive DeLVM improves interactive segmentation (+7.95% IoU), directed super-resolution (+2.46 PSNR), and interactive object removal (-3.14% LPIPS).
Overall, the work aims to bridge the gap between rigid static task adaptation and fluid, user-centric visual interactivity for real-world applications.
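To make the idea of "encoding user interactions into the example input-output pairs" more concrete, here is a minimal sketch of one way a click could be drawn into the input images of an in-context prompt. The summary does not say how Interactive DeLVM renders interactions or orders its prompt, so the function names `draw_click` and `build_prompt`, the red-disc rendering, and the flat sequence layout are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def draw_click(image, row, col, radius=3, color=(255, 0, 0)):
    """Return a copy of an (H, W, 3) uint8 image with a filled disc at (row, col).

    Rendering a click as a colored disc is an assumption made for illustration.
    """
    h, w = image.shape[:2]
    rows, cols = np.ogrid[:h, :w]
    disc = (rows - row) ** 2 + (cols - col) ** 2 <= radius ** 2
    marked = image.copy()
    marked[disc] = color
    return marked

def build_prompt(examples, query_image, query_click):
    """Assemble a hypothetical prompt: [in_1, out_1, ..., in_k, out_k, query_in].

    Each example is (image, (row, col), target). The click is drawn into every
    input image, so the interaction is part of the example pairs themselves.
    """
    sequence = []
    for image, click, target in examples:
        sequence.append(draw_click(image, *click))
        sequence.append(target)
    sequence.append(draw_click(query_image, *query_click))
    return sequence

# Toy data: one random image and a square mask standing in for a segmentation target.
rng = np.random.default_rng(0)
image = rng.integers(0, 200, size=(32, 32, 3), dtype=np.uint8)
mask = np.zeros((32, 32, 3), dtype=np.uint8)
mask[8:16, 8:16] = 255

prompt = build_prompt([(image, (12, 12), mask)], image, (20, 20))
```

In this sketch the model itself is untouched: a new interaction type (a scribble or a box, say) would only require a different drawing function, which is in keeping with the no-fine-tuning idea described above.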

Abstract

Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely. In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of +7.95% IoU for interactive segmentation, +2.46 PSNR for directed super-resolution, and -3.14% LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
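The reported gains are stated in terms of IoU, PSNR, and LPIPS. As a reminder of what the first two measure, the sketch below gives their standard textbook definitions in NumPy. It is not the paper's evaluation code, and the function names and the 8-bit `max_value` default are assumptions. LPIPS is left out because it compares deep network features and needs pretrained weights.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Intersection over union of two binary masks (higher is better)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        # Both masks are empty; treating this as a perfect match is a convention.
        return 1.0
    return float(np.logical_and(pred, true).sum() / union)

def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images (higher is better)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy check: the predicted mask covers half of the true mask's area.
pred_mask = np.zeros((4, 4), dtype=np.uint8)
pred_mask[0:2, 0:2] = 1
true_mask = np.zeros((4, 4), dtype=np.uint8)
true_mask[0:2, 0:4] = 1
half_overlap = iou(pred_mask, true_mask)
```

Note the direction of each metric: IoU and PSNR improve as they rise, whereas LPIPS is a distance, which is why the object-removal result is reported as a decrease.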
