EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges

arXiv cs.CV / 4/27/2026


Key Points

  • CLIP’s language-supervised generalization can extend to video action recognition, but prior adaptation methods tend to emphasize temporal modeling and neglect the spatial perception that becomes critical under visual challenges.
  • EV-CLIP is proposed as an efficient few-shot action recognition adaptation framework that uses two types of visual prompts: mask prompts to reweight pixels toward action-relevant regions, and context prompts to compress frame-wise features for lightweight temporal modeling.
  • The work evaluates EV-CLIP on five curated benchmark datasets, analyzing domain shifts to measure how visual and semantic factors affect action recognition.
  • Experiments show EV-CLIP outperforms existing parameter-efficient approaches in overall performance, while its efficiency remains independent of the backbone model scale, making it better suited for resource-constrained deployment.
  • The authors provide an open-source codebase for EV-CLIP at the linked GitHub repository.

Abstract

CLIP has demonstrated strong generalization in visual domains through natural language supervision, even for video action recognition. However, most existing approaches that adapt CLIP for action recognition have primarily focused on temporal modeling, often overlooking spatial perception. In real-world scenarios, visual challenges such as low-light environments or egocentric viewpoints can severely impair spatial understanding, an essential precursor for effective temporal reasoning. To address this limitation, we propose Efficient Visual Prompting for CLIP (EV-CLIP), an efficient adaptation framework designed for few-shot video action recognition across diverse scenes and viewpoints. EV-CLIP introduces two visual prompts: mask prompts, which guide the model's attention to action-relevant regions by reweighting pixels, and context prompts, which perform lightweight temporal modeling by compressing frame-wise features into a compact representation. For a comprehensive evaluation, we curate five benchmark datasets and analyze domain shifts to quantify the influence of diverse visual and semantic factors on action recognition. Experimental results demonstrate that EV-CLIP outperforms existing parameter-efficient methods in overall performance. Moreover, its efficiency remains independent of the backbone scale, making it well-suited for deployment in real-world, resource-constrained scenarios. The code is available at https://github.com/AI-CV-Lab/EV-CLIP.
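
To make the two prompt types concrete, here is a minimal sketch of how pixel-reweighting mask prompts and feature-compressing context prompts could be implemented on top of a frozen CLIP encoder. All module names, shapes, and the exact placement of the prompts (e.g., the use of learnable attention tokens for temporal compression) are illustrative assumptions, not taken from the EV-CLIP paper or its codebase.

```python
# Illustrative sketch only: module names, shapes, and prompt placement are assumptions.
import torch
import torch.nn as nn


class MaskPrompt(nn.Module):
    """Hypothetical mask prompt: a learnable per-pixel weight map that reweights
    input frames toward action-relevant regions before the frozen CLIP encoder."""

    def __init__(self, height: int = 224, width: int = 224):
        super().__init__()
        # Learnable logits over spatial locations, shared across frames and channels.
        self.mask_logits = nn.Parameter(torch.zeros(1, 1, height, width))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch * num_frames, channels, height, width)
        weights = torch.sigmoid(self.mask_logits)  # per-pixel weights in (0, 1)
        return frames * weights  # element-wise pixel reweighting


class ContextPrompt(nn.Module):
    """Hypothetical context prompt: learnable tokens that attend over frame-wise
    CLIP features and compress them into a compact clip-level representation."""

    def __init__(self, feature_dim: int = 512, num_tokens: int = 4, num_heads: int = 8):
        super().__init__()
        self.context_tokens = nn.Parameter(torch.randn(num_tokens, feature_dim) * 0.02)
        self.attn = nn.MultiheadAttention(feature_dim, num_heads, batch_first=True)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, feature_dim) from the frozen image encoder
        batch_size = frame_features.size(0)
        queries = self.context_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        # Context tokens query the frame features; only the prompts are trained.
        compressed, _ = self.attn(queries, frame_features, frame_features)
        return compressed.mean(dim=1)  # (batch, feature_dim) clip-level embedding
```

In a parameter-efficient setup of this kind, the CLIP backbone stays frozen and only the prompt parameters are optimized, which is consistent with the paper's claim that efficiency does not depend on backbone scale; the specific training recipe above is a sketch, not the authors' implementation.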