EV-CLIP: Efficient Visual Prompt Adaptation for CLIP in Few-shot Action Recognition under Visual Challenges
arXiv cs.CV / 4/27/2026
Key Points
- CLIP’s language-supervised generalization can extend to video action recognition, but prior adaptation methods tend to emphasize temporal modeling and neglect the spatial perception that is critical under visual challenges.
- EV-CLIP is proposed as an efficient adaptation framework for few-shot action recognition that uses two types of visual prompts: mask prompts, which reweight pixels toward action-relevant regions, and context prompts, which compress frame-wise features for lightweight temporal modeling (see the sketch after this list).
- The work evaluates EV-CLIP on five curated benchmark datasets, analyzing domain shifts to measure how visual and semantic factors affect action recognition.
- Experiments show EV-CLIP outperforms existing parameter-efficient approaches overall, with an adaptation cost that stays independent of backbone scale, improving its suitability for resource-constrained deployment.
- The authors provide an open-source codebase for EV-CLIP at the linked GitHub repository.
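The summary does not spell out how the two prompt types are implemented, but the mechanism can be illustrated. Below is a minimal PyTorch sketch assuming a frozen CLIP visual backbone; the module names (MaskPrompt, ContextPrompt), dimensions, and the attention-based temporal layer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskPrompt(nn.Module):
    """Learnable per-pixel weights applied to input frames, steering the
    encoder toward action-relevant regions. (Hypothetical sketch.)"""
    def __init__(self, height: int = 224, width: int = 224):
        super().__init__()
        # One learnable logit per pixel, shared across frames and channels.
        self.mask_logits = nn.Parameter(torch.zeros(1, 1, height, width))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch * num_frames, 3, H, W)
        weights = torch.sigmoid(self.mask_logits)  # values in (0, 1)
        return frames * weights                    # element-wise reweighting


class ContextPrompt(nn.Module):
    """Compresses frame-wise CLIP features and mixes them across time with
    a single lightweight attention layer. (Hypothetical sketch.)"""
    def __init__(self, dim: int = 512, ctx_dim: int = 128):
        super().__init__()
        self.down = nn.Linear(dim, ctx_dim)   # compress per-frame features
        self.temporal = nn.MultiheadAttention(ctx_dim, num_heads=4,
                                              batch_first=True)
        self.up = nn.Linear(ctx_dim, dim)     # project back to CLIP space

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim), e.g. from frozen CLIP
        ctx = self.down(frame_feats)            # (B, T, ctx_dim)
        ctx, _ = self.temporal(ctx, ctx, ctx)   # cheap cross-frame mixing
        return self.up(ctx).mean(dim=1)         # pooled video-level feature


# Toy forward pass; random tensors stand in for frames and CLIP features.
frames = torch.randn(2 * 8, 3, 224, 224)
masked = MaskPrompt()(frames)                   # same shape, reweighted pixels
video = ContextPrompt()(torch.randn(2, 8, 512))
print(masked.shape, video.shape)                # (16, 3, 224, 224), (2, 512)
```

In a design like this, the trainable cost is set by the prompt and context dimensions rather than by the depth of the frozen backbone, which is in the spirit of the paper's claim that efficiency does not depend on backbone scale.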