Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection
arXiv cs.CV / 3/26/2026
Key Points
- The paper investigates why large vision-language models (LVLMs) perform well on many zero-shot generative tasks yet lag behind CLIP-based approaches on image classification, despite using CLIP-pretrained vision encoders.
- It argues that CLIP's bias toward class-name matching differs from the joint visual-text reasoning LVLMs perform, and shows that prompt conditioning at inference time can improve LVLMs' class separability.
- The authors propose Head Ensemble Classifiers (HEC), a training-free method that selects and ensembles the most discriminative attention heads from both vision and text components.
- HEC is inspired by Gaussian Discriminant Analysis and is designed to close the performance gap between CLIP-based and LVLM-based classification methods.
- Experiments reportedly achieve state-of-the-art zero-shot and few-shot image classification results across 12 datasets.
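The core idea behind the head-selection step can be illustrated with a small sketch. The snippet below is a hypothetical illustration, not the paper's actual HEC implementation: it scores each attention head's features with a Fisher-style discriminability criterion (between-class variance over within-class variance, in the spirit of Gaussian Discriminant Analysis), keeps the top-scoring heads, and averages their similarity logits against per-head class prototypes. The function names, the Fisher-ratio scoring, and the prototype-matching classifier are all assumptions for illustration.

```python
import numpy as np

def head_discriminability(feats, labels):
    """Fisher-style score for one head: between-class variance
    divided by within-class variance of its features.
    feats: (n_samples, dim) array of features from a single attention head."""
    mu = feats.mean(axis=0)
    between, within = 0.0, 0.0
    for c in np.unique(labels):
        fc = feats[labels == c]
        mu_c = fc.mean(axis=0)
        between += len(fc) * np.sum((mu_c - mu) ** 2)
        within += np.sum((fc - mu_c) ** 2)
    return between / (within + 1e-8)

def select_and_ensemble(head_feats, labels, class_protos, top_k=4):
    """Score every head, keep the top_k most discriminative ones,
    and ensemble their prototype-similarity logits.
    head_feats:   list of (n, d) arrays, one per attention head.
    class_protos: list of (n_classes, d) class-prototype arrays, one per head.
    Returns (n, n_classes) averaged logits from the selected heads."""
    scores = [head_discriminability(f, labels) for f in head_feats]
    best = np.argsort(scores)[::-1][:top_k]  # indices of most discriminative heads
    logits = sum(head_feats[h] @ class_protos[h].T for h in best)
    return logits / top_k
```

In this sketch, a head whose features cluster tightly by class scores high and dominates the ensemble, while noisy heads are dropped entirely; the actual paper's selection and ensembling criteria may differ.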