Leveraging Vision-Language Models as Weak Annotators in Active Learning

arXiv cs.CV / 5/4/2026


Key Points

  • The paper explores using vision-language models (VLMs) to reduce human annotation cost in active learning by generating weak labels instead of fully labeling every sample.
  • It finds that VLM reliability depends strongly on label granularity in fine-grained recognition: VLMs struggle with fine-grained labels but can produce accurate coarse-grained labels.
  • The authors propose an active learning framework that assigns labels instance-wise, combining a limited budget of fine-grained human annotations with coarse-grained VLM-generated weak labels (see the sketch after this list).
  • They also account for systematic noise in the VLM-generated labels by calibrating with a small set of trusted full (human) labels.
  • Experiments on CUB200 and FGVC-Aircraft show the approach consistently beats prior active learning methods using the same annotation budget.
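
To make the instance-wise assignment concrete, here is a minimal sketch of how such a split might work: spend the human budget on the most informative samples and fall back to the VLM elsewhere. All names (`acquisition_score`, `human_fine_label`, `vlm_coarse_label`) are hypothetical placeholders, and the paper's actual selection criterion may differ.

```python
import numpy as np

def assign_labels(pool, budget, acquisition_score,
                  human_fine_label, vlm_coarse_label):
    """Instance-wise label assignment (illustrative sketch).

    Spends the limited human budget on the most informative samples
    (fine-grained labels) and falls back to cheap VLM weak labels
    (coarse-grained) for the rest.
    """
    # Rank unlabeled samples by an acquisition score (e.g., predictive entropy).
    scores = np.array([acquisition_score(x) for x in pool])
    order = np.argsort(-scores)  # most informative first

    fine_labels, coarse_labels = {}, {}
    for rank, idx in enumerate(order):
        if rank < budget:
            # Costly but precise: ask a human annotator for the fine-grained class.
            fine_labels[idx] = human_fine_label(pool[idx])
        else:
            # Cheap but weak: ask the VLM for a coarse-grained class.
            coarse_labels[idx] = vlm_coarse_label(pool[idx])
    return fine_labels, coarse_labels
```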

Abstract

Active learning aims to reduce annotation cost by selectively querying informative samples for supervision under a limited labeling budget. In this work, we investigate how vision-language models (VLMs) can be leveraged to further reduce reliance on costly human annotation within the active learning paradigm. We find that the reliability of VLMs varies significantly with label granularity in fine-grained recognition tasks: they perform poorly on fine-grained labels but can provide accurate coarse-grained labels. Leveraging this property, we propose an active learning framework that combines fine-grained human annotations with coarse-grained VLM-generated weak labels through instance-wise label assignment. We further model the systematic noise in VLM-generated labels using a small set of trusted full (human) labels. Experiments on CUB200 and FGVC-Aircraft show that the proposed framework consistently outperforms existing active learning methods under the same annotation budget.
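
The abstract does not spell out how the systematic noise is modeled, but a common recipe given a small trusted set is to estimate a class-transition (confusion) matrix for the VLM's weak labels and apply forward loss correction during training. The sketch below illustrates that standard recipe under those assumptions, not necessarily the authors' exact formulation; all names are hypothetical.

```python
import numpy as np

def estimate_transition_matrix(trusted_fine, vlm_coarse, num_fine, num_coarse):
    """Estimate T[i, j] = P(VLM coarse label = j | true fine label = i)
    from a small trusted set carrying both label types."""
    T = np.zeros((num_fine, num_coarse))
    for y_fine, y_coarse in zip(trusted_fine, vlm_coarse):
        T[y_fine, y_coarse] += 1
    row_sums = T.sum(axis=1, keepdims=True)
    # Row-normalize; fall back to a uniform row for fine classes never observed.
    uniform = np.full_like(T, 1.0 / num_coarse)
    return np.where(row_sums > 0, T / np.clip(row_sums, 1, None), uniform)

def forward_corrected_probs(fine_probs, T):
    """Forward correction: map the model's fine-class probabilities to a
    distribution over the VLM's coarse labels, so cross-entropy against
    weak labels remains consistent with the underlying fine classes."""
    return fine_probs @ T
```

For weakly labeled samples, training would then minimize cross-entropy between `forward_corrected_probs(model(x), T)` and the VLM's coarse label, while human-annotated samples keep the ordinary fine-grained loss.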