Prototype-Based Test-Time Adaptation of Vision-Language Models

arXiv cs.CV / 4/24/2026


Key Points

  • The paper proposes Prototype-Based Test-Time Adaptation (PTA) for vision-language models to reduce the distribution gap between pre-training and test data during inference.
  • PTA is backpropagation-free but avoids cache-based designs, instead maintaining class-specific knowledge prototypes that are updated by accumulating information from test samples.
  • It adaptively weights prototype updates using each test sample’s zero-shot class confidence and incorporates the sample’s visual features into the corresponding class prototype.
  • By integrating past test knowledge only into prototypes, PTA eliminates cache population and retrieval overhead, improving efficiency and scalability as the number of classes grows.
  • Experiments report state-of-the-art results across 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks while retaining about 92% of CLIP's inference speed on ImageNet-1K, outperforming the cache-based TDA (67.97% accuracy at 50% of CLIP's speed) in both accuracy and speed.
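The confidence-weighted prototype update described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the class name `PrototypeTTA`, the softmax temperature of 100, and the running weighted-average update are assumptions chosen to mirror common CLIP-style zero-shot pipelines.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class PrototypeTTA:
    """Illustrative sketch: class prototypes updated by confidence-weighted
    accumulation of test-sample visual features (not the paper's exact rule)."""

    def __init__(self, text_features):
        # text_features: (C, D) L2-normalized class text embeddings (CLIP-style)
        self.text = text_features
        num_classes, dim = text_features.shape
        self.protos = np.zeros((num_classes, dim))   # one prototype per class
        self.weight = np.zeros(num_classes)          # accumulated confidence mass

    def update(self, img_feat):
        # Zero-shot class confidence from cosine similarity to text embeddings
        # (the factor 100 plays the role of CLIP's logit scale; assumed here).
        probs = softmax(100.0 * self.text @ img_feat)
        c = int(probs.argmax())
        w = probs[c]  # confidence of the predicted class
        # Fold the sample's visual feature into that class's prototype as a
        # confidence-weighted running average -- no cache of past samples needed.
        self.protos[c] = (self.weight[c] * self.protos[c] + w * img_feat) / (self.weight[c] + w)
        self.weight[c] += w
        return c, w
```

Because each test sample is absorbed into a fixed-size prototype table, per-sample cost stays constant regardless of how many samples have been seen, which is the source of the efficiency claim.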

Abstract

Test-time adaptation (TTA) has emerged as a promising paradigm for vision-language models (VLMs) to bridge the distribution gap between pre-training and test data. Recent works have focused on backpropagation-free TTA methods that rely on cache-based designs, but these introduce two key limitations. First, inference latency increases as the cache grows with the number of classes, leading to inefficiencies in large-scale settings. Second, suboptimal performance occurs when the cache contains insufficient or incorrect samples. In this paper, we present Prototype-Based Test-Time Adaptation (PTA), an efficient and effective TTA paradigm that uses a set of class-specific knowledge prototypes to accumulate knowledge from test samples. In particular, knowledge prototypes are adaptively weighted based on the zero-shot class confidence of each test sample, incorporating the sample's visual features into the corresponding class-specific prototype. It is worth highlighting that the knowledge from past test samples is integrated and utilized solely in the prototypes, eliminating the overhead of cache population and retrieval that hinders the efficiency of existing TTA methods. This endows PTA with extremely high efficiency while achieving state-of-the-art performance on 15 image recognition benchmarks and 4 robust point cloud analysis benchmarks. For example, PTA improves CLIP's accuracy from 65.64% to 69.38% on 10 cross-domain benchmarks, while retaining 92% of CLIP's inference speed on large-scale ImageNet-1K. In contrast, the cache-based TDA achieves a lower accuracy of 67.97% and operates at only 50% of CLIP's inference speed.
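The abstract's efficiency argument rests on inference touching only a fixed prototype table rather than a growing cache. A hypothetical fusion of zero-shot and prototype logits could look like the sketch below; the function name, the `alpha` blending weight, and the additive combination are assumptions for illustration, not the paper's exact decision rule.

```python
import numpy as np

def predict_with_prototypes(img_feat, text_feats, protos, alpha=0.5):
    """Hypothetical decision rule: blend zero-shot text similarities with
    similarities to accumulated class prototypes (illustrative only)."""
    # Zero-shot logits: cosine similarity to the (C, D) text embeddings
    zs_logits = text_feats @ img_feat
    # Cosine similarity to prototypes; classes with an empty (all-zero)
    # prototype contribute nothing to the blended score.
    norms = np.linalg.norm(protos, axis=1)
    proto_logits = np.where(norms > 0, protos @ img_feat / np.maximum(norms, 1e-12), 0.0)
    # Fixed-size matrix products: cost is O(C * D) per sample, independent of
    # how many test samples have been seen -- no cache retrieval step.
    return int(np.argmax(zs_logits + alpha * proto_logits))
```

The key contrast with a cache-based design such as TDA is that both terms are computed from fixed-size arrays, so the per-sample cost does not grow as test samples accumulate.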