Dual-Modality Anchor-Guided Filtering for Test-time Prompt Tuning

arXiv cs.CV · April 15, 2026


Key Points

  • The paper proposes a dual-modality anchor-guided filtering method for test-time prompt tuning in vision-language models, aiming to select informative augmented views more reliably than entropy-only approaches.
  • It introduces a text anchor using attribute-rich class descriptions for fine-grained semantic grounding, alongside an adaptive image anchor that reflects evolving test-time statistics.
  • View filtering combines alignment with the anchors and confidence measures, mitigating the miscalibration under distribution shift that leads models to overvalue irrelevant crops or background regions.
  • The anchors are also used as auxiliary predictive heads, and their outputs are ensembled with confidence weighting to provide a more stable supervision signal for updating prompts.
  • Experiments across 15 benchmark datasets show state-of-the-art performance, suggesting anchor-guided supervision improves the robustness of prompt updates.
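The anchor-guided filtering step in the bullets above can be illustrated with a minimal sketch. All names, the equal weighting of the two anchor alignments, and the top-k selection rule are illustrative assumptions, not the paper's actual implementation; the text anchor is modeled as a set of attribute-rich description embeddings and the image anchor as a single adaptive embedding.

```python
import numpy as np

def cosine_sim(a, b):
    # a: (n, d), b: (m, d) -> (n, m) matrix of cosine similarities
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def update_image_anchor(anchor, new_feat, momentum=0.9):
    # Hypothetical EMA update: lets the image anchor track
    # evolving test-time statistics, as the paper describes.
    return momentum * anchor + (1.0 - momentum) * new_feat

def filter_views(view_feats, text_anchor, image_anchor, keep_ratio=0.3):
    """Score each augmented view by its alignment with the dual
    anchors and keep the top fraction (illustrative sketch only).

    view_feats:   (n, d) features of n augmented views
    text_anchor:  (m, d) embeddings of attribute-rich descriptions
    image_anchor: (d,)   adaptive test-time image anchor
    """
    # Best alignment with any attribute-rich text description.
    s_text = cosine_sim(view_feats, text_anchor).max(axis=1)        # (n,)
    # Alignment with the single adaptive image anchor.
    s_img = cosine_sim(view_feats, image_anchor[None, :]).ravel()   # (n,)
    # Simple average of the two alignment signals (an assumption).
    score = 0.5 * (s_text + s_img)
    k = max(1, int(len(score) * keep_ratio))
    keep = np.argsort(-score)[:k]  # indices of the k best-aligned views
    return keep, score
```

In this sketch, the anchors replace the model's raw entropy as the selection signal, so a high-confidence but semantically irrelevant crop would score low on alignment and be discarded.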

Abstract

Test-Time Prompt Tuning (TPT) adapts vision-language models using augmented views, but its effectiveness is hindered by the challenge of determining which views are beneficial. Standard entropy-based filtering relies on the internal confidence scores of the model, which are often miscalibrated under distribution shift, assigning high confidence to irrelevant crops or background regions while ignoring semantic content. To address this, we propose a dual-modality anchor-guided framework that grounds view selection in semantic evidence. We introduce a text anchor from attribute-rich descriptions to provide fine-grained class semantics, and an adaptive image anchor that captures evolving test-time statistics. Using these anchors, we filter views based on alignment and confidence, ensuring that only informative views guide adaptation. Moreover, we treat the anchors as auxiliary predictive heads and combine their predictions with the original output in a confidence-weighted ensemble, yielding a stable supervision signal for prompt updates. Extensive experiments on 15 benchmark datasets demonstrate new state-of-the-art performance, highlighting the contribution of anchor-guided supervision as a foundation for robust prompt updates.
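The confidence-weighted ensemble mentioned in the abstract might look like the following minimal sketch. The choice of exponentiated negative entropy as the confidence score is an assumption for illustration; the paper's actual weighting scheme may differ.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def confidence_weighted_ensemble(logits_list):
    """Combine class predictions from several heads (the original
    model output plus the anchor-based heads), weighting each by a
    confidence score derived from its prediction entropy.

    logits_list: list of (C,) logit vectors, one per head.
    Illustrative sketch; the exact confidence measure is assumed.
    """
    probs = [softmax(l) for l in logits_list]
    # Confidence = exp(-entropy): near 1 for peaked predictions,
    # smaller for near-uniform (uncertain) ones.
    confs = np.array([
        np.exp(-(-(p * np.log(p + 1e-12)).sum())) for p in probs
    ])
    weights = confs / confs.sum()
    # Weighted average is a valid distribution (weights sum to 1).
    return sum(w * p for w, p in zip(weights, probs))
```

A head that predicts confidently thus contributes more to the ensembled distribution, which is then used as the supervision signal for the prompt update.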