Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

arXiv cs.CV / 5/4/2026


Key Points

  • Contrastive vision-language models like CLIP generalize well in zero-shot settings, but prompt tuning is vulnerable to label noise because mislabeled samples produce disproportionately large gradients that can override the model’s pretrained priors.
  • The paper proposes Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free technique that intrinsically suppresses gradients from high-error (likely noisy) samples while preserving useful updates.
  • DSPT works by applying a sequential probabilistic normalization that creates a self-adaptive “saturation zone,” effectively filtering out harmful gradient signals caused by label noise (a minimal sketch follows this list).
  • The authors provide theoretical analysis and empirical evidence of how DSPT achieves adaptive gradient suppression, recasting “gradient vanishing,” traditionally a training bottleneck, as a noise filter.
  • Extensive experiments across multiple noisy benchmarks show DSPT is a simple drop-in design that achieves state-of-the-art robustness, outperforming more complex methods that rely on handcrafted hyperparameters.
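
To make this concrete, here is a minimal PyTorch sketch of the idea, under the assumption that the "sequential probabilistic normalization" simply feeds the class probabilities through a second softmax before computing the cross-entropy. The paper's exact formulation may differ; the function name `double_softmax_ce` and the small epsilon are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def double_softmax_ce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy on sequentially normalized (double-softmax) scores.

    logits : (batch, num_classes) similarity scores, e.g. CLIP image-text logits.
    targets: (batch,) integer class labels, possibly noisy.
    """
    # First normalization: ordinary softmax over classes.
    p = F.softmax(logits, dim=-1)
    # Second normalization: softmax applied to the probabilities themselves.
    # Because p is bounded in [0, 1], this second step flattens the output
    # distribution and saturates the gradient for samples the model gets
    # badly wrong -- the ones most likely to be mislabeled.
    q = F.softmax(p, dim=-1)
    return F.nll_loss(torch.log(q + 1e-12), targets)

# Example: a batch of 2 samples over 3 classes.
loss = double_softmax_ce(torch.randn(2, 3), torch.tensor([0, 2]))
```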

Abstract

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence of how this mechanism achieves adaptive suppression. This design transforms “gradient vanishing,” traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
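
As a quick, hypothetical illustration of the suppression effect described above, the snippet below compares the logit gradient under standard cross-entropy with the double-softmax variant sketched earlier, on a single sample whose (possibly noisy) label strongly disagrees with the model's prediction. The numbers are illustrative only and are not results from the paper.

```python
import torch
import torch.nn.functional as F

# A sample the model is confident about, paired with a (likely noisy) label
# that disagrees with that prediction.
logits = torch.tensor([[8.0, 0.0, 0.0]], requires_grad=True)  # confident in class 0
noisy_target = torch.tensor([2])                               # label says class 2

# Standard cross-entropy: the mislabel produces a near-maximal logit gradient.
ce_loss = F.cross_entropy(logits, noisy_target)
g_ce, = torch.autograd.grad(ce_loss, logits)

# Double-softmax variant (see the sketch above): the second normalization acts
# on bounded inputs, so the same mislabel yields a tiny gradient instead.
q = F.softmax(F.softmax(logits, dim=-1), dim=-1)
ds_loss = F.nll_loss(torch.log(q + 1e-12), noisy_target)
g_ds, = torch.autograd.grad(ds_loss, logits)

print(f"max |grad|, cross-entropy:  {g_ce.abs().max().item():.4f}")   # ~1.0
print(f"max |grad|, double softmax: {g_ds.abs().max().item():.6f}")   # ~0.0006
```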