Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

arXiv cs.CL / 5/4/2026


Key Points

  • The study finds that zero-shot vision-language model (VLM) safety classifiers which use single-prompt first-token probabilities as decision scores are unreliable: semantically equivalent prompt reformulations can materially change the unsafe probability assigned to the same image.
  • Across multiple multimodal safety benchmarks and VLM families, prompt-to-prompt variance correlates strongly with prompt-level disagreement and higher classification error, making cross-prompt variance a practical diagnostic of prompt fragility.
  • A training-free mean ensemble over multiple prompts improves negative log-likelihood (NLL) on all 14 dataset–model pairs and improves expected calibration error (ECE) on 12/14, outperforming several common prompt-calibration or scaling approaches applied to a single prompt.
  • The authors also show that when labels are available, adding labeled calibration on top of mean aggregation provides further benefits, and they recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline for zero-shot VLM safety scoring.
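The core mechanics behind the first two points can be sketched in a few lines: given the unsafe probability a VLM assigns under each of several paraphrased prompts, the mean is the label-free ensemble score and the cross-prompt variance is the fragility diagnostic. This is an illustrative sketch, not the authors' code; the prompt scores below are made-up numbers.

```python
import statistics

def aggregate_prompt_scores(prompt_probs):
    """Aggregate per-prompt P(unsafe) values for one image.

    prompt_probs: one unsafe probability per semantically
    equivalent prompt reformulation.
    Returns (mean_score, cross_prompt_variance): the mean is the
    training-free ensemble score; high variance flags samples
    where any single-prompt score is likely unreliable.
    """
    mean_score = statistics.fmean(prompt_probs)
    variance = statistics.pvariance(prompt_probs)
    return mean_score, variance

# Hypothetical scores from five equivalent prompts on the same image:
stable = [0.91, 0.93, 0.90, 0.92, 0.94]   # prompts agree
fragile = [0.15, 0.85, 0.40, 0.72, 0.30]  # prompts disagree

print(aggregate_prompt_scores(stable))    # low variance: score trustworthy
print(aggregate_prompt_scores(fragile))   # high variance: prompt-fragile sample
```

In the fragile case the ensemble mean is still usable as a score, but the large variance signals exactly the prompt-to-prompt disagreement the paper associates with higher classification error.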

Abstract

Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
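The abstract evaluates scores by negative log-likelihood and expected calibration error. As a reference for readers unfamiliar with these metrics, here is a minimal sketch of both for binary unsafe probabilities; the ECE variant shown bins the raw unsafe probability against the empirical positive rate per bin (one common binary formulation), and the toy numbers are illustrative, not from the paper.

```python
import math

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary unsafe probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins.

    Each bin's mean predicted P(unsafe) is compared with its
    empirical fraction of unsafe labels; gaps are weighted by
    bin occupancy.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        total += (len(b) / len(probs)) * abs(conf - acc)
    return total

# Toy scores: well-ranked but slightly under-confident.
probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
print(nll(probs, labels))  # lower is better
print(ece(probs, labels))  # lower is better
```

Under this setup, the paper's mean-ensemble result amounts to the claim that averaging per-prompt probabilities before computing these metrics lowers NLL on every dataset-model pair and ECE on most, relative to a single selected prompt.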