Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification

arXiv cs.CL / 5/4/2026


Key Points

  • The study finds that zero-shot vision-language model (VLM) safety classifiers which use single-prompt first-token probabilities as decision scores are unreliable: semantically equivalent prompt reformulations can materially change the unsafe probability assigned to the same image.
  • Across multiple multimodal safety benchmarks and VLM families, prompt-to-prompt variance correlates strongly with prompt-level disagreement and higher classification error, making cross-prompt variance a practical diagnostic of prompt fragility.
  • A training-free mean ensemble over multiple prompts improves negative log-likelihood (NLL) on all 14 dataset–model pairs and improves expected calibration error (ECE) on 12/14, outperforming several common prompt-calibration or scaling approaches applied to a single prompt.
  • The authors also show that when labels are available, adding labeled calibration on top of mean aggregation provides further benefits, and they recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline for zero-shot VLM safety scoring.
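The core mechanics behind the first two points can be sketched in a few lines: given the unsafe probability a VLM assigns under each of several paraphrased prompts, the mean is the label-free ensemble score and the cross-prompt variance is the fragility diagnostic. This is an illustrative sketch, not the authors' code; the prompt scores below are made-up numbers.

```python
import statistics

def aggregate_prompt_scores(prompt_probs):
    """Aggregate per-prompt P(unsafe) values for one image.

    prompt_probs: one unsafe probability per semantically
    equivalent prompt reformulation.
    Returns (mean_score, cross_prompt_variance): the mean is the
    training-free ensemble score; high variance flags samples
    where any single-prompt score is likely unreliable.
    """
    mean_score = statistics.fmean(prompt_probs)
    variance = statistics.pvariance(prompt_probs)
    return mean_score, variance

# Hypothetical scores from five equivalent prompts on the same image:
stable = [0.91, 0.93, 0.90, 0.92, 0.94]   # prompts agree
fragile = [0.15, 0.85, 0.40, 0.72, 0.30]  # prompts disagree

print(aggregate_prompt_scores(stable))    # low variance: score trustworthy
print(aggregate_prompt_scores(fragile))   # high variance: prompt-fragile sample
```

In the fragile case the ensemble mean is still usable as a score, but the large variance signals exactly the prompt-to-prompt disagreement the paper associates with higher classification error.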

Abstract

Single-prompt first-token probabilities from zero-shot vision-language model (VLM) safety classifiers are treated as decision scores, but we show they are unreliable under semantically equivalent prompt reformulation: even when the binary label is constrained to a fixed output position, equivalent prompts can induce materially different unsafe probabilities for the same sample. Across multimodal safety benchmarks and multiple VLM families, cross-prompt variance is strongly associated with prompt-level disagreement and higher error, making it a useful fragility diagnostic. A training-free mean ensemble improves NLL on all 14 dataset-model evaluation pairs and ECE on 12/14 relative to a train-selected single-prompt baseline, and wins more head-to-head NLL comparisons than labeled temperature scaling, Platt scaling, and isotonic regression applied to the same prompt. Ranking gains are consistent against the train-selected baseline on both AUROC and AUPRC, and against the full 15-prompt distribution remain consistent on AUPRC while softening on AUROC. Labeled calibration on top of the mean provides further gains when labels are available, identifying prompt averaging as a strong label-free first stage rather than a replacement for calibration. We frame this as a reliability stress test for zero-shot VLM first-token safety scores and recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline.
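The abstract evaluates scores by negative log-likelihood and expected calibration error. As a reference for readers unfamiliar with these metrics, here is a minimal sketch of both for binary unsafe probabilities; the ECE variant shown bins the raw unsafe probability against the empirical positive rate per bin (one common binary formulation), and the toy numbers are illustrative, not from the paper.

```python
import math

def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of binary unsafe probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -math.log(p if y == 1 else 1.0 - p)
    return total / len(probs)

def ece(probs, labels, n_bins=10):
    """Expected calibration error with equal-width probability bins.

    Each bin's mean predicted P(unsafe) is compared with its
    empirical fraction of unsafe labels; gaps are weighted by
    bin occupancy.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        total += (len(b) / len(probs)) * abs(conf - acc)
    return total

# Toy scores: well-ranked but slightly under-confident.
probs = [0.9, 0.8, 0.2, 0.1]
labels = [1, 1, 0, 0]
print(nll(probs, labels))  # lower is better
print(ece(probs, labels))  # lower is better
```

Under this setup, the paper's mean-ensemble result amounts to the claim that averaging per-prompt probabilities before computing these metrics lowers NLL on every dataset-model pair and ECE on most, relative to a single selected prompt.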