Prompt-Induced Score Variance in Zero-Shot Binary Vision-Language Safety Classification
arXiv cs.CL / 5/4/2026
Key Points
- The study finds that zero-shot vision-language model (VLM) safety classifiers that use single-prompt first-token probabilities as decision scores are unreliable: semantically equivalent prompt reformulations can shift the predicted unsafe probability for the same image.
- Across multiple multimodal safety benchmarks and VLM families, prompt-to-prompt variance correlates strongly with prompt-level disagreement and higher classification error, making cross-prompt variance a practical diagnostic of prompt fragility.
- A training-free mean ensemble over multiple prompts improves negative log-likelihood (NLL) on all 14 dataset–model pairs and improves expected calibration error (ECE) on 12/14, outperforming several common prompt-calibration or scaling approaches applied to a single prompt.
- The authors also show that when labels are available, adding labeled calibration on top of mean aggregation provides further benefits, and they recommend prompt-family evaluation with mean aggregation as a standard label-free reliability baseline for zero-shot VLM safety scoring.
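The mean-aggregation idea in the points above can be sketched in a few lines. This is a hedged illustration, not the paper's code: `scores` stands in for per-prompt P(unsafe) values that would come from a VLM's first-token probabilities under a family of semantically equivalent prompts, and the variance function mirrors the cross-prompt fragility diagnostic the study describes.

```python
def mean_ensemble(per_prompt_probs):
    """Training-free mean aggregation: average P(unsafe) across a prompt family."""
    return sum(per_prompt_probs) / len(per_prompt_probs)

def prompt_variance(per_prompt_probs):
    """Cross-prompt variance: a label-free diagnostic of prompt fragility."""
    m = mean_ensemble(per_prompt_probs)
    return sum((p - m) ** 2 for p in per_prompt_probs) / len(per_prompt_probs)

# Hypothetical per-prompt P(unsafe) scores for one image, e.g. the model's
# first-token probability of "unsafe" under four reworded safety prompts.
scores = [0.82, 0.55, 0.91, 0.60]

score = mean_ensemble(scores)   # ensemble decision score (0.72 here)
var = prompt_variance(scores)   # high variance flags a fragile image/prompt pair
```

In this sketch a single fragile prompt (e.g. the 0.55 outlier) no longer determines the decision on its own, which is the intuition behind why mean aggregation improves NLL and ECE in the study.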