LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
arXiv cs.AI / 4/22/2026
💬 Opinion · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper introduces Ghost-100, a new benchmark (800 synthetic images across eight categories and three vision-language task families) designed to study how hallucinations change under progressively coercive prompt tone.
- It uses a 5-Level Prompt Intensity Framework that keeps the image and task fixed while varying only directive force, allowing “tone” to be isolated as the key independent variable.
- The authors evaluate models with a dual-track approach: a rule-based H-Rate metric that captures when systems shift from grounded refusal to unsupported positive claims, and a GPT-4o-mini-judged H-Score (1–5) that quantifies the confidence and specificity of the fabrication.
- A three-stage automated validation process verifies 717 of the 800 images as strictly compliant with the negative-ground-truth design, and results show pronounced metric differences across model families, with tone sensitivity sometimes non-monotonic and peaking at intermediate intensity levels.
- Testing nine open-weight VLMs reveals that hallucination incidence and intensity can diverge, and that reading-style and presence-detection subsets respond differently to prompt pressure, a divergence that aggregate metrics can hide.
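The dual-track idea above can be sketched in a few lines: a rule-based incidence metric (H-Rate, the fraction of responses that make an unsupported positive claim instead of a grounded refusal) computed alongside a judge-assigned intensity metric (H-Score, 1–5), aggregated per tone level. This is a minimal illustrative sketch, not the paper's implementation; all field names and the toy data are assumptions.

```python
# Hypothetical sketch of the dual-track evaluation described above.
# H-Rate: fraction of responses flagged as unsupported positive claims
#         (rule-based), computed per prompt-tone level.
# H-Score: mean 1-5 LLM-judge rating of fabrication confidence/specificity.
# Field names ('tone', 'hallucinated', 'judge_score') are illustrative.

from collections import defaultdict

def evaluate(responses):
    """responses: list of dicts with keys
    'tone' (1-5 prompt intensity), 'hallucinated' (bool, rule-based),
    'judge_score' (int, 1-5 from an LLM judge)."""
    by_tone = defaultdict(list)
    for r in responses:
        by_tone[r["tone"]].append(r)
    report = {}
    for tone, items in sorted(by_tone.items()):
        h_rate = sum(r["hallucinated"] for r in items) / len(items)
        h_score = sum(r["judge_score"] for r in items) / len(items)
        report[tone] = {"H-Rate": round(h_rate, 3),
                        "H-Score": round(h_score, 2)}
    return report

# Toy data: hallucination rises with tone level in this made-up sample.
sample = [
    {"tone": 1, "hallucinated": False, "judge_score": 1},
    {"tone": 1, "hallucinated": True,  "judge_score": 3},
    {"tone": 3, "hallucinated": True,  "judge_score": 4},
    {"tone": 3, "hallucinated": True,  "judge_score": 5},
]
print(evaluate(sample))
# → {1: {'H-Rate': 0.5, 'H-Score': 2.0}, 3: {'H-Rate': 1.0, 'H-Score': 4.5}}
```

Keeping the two tracks separate is what lets incidence and intensity diverge, as the paper reports: a model can hallucinate rarely but confidently (low H-Rate, high H-Score) or often but vaguely.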