Rethinking Ground Truth: A Case Study on Human Label Variation in MLLM Benchmarking
arXiv cs.CL / 3/23/2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The paper introduces an evaluation protocol for multimodal LLM (MLLM) benchmarking that explicitly accounts for human label variation (HLV), i.e., genuine agreement and disagreement among annotators (see the sketch after this list).
- It applies this protocol to two state-of-the-art MLLM families (Gemma 3 and Qwen 2.5 VL) using non-aggregated human annotations from a social media content classification dataset.
- The findings show that larger models tend to excel on high-agreement subsets but can underperform medium-sized models on high-disagreement subsets, indicating that a model's sensitivity to ambiguity is not determined by parameter count alone.
- The authors argue that benchmarks based only on consensus labels can overstate model capabilities in content moderation, and that incorporating human label variation yields more realistic, robust assessments of MLLMs deployed in real-world moderation pipelines.
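A minimal sketch of what an agreement-stratified evaluation like this could look like, assuming access to non-aggregated per-item annotator labels and model predictions. The `items` schema, the `agreement` helper, and the 0.75 threshold are illustrative assumptions for this sketch, not the paper's actual protocol, which may use richer annotation distributions or different stratification:

```python
from collections import Counter

# Hypothetical items: each carries non-aggregated annotator labels and a
# model prediction. Field names are illustrative, not the paper's schema.
items = [
    {"annotations": ["hate", "hate", "hate"], "prediction": "hate"},
    {"annotations": ["hate", "neutral", "hate"], "prediction": "neutral"},
    {"annotations": ["neutral", "hate", "offensive"], "prediction": "hate"},
]

def agreement(labels):
    """Fraction of annotators who chose the majority label (1.0 = unanimous)."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def accuracy_by_agreement(items, threshold=0.75):
    """Score predictions against the majority label, stratified into
    high- vs. low-agreement subsets by annotator agreement."""
    buckets = {"high": [], "low": []}
    for item in items:
        majority = Counter(item["annotations"]).most_common(1)[0][0]
        correct = item["prediction"] == majority
        key = "high" if agreement(item["annotations"]) >= threshold else "low"
        buckets[key].append(correct)
    # Report per-subset accuracy; None if a subset is empty.
    return {k: sum(v) / len(v) if v else None for k, v in buckets.items()}

print(accuracy_by_agreement(items))  # e.g. {'high': 1.0, 'low': 0.0}
```

Reporting the two subset scores separately, rather than one consensus-label accuracy, is what surfaces the paper's headline effect: a model can look strong overall while degrading sharply on items where human annotators themselves disagree.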
Related Articles
- Santa Augmentcode Intent Ep.6 (Dev.to)
- Your Agent Hired Another Agent. The Output Was Garbage. The Money's Gone. (Dev.to)
- Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption. (Dev.to)
- Palantir’s billionaire CEO says only two kinds of people will succeed in the AI era: trade workers — ‘or you’re neurodivergent’ (Reddit r/artificial)
- Scaffolded Test-First Prompting: Get Correct Code From the First Run (Dev.to)