Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

arXiv cs.CV / 3/19/2026



Key Points

The paper shows that ensembling Vision-Language Models (VLMs) from the same architectural family yields correlated errors. These reduce ensemble diversity and create a "Misleading" tier of questions where a correlated majority outvotes the correct answer, so the ensemble scores 0% even though the best model is correct.
It introduces three family-aware methods: Hierarchical Family Voting (HFV), which aggregates within families before voting across them; QualRCCV, which weights models by calibration, family quality, and inverse family size; and Learned Candidate Scoring (LCS), which trains a cross-validated classifier to re-rank candidate answers using features such as support breadth, family diversity, and model quality.
HFV recovers 18-26 percentage points on the Misleading tier, and QualRCCV beats calibrated voting on all three benchmarks (p<0.05). LCS delivers the largest gains, though they are modest in absolute terms (+0.68% VQAv2, +0.61% TextVQA, +2.45% GQA), and it never degrades any benchmark.
On the VQAv2 test-standard split (evaluated via EvalAI), LCS reaches 87.83% with 12 models, which suggests the gains generalize to held-out data.
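To make the difference between flat majority voting and family-level voting concrete, here is a minimal Python sketch of the two-stage idea behind HFV. It is an illustration under stated assumptions, not the paper's implementation: the function names, the plurality rule at both stages, the tie-breaking behavior, and the toy model and family names are all hypothetical.

```python
from collections import Counter


def plurality(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def flat_vote(predictions):
    """Standard majority vote: every model counts as one voter."""
    return plurality(list(predictions.values()))


def hierarchical_family_vote(predictions, family_of):
    """Two-stage vote: one answer per family, then a vote across families.

    predictions: dict mapping model name -> answer string
    family_of:   dict mapping model name -> family name (assumed known)
    """
    by_family = {}
    for model, answer in predictions.items():
        by_family.setdefault(family_of[model], []).append(answer)
    # Stage 1: each family collapses to a single answer.
    family_answers = [plurality(answers) for answers in by_family.values()]
    # Stage 2: families, not individual models, cast the final votes.
    return plurality(family_answers)


# Toy case: three models from one family share the same wrong answer.
predictions = {
    "a1": "cat", "a2": "cat", "a3": "cat",  # family A (correlated error)
    "b1": "dog",                            # family B
    "c1": "dog",                            # family C
}
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}

print(flat_vote(predictions))                            # cat
print(hierarchical_family_vote(predictions, family_of))  # dog
```

In this toy case the three same-family models outvote the rest under flat voting, while the family-level vote counts them once, which is the kind of situation the Misleading tier describes.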

Abstract

Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors drive ensemble accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method that weights models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality. It achieves the largest gains (+0.68% VQAv2, +0.61% TextVQA, +2.45% GQA, all significant) and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
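The abstract says QualRCCV weights models by calibration, family quality, and inverse family size, but it does not give the formula. The Python sketch below shows one plausible way to combine those three factors, as a simple product. The product form, the function name, and all numeric values are assumptions made for illustration, not the paper's definition.

```python
from collections import defaultdict


def family_aware_weighted_vote(predictions, family_of, calibration, family_quality):
    """Weighted vote with weight = calibration * family quality / family size.

    predictions:    dict model -> answer
    family_of:      dict model -> family name
    calibration:    dict model -> score in [0, 1] (hypothetical values)
    family_quality: dict family -> score in [0, 1] (hypothetical values)
    """
    # Count how many voting models each family contributes.
    family_size = defaultdict(int)
    for model in predictions:
        family_size[family_of[model]] += 1

    # Accumulate weighted support per candidate answer.
    scores = defaultdict(float)
    for model, answer in predictions.items():
        fam = family_of[model]
        weight = calibration[model] * family_quality[fam] / family_size[fam]
        scores[answer] += weight

    # The answer with the highest total weight wins.
    return max(scores, key=scores.get)


predictions = {"a1": "cat", "a2": "cat", "a3": "cat", "b1": "dog", "c1": "dog"}
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
calibration = {m: 0.8 for m in predictions}     # equal calibration, for clarity
family_quality = {"A": 1.0, "B": 1.0, "C": 1.0}  # equal quality, for clarity

print(family_aware_weighted_vote(predictions, family_of, calibration, family_quality))
```

With equal calibration and quality, dividing by family size means the three family-A models together carry about as much weight as a single model from a one-member family, so "dog" wins here despite having fewer raw votes.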
