Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
arXiv cs.CV / 3/19/2026
Key Points
- The paper shows that ensembling Vision-Language Models from the same architectural family yields correlated errors, reducing ensemble diversity and creating a "Misleading" tier of questions where correlated majority errors can drive the ensemble answer to 0% accuracy even when the best individual model is correct.
- It introduces three family-aware methods: Hierarchical Family Voting (HFV), which aggregates votes within each family before voting across families; QualRCCV, which weights models by calibration, family quality, and inverse family size; and Learned Candidate Scoring (LCS), which trains a cross-validated classifier to re-rank candidate answers using features such as support breadth, family diversity, and model quality.
- HFV recovers 18-26 percentage points on the Misleading tier, QualRCCV beats calibrated voting on all three benchmarks (p<0.05), and LCS delivers the most consistent gains, with modest absolute improvements (+0.68% VQAv2, +0.61% TextVQA, +2.45% GQA) and no degradation on any benchmark.
- On the VQAv2 test-standard split (via EvalAI) with 12 models, LCS reaches 87.83%, indicating strong generalization to held-out data.
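The two-stage idea behind HFV can be sketched in a few lines. This is a minimal illustration of within-family then cross-family majority voting, not the paper's implementation; the family names, tie-breaking, and input format are all assumptions for the example.

```python
from collections import Counter

def hierarchical_family_voting(predictions):
    """Two-stage majority vote: first within each model family,
    then across the per-family winners.

    `predictions` maps a family name to the list of answers from
    that family's models. (Illustrative interface only; the
    paper's exact tie-breaking and weighting are not reproduced.)
    """
    # Stage 1: a majority vote inside each family collapses
    # correlated "clone" votes into a single family answer.
    family_answers = [
        Counter(answers).most_common(1)[0][0]
        for answers in predictions.values()
    ]
    # Stage 2: one vote per family, so a large family of
    # near-duplicate models cannot outvote the rest of the ensemble.
    return Counter(family_answers).most_common(1)[0][0]

# Hypothetical example: three models from one family share a
# correlated error, but two other families back the correct answer.
preds = {
    "family_a": ["cat", "cat", "cat"],
    "family_b": ["dog"],
    "family_c": ["dog"],
}
print(hierarchical_family_voting(preds))  # -> dog
```

A flat majority vote over the same five predictions would return "cat" (3 votes to 2), which is exactly the family-bias failure mode the paper's Misleading tier captures.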