AI Navigate

Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles

arXiv cs.CV / 3/19/2026

Key Points

  • The paper shows that ensembling Vision-Language Models (VLMs) from the same architectural family yields correlated errors. These reduce ensemble diversity and create a "Misleading" tier of questions (1.5-6.5% of the total) where a wrong majority drives ensemble accuracy to 0% even though the best model is correct.
  • It introduces three family-aware methods. Hierarchical Family Voting (HFV) aggregates within each family before voting across families. QualRCCV is a training-free method that weights models by calibration, family quality, and inverse family size. Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using features such as support breadth, family diversity, and model quality.
  • HFV recovers 18-26 percentage points on the Misleading tier, and QualRCCV beats calibrated voting on all three benchmarks (p<0.05). LCS delivers the largest gains (+0.68% on VQAv2, +0.61% on TextVQA, +2.45% on GQA, all significant) and is the only learned method that never degrades any benchmark.
  • On the VQAv2 test-standard split (evaluated via EvalAI), LCS with 12 models reaches 87.83%, which the authors take as evidence of generalization to held-out data.
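
The difference between flat majority voting and the within-family-then-across-family idea behind HFV can be illustrated with a minimal Python sketch. This is not the paper's implementation: the model names, family labels, answers, and the tie-breaking rule (first answer seen wins) are all assumptions made for illustration.

```python
from collections import Counter

def plurality(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def flat_vote(predictions):
    """Standard majority vote: every model counts once."""
    return plurality(list(predictions.values()))

def hierarchical_family_vote(predictions, family_of):
    """Vote inside each family first, then vote across the family answers."""
    by_family = {}
    for model, answer in predictions.items():
        by_family.setdefault(family_of[model], []).append(answer)
    family_answers = [plurality(answers) for answers in by_family.values()]
    return plurality(family_answers)

# Hypothetical question: three models from family "A" share the same wrong
# answer ("red"), while one model each from families "B" and "C" say "blue".
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
predictions = {"a1": "red", "a2": "red", "a3": "red", "b1": "blue", "c1": "blue"}

flat_answer = flat_vote(predictions)                          # "red"
hfv_answer = hierarchical_family_vote(predictions, family_of)  # "blue"
```

In this toy case the flat vote follows the large family (3 votes against 2), whereas the hierarchical vote collapses family "A" to a single vote and the two independent families prevail. That is the kind of correlated-majority failure the Misleading tier describes.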

Abstract

Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors destroy accuracy to 0% despite the best model being correct. We propose three family-aware methods. Hierarchical Family Voting (HFV) aggregates within families before voting across them, recovering +18-26 pp on the Misleading tier. QualRCCV, a training-free method weighting models by calibration, family quality, and inverse family size, is the first to beat calibrated voting on all three benchmarks (p<0.05). Learned Candidate Scoring (LCS) trains a cross-validated classifier to re-rank candidate answers using support breadth, family diversity, and model quality, achieving the largest gains: +0.68% VQAv2, +0.61% TextVQA, +2.45% GQA -- all significant -- and is the only learned method that never degrades any benchmark. On VQAv2 test-standard (EvalAI), LCS reaches 87.83% with 12 models, confirming generalization.
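
The abstract says that QualRCCV weights models by calibration, family quality, and inverse family size, but it does not give the formula. The Python sketch below shows one plausible reading, in which the three factors are simply multiplied. That product form, the function names, and all numeric values are assumptions for illustration, not the authors' definition.

```python
from collections import Counter, defaultdict

def family_aware_weights(calibration, family_quality, family_of):
    """Assumed weight: calibration * family quality / family size."""
    sizes = Counter(family_of.values())
    return {
        model: calibration[model] * family_quality[fam] / sizes[fam]
        for model, fam in family_of.items()
    }

def weighted_vote(predictions, weights):
    """Sum the weights behind each candidate answer and return the heaviest."""
    scores = defaultdict(float)
    for model, answer in predictions.items():
        scores[answer] += weights[model]
    return max(scores, key=scores.get)

# Hypothetical ensemble: family "A" has three members, "B" and "C" one each.
family_of = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
calibration = {model: 0.8 for model in family_of}     # made-up scores
family_quality = {"A": 0.7, "B": 0.7, "C": 0.7}       # made-up scores
predictions = {"a1": "red", "a2": "red", "a3": "red", "b1": "blue", "c1": "blue"}

weights = family_aware_weights(calibration, family_quality, family_of)
answer = weighted_vote(predictions, weights)  # "blue"
```

With equal calibration and quality, the inverse-size term makes the three members of family "A" together carry the same weight as a single model from "B" or "C", so a family cannot win a vote through its size alone.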