Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
arXiv cs.CV / 3/19/2026
Key Points
- The paper shows that ensembling Vision-Language Models from the same architectural family yields correlated errors, reducing ensemble diversity and creating a "Misleading" tier of questions where correlated majority errors can drive the ensemble answer to 0% accuracy even when the best individual model is correct.
- It introduces three family-aware methods: Hierarchical Family Voting (HFV), which aggregates votes within each family before voting across families; QualRCCV, which weights models by calibration, family quality, and inverse family size; and Learned Candidate Scoring (LCS), which trains a cross-validated classifier to re-rank candidate answers using features such as support breadth, family diversity, and model quality.
- HFV recovers 18-26 percentage points on the Misleading tier, QualRCCV beats calibrated voting on all three benchmarks (p<0.05), and LCS delivers the most consistent gains, with modest absolute improvements (+0.68% VQAv2, +0.61% TextVQA, +2.45% GQA) and no degradation on any benchmark.
- On the VQAv2 test-standard split (via EvalAI) with 12 models, LCS reaches 87.83%, indicating strong generalization to held-out data.
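The two-stage idea behind HFV can be sketched in a few lines. This is a minimal illustration of within-family then cross-family majority voting, not the paper's implementation; the family names, tie-breaking, and input format are all assumptions for the example.

```python
from collections import Counter

def hierarchical_family_voting(predictions):
    """Two-stage majority vote: first within each model family,
    then across the per-family winners.

    `predictions` maps a family name to the list of answers from
    that family's models. (Illustrative interface only; the
    paper's exact tie-breaking and weighting are not reproduced.)
    """
    # Stage 1: a majority vote inside each family collapses
    # correlated "clone" votes into a single family answer.
    family_answers = [
        Counter(answers).most_common(1)[0][0]
        for answers in predictions.values()
    ]
    # Stage 2: one vote per family, so a large family of
    # near-duplicate models cannot outvote the rest of the ensemble.
    return Counter(family_answers).most_common(1)[0][0]

# Hypothetical example: three models from one family share a
# correlated error, but two other families back the correct answer.
preds = {
    "family_a": ["cat", "cat", "cat"],
    "family_b": ["dog"],
    "family_c": ["dog"],
}
print(hierarchical_family_voting(preds))  # -> dog
```

A flat majority vote over the same five predictions would return "cat" (3 votes to 2), which is exactly the family-bias failure mode the paper's Misleading tier captures.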