Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning
arXiv cs.CV / 3/16/2026
Key Points
- The paper proposes V3Fusion, a fusion framework that scores focal error diversity with a CKA-based metric to select and fuse outputs from a pool of heterogeneous VLMs for vision-language reasoning.
- It employs a genetic algorithm to prune non-contributing VLMs and identify the best model combination for each task, enabling dynamic capture of epistemic uncertainty and reducing hallucinations.
- On four benchmarks (A-OKVQA, MMMU, MMMU-Pro, OCR-VQA), V3Fusion outperforms the strongest single VLMs, with gains of 8.09% on MMMU and 4.87% on MMMU-Pro, and beats top generative VLMs like InternVL2-8B and Qwen2.5-VL-7B on A-OKVQA and OCR-VQA.
- The authors provide code and datasets on GitHub, enabling replication.
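The CKA-based diversity scoring described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes linear CKA (the standard formulation) over per-model feature matrices collected on the same examples, and the `pairwise_diversity` helper is a hypothetical name for a score a genetic search could then maximize alongside accuracy when pruning the VLM pool.

```python
# Minimal sketch (assumed, not the paper's code): linear CKA between two
# models' feature matrices, and a pairwise 1 - CKA "diversity" matrix.
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between feature matrices X (n, d1) and Y (n, d2)
    computed over the same n examples. Returns a value in [0, 1]."""
    X = X - X.mean(axis=0)  # column-center features
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)

def pairwise_diversity(feats: list[np.ndarray]) -> np.ndarray:
    """1 - CKA for every model pair; higher means more dissimilar
    representations, i.e. more complementary ensemble members."""
    m = len(feats)
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            D[i, j] = D[j, i] = 1.0 - linear_cka(feats[i], feats[j])
    return D
```

A genetic algorithm in the style the summary describes would evaluate candidate VLM subsets with a fitness that trades off task accuracy against this diversity score, keeping high-fitness subsets and mutating the rest.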