Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning
arXiv cs.CV / 3/16/2026
Key Points
- The paper proposes V3Fusion, a fusion framework that uses a CKA-based focal error diversity metric to select and fuse outputs across a pool of heterogeneous VLMs for vision-language reasoning.
- It employs a genetic algorithm to prune non-contributing VLMs and identify the best model combination for each task, dynamically capturing epistemic uncertainty and reducing hallucinations.
- On four benchmarks (A-OKVQA, MMMU, MMMU-Pro, OCR-VQA), V3Fusion outperforms the strongest single VLMs, with gains of 8.09% on MMMU and 4.87% on MMMU-Pro, and beats top generative VLMs such as InternVL2-8B and Qwen2.5-VL-7B on A-OKVQA and OCR-VQA.
- The authors provide code and datasets on GitHub, enabling replication.
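The CKA-based diversity metric in the first bullet builds on linear Centered Kernel Alignment, a standard measure of similarity between two models' representations of the same inputs (low CKA suggests the models err differently and may complement each other in fusion). A minimal sketch of linear CKA; the paper's exact metric may differ, and `linear_cka` and its inputs are illustrative:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x d1) and Y (n x d2),
    where the n rows of each matrix correspond to the same n inputs."""
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den
```

Linear CKA is invariant to isotropic scaling and centering, so a model whose features are a scaled, shifted copy of another's scores 1.0 (no diversity gained by fusing them).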
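The genetic-algorithm pruning in the second bullet can be illustrated with a toy GA over binary masks, where each gene marks whether a VLM stays in the fusion pool. Everything here (`ga_select`, the fitness design, the hyperparameters) is an illustrative assumption, not the paper's implementation:

```python
import random

def ga_select(models, score_fn, pop_size=20, generations=30, mut_rate=0.1, seed=0):
    """Toy genetic algorithm searching over binary masks: each gene says
    whether the corresponding VLM is kept. `score_fn(subset)` stands in
    for validation accuracy of the fused subset."""
    rng = random.Random(seed)
    n = len(models)

    def decode(mask):
        return [m for m, bit in zip(models, mask) if bit]

    def fitness(mask):
        subset = decode(mask)
        return score_fn(subset) if subset else float("-inf")

    # Random initial population of bit masks.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the top half of the population.
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation.
            child = [1 - g if rng.random() < mut_rate else g for g in child]
            children.append(child)
        pop = survivors + children
    best = max(pop, key=fitness)
    return decode(best)
```

With a score function that rewards accurate models and penalizes pool size, the GA converges on the small subset that scores best, which mirrors how pruning non-contributing VLMs can both cut cost and improve fused accuracy.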