Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning
arXiv cs.CV / 4/29/2026
Key Points
- The paper argues that vision-language models often produce fluent answers that are insufficiently grounded in visual evidence, and that instruction prompting can make this worse by amplifying language priors.
- It introduces Instruction-Evidence Contrastive Dual-Stream Decoding (IECD2), which keeps two token-probability streams during generation: an instruction-driven stream for informativeness and an evidence-driven stream for visual faithfulness.
- IECD2 adaptively fuses these streams using a symmetric KL-based contrastive gating mechanism to suppress tokens favored by language priors but not supported by the image.
- Experiments across multiple captioning and visual question answering datasets (e.g., POPE, MME, VQAv2, AMBER, MS-COCO, and LLaVA-Bench) show consistent gains in accuracy and reasoning performance, alongside a large reduction in hallucinations versus state-of-the-art decoding methods.
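The dual-stream fusion described above can be sketched as follows. This is an illustrative reconstruction from the summary, not the paper's actual implementation: the gating function, the logistic mapping from divergence to mixing weight, and the threshold parameter `tau` are all assumptions; the paper's contrastive gate may combine the streams differently.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def symmetric_kl(p, q, eps=1e-12):
    # Symmetric KL divergence: 0.5 * (KL(p||q) + KL(q||p)).
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def fuse_streams(instr_logits, evid_logits, tau=1.0):
    """Blend instruction-driven and evidence-driven token distributions.

    The more the two streams disagree (measured by symmetric KL), the
    more weight shifts toward the evidence stream, down-weighting tokens
    the language prior favors but the image does not support.
    `tau` (hypothetical) sets where the gate starts favoring evidence.
    """
    p_instr = softmax(instr_logits)
    p_evid = softmax(evid_logits)
    d = symmetric_kl(p_instr, p_evid)
    alpha = 1.0 / (1.0 + np.exp(-(d - tau)))  # gate in (0, 1)
    fused = (1 - alpha) * p_instr + alpha * p_evid
    return fused / fused.sum()
```

When the two streams agree (divergence near zero), the gate stays low and the fused distribution tracks the instruction stream's informativeness; when they diverge sharply, the evidence stream dominates and prior-driven tokens are suppressed.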