Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

arXiv cs.CL / 4/28/2026


Key Points

  • The paper argues that stochastic decoding strategies common in LLMs can be suboptimal for Visual Question Answering (VQA), which is typically a closed-ended task with answer distributions that are “head-heavy.”
  • It provides a theoretical framework linking model calibration to predictive accuracy and derives sufficient conditions under which greedy decoding is optimal.
  • Experiments across multiple VQA benchmarks show that greedy decoding outperforms stochastic sampling, supporting the calibration-based argument.
  • The authors introduce “Greedy Decoding for Reasoning Models,” demonstrating improved performance over both stochastic sampling and standard greedy decoding in multimodal reasoning settings.
  • The work cautions against blindly porting LLM decoding heuristics to multimodal LLMs, proposing greedy decoding as an efficient and strong default for VQA.
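The two decoding strategies under comparison can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the answer set and the head-heavy probability distribution below are hypothetical, and real MLLMs decode token-by-token rather than over a fixed answer list.

```python
import math
import random

# Hypothetical closed-ended VQA answer set with a "head-heavy"
# distribution: most probability mass sits on the top answer.
answers = ["yes", "no", "maybe", "unknown"]
probs = [0.70, 0.20, 0.06, 0.04]

def greedy_decode(p):
    """Deterministically pick the modal (highest-probability) answer."""
    return max(range(len(p)), key=lambda i: p[i])

def temperature_sample(p, temperature, rng):
    """Sample an answer from the temperature-scaled distribution.

    T=1 reproduces the model's distribution; T->0 approaches greedy.
    """
    logits = [math.log(pi) / temperature for pi in p]
    m = max(logits)                      # subtract max for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return rng.choices(range(len(p)), weights=[w / total for w in weights], k=1)[0]

rng = random.Random(0)
print(answers[greedy_decode(probs)])            # always the mode: "yes"
print(answers[temperature_sample(probs, 1.0, rng)])  # stochastic
```

At temperature 1, sampling returns a non-modal answer roughly 30% of the time on this distribution, which is exactly the failure mode the paper attributes to stochastic decoding on closed-ended tasks.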

Abstract

Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive sufficient conditions for the optimality of greedy decoding. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
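The calibration argument has a simple single-answer intuition (a simplified stand-in for the paper's actual framework): if a model is perfectly calibrated, its predicted distribution p matches the true answer distribution, so greedy decoding is correct with probability max_a p(a), while temperature-1 sampling is correct with probability sum_a p(a)^2 (probability matching). The inequality sum_a p(a)^2 <= max_a p(a) * sum_a p(a) = max_a p(a) then says greedy never does worse. A quick numerical check of that inequality under this assumption:

```python
import random

def greedy_accuracy(p):
    # Under perfect calibration, greedy is correct exactly when the
    # true answer is the mode: expected accuracy = max_a p(a).
    return max(p)

def sampling_accuracy(p):
    # Temperature-1 sampling matches the distribution, so
    # P(correct) = sum_a p(a) * p(a) = sum_a p(a)^2.
    return sum(pa * pa for pa in p)

rng = random.Random(0)
for _ in range(1000):
    # Random distribution over a 5-way closed answer set.
    weights = [rng.random() for _ in range(5)]
    total = sum(weights)
    p = [w / total for w in weights]
    # Greedy's expected accuracy dominates sampling's for every p.
    assert greedy_accuracy(p) >= sampling_accuracy(p)
```

The gap is largest exactly in the head-heavy regime the abstract describes: for p = (0.7, 0.2, 0.06, 0.04), greedy scores 0.70 versus about 0.54 for sampling, whereas for a uniform distribution the two coincide.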