Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models
arXiv cs.CL / 4/10/2026
Key Points
- The paper proposes “training-free evidence retrieval” for vision-language models by treating grounding as an iterative, test-time process of finding where to look next for ambiguous queries.
- It introduces an entropy-gradient relevance map computed by backpropagating entropy of the model’s next-token distribution to visual token embeddings, avoiding auxiliary detectors or attention-map heuristics.
- For multi-evidence (compositional) questions, the method extracts and ranks multiple coherent visual regions, assembling supporting evidence from different regions of the input image.
- An iterative zoom-and-reground strategy with a spatial-entropy stopping rule helps prevent over-refinement while improving localization quality.
- Experiments on seven benchmarks across four VLM architectures show consistent gains over prior approaches, especially in detail-critical and high-resolution settings, and yield more interpretable evidence localizations.
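The core idea in the points above — score each visual token by the gradient of the model's next-token entropy with respect to that token's embedding, and stop refining once the resulting map's spatial entropy is low — can be sketched with a toy model. Everything below is illustrative: the two-layer "head" (`M`, `U`), the shapes, and the finite-difference gradient (standing in for the backpropagation a real VLM would use) are assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a VLM head: per-token features are mean-pooled, then
# projected to next-token logits. M and U are hypothetical parameters.
K, D, H_DIM, V = 16, 8, 12, 32          # visual tokens, embed dim, hidden dim, vocab
M = rng.normal(size=(H_DIM, D))         # token-wise feature projection (assumed)
U = rng.normal(size=(V, H_DIM))         # logit head (assumed)

def next_token_entropy(tokens):
    """Entropy of the next-token distribution given visual token embeddings."""
    h = np.tanh(tokens @ M.T).mean(axis=0)    # per-token nonlinearity, mean-pool
    z = U @ h                                 # next-token logits
    p = np.exp(z - z.max()); p /= p.sum()     # softmax
    return -(p * np.log(p + 1e-12)).sum()

def relevance_map(tokens, eps=1e-4):
    """Entropy-gradient relevance: ||dH/dv_k|| per visual token, computed
    here with central finite differences instead of autograd."""
    rel = np.zeros(len(tokens))
    for k in range(len(tokens)):
        grad = np.zeros(tokens.shape[1])
        for d in range(tokens.shape[1]):
            plus, minus = tokens.copy(), tokens.copy()
            plus[k, d] += eps
            minus[k, d] -= eps
            grad[d] = (next_token_entropy(plus) - next_token_entropy(minus)) / (2 * eps)
        rel[k] = np.linalg.norm(grad)
    return rel

def spatial_entropy(rel):
    """Entropy of the normalized relevance map. A low value means the map is
    concentrated on few tokens, which a stopping rule could use to end the
    zoom-and-reground loop before over-refinement."""
    q = rel / rel.sum()
    return -(q * np.log(q + 1e-12)).sum()

tokens = rng.normal(size=(K, D))        # stand-in for patch embeddings
rel = relevance_map(tokens)
print(rel.shape, round(float(spatial_entropy(rel)), 3))
```

In a real VLM one would obtain `dH/dv_k` in a single backward pass through autograd, reshape `rel` into the patch grid to extract coherent regions, crop and re-encode the top-ranked region, and repeat until the spatial entropy drops below a threshold.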