SIEVES: Selective Prediction Generalizes through Visual Evidence Scoring
arXiv cs.CV · April 29, 2026
Key Points
- The paper introduces SIEVES, a selective prediction method for multimodal large language models that improves reliability in out-of-distribution (OOD) visual-language settings by using localized visual evidence scoring.
- SIEVES requires “reasoner” models to produce localized visual evidence and trains a separate selector to estimate the quality of that localization so the system can abstain when risk would exceed a user-defined tolerance.
- Experiments show coverage gains of up to 3× on multiple challenging OOD benchmarks (V* Bench, HR-Bench-8k, MME-RealWorld-Lite, VizWiz, and AdVQA) compared with non-grounding baselines.
- The selector design supports transfer to proprietary reasoners (e.g., o3 and Gemini-3-Pro) without access to their internal weights or logits, yielding coverage improvements beyond what accuracy alone would provide.
- Results indicate SIEVES generalizes across all tested OOD datasets and reasoner models without any benchmark-specific or reasoner-specific training or adaptation.
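The abstain-when-risky behavior described above is the core of selective prediction: a selector assigns a confidence score to each prediction, and a threshold is calibrated so that the error rate among accepted predictions stays within the user-defined risk tolerance. The sketch below is a generic illustration of that calibration loop, not the paper's implementation; the function names and the toy data are assumptions for demonstration.

```python
import numpy as np

def calibrate_threshold(scores, correct, risk_tol):
    """Find the lowest selector-score threshold whose accepted set
    has empirical error <= risk_tol, maximizing coverage.

    scores  : selector confidence per example (higher = more trusted)
    correct : 1 if the reasoner's answer was right, else 0
    risk_tol: maximum acceptable error rate among accepted examples
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Scan candidate thresholds from low to high; the first one that
    # satisfies the risk constraint accepts the most examples.
    for t in np.unique(scores):
        accepted = scores >= t
        if accepted.sum() == 0:
            continue
        risk = 1.0 - correct[accepted].mean()
        if risk <= risk_tol:
            return t
    return None  # no threshold meets the tolerance

def predict_or_abstain(score, threshold):
    """Answer only when the selector score clears the calibrated bar."""
    if threshold is not None and score >= threshold:
        return "answer"
    return "abstain"
```

Calibrating on a held-out set and then applying the threshold at test time is what lets coverage (the fraction of questions answered) be traded off against risk; the paper's reported "coverage gains" are measured at a fixed risk level in exactly this sense.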