Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
arXiv cs.AI / 4/20/2026
💬 Opinion | Ideas & Deep Analysis | Tools & Practical Usage | Models & Research
Key Points
- The paper argues that intermediate outputs in multi-step zoom-in visual grounding pipelines contain a “free” confidence signal called zoom consistency, defined as the geometric distance between a step-2 prediction and the crop center.
- Zoom consistency is proposed as a calibration-free uncertainty measure because it is a geometric quantity in a shared coordinate space, allowing direct comparison across different VLM architectures.
- Under idealized assumptions, the authors show that zoom consistency is a linear estimator of step-1 spatial error, and their experiments show that it correlates with prediction correctness across two VLMs.
- As a proof of concept, zoom consistency is used to route inputs between a specialist and a generalist model, capturing 16.5% of the oracle headroom (a reported +0.8% gain; McNemar p = 0.19, which is not significant at the conventional 0.05 level).
- The authors provide code for the routing approach in a public GitHub repository.
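
To make the definition in the key points concrete, here is a minimal Python sketch of how such a signal could be computed and used for routing. It is not the authors' released code. Several details are assumptions made for illustration: the distance is taken to be Euclidean, both points are assumed to be already expressed in the same normalized image coordinates, the function names `zoom_consistency` and `route` and the `threshold` parameter are hypothetical, and the routing direction (keep the specialist when the distance is small, otherwise defer to the generalist) is a guess at one plausible policy.

```python
import math
from typing import Tuple

# A point in the shared coordinate space (assumed: normalized image coordinates).
Point = Tuple[float, float]


def zoom_consistency(step2_pred: Point, crop_center: Point) -> float:
    """Distance between the step-2 prediction and the center of the step-1 crop.

    Assumption: Euclidean distance, with both points in the same coordinate
    space. A small value means step 2 agreed with where step 1 zoomed in.
    """
    dx = step2_pred[0] - crop_center[0]
    dy = step2_pred[1] - crop_center[1]
    return math.hypot(dx, dy)


def route(distance: float, threshold: float) -> str:
    """Pick a model from the zoom-consistency distance.

    Assumption: a small distance is treated as high confidence in the
    specialist; anything above the (hypothetical) threshold is deferred
    to the generalist.
    """
    return "specialist" if distance <= threshold else "generalist"


if __name__ == "__main__":
    # Hypothetical values: the crop was centered at (0.40, 0.60) and the
    # step-2 prediction landed at (0.42, 0.61).
    d = zoom_consistency((0.42, 0.61), (0.40, 0.60))
    print(f"zoom consistency: {d:.4f}, routed to: {route(d, threshold=0.05)}")
```

In practice the threshold would have to be chosen on held-out data. Because the signal is a plain geometric distance, the same threshold scale can in principle be compared across models without a separate calibration step, which is the property the summary highlights.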



