VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
arXiv cs.CV / 4/3/2026
Key Points
- The paper introduces VideoZeroBench, a new hierarchical benchmark for long-video question answering that verifies spatio-temporal evidence rather than relying only on answer accuracy.
- It contains 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes that serve as required evidence for correct predictions.
- A five-level evaluation protocol separates answer generation from temporal grounding and spatial grounding by progressively tightening constraints on what the model must correctly localize.
- Results indicate a large gap between surface-level correctness and evidence-based reasoning: Gemini-3-Pro answers fewer than 17% of questions correctly at Level-3, and essentially no model exceeds 1% at Level-5, where both the answer and precise spatio-temporal localization must be correct.
- The authors provide additional analyses (e.g., performance vs. minimal evidence spans and atomic abilities) and plan to release the benchmark and code publicly to support future grounded video reasoning research.
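To make the hierarchical protocol concrete, the sketch below shows one plausible way such progressively tightened checks could be scored. The level definitions, field names (`answer`, `interval`, `box`), and IoU thresholds are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch of a five-level, progressively stricter evaluation.
# Assumes each sample carries a gold answer, a temporal interval (start, end),
# and a spatial bounding box (x1, y1, x2, y2). Thresholds are placeholders.

def interval_iou(a, b):
    """Temporal IoU between two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def passed_level(pred, gold, level, t_thr=0.5, s_thr=0.5):
    """A prediction passes Level-k only if it satisfies checks 1..k."""
    checks = [
        pred["answer"] == gold["answer"],                           # L1: answer only
        interval_iou(pred["interval"], gold["interval"]) > 0,       # L2: any temporal overlap
        interval_iou(pred["interval"], gold["interval"]) >= t_thr,  # L3: tight temporal grounding
        box_iou(pred["box"], gold["box"]) > 0,                      # L4: any spatial overlap
        box_iou(pred["box"], gold["box"]) >= s_thr,                 # L5: tight spatio-temporal grounding
    ]
    return all(checks[:level])
```

Because each level subsumes the previous ones, per-level accuracy is monotonically non-increasing, which is why a model can score well at Level-1 while collapsing at Level-5.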
