VCBench: A Streaming Counting Benchmark for Spatial-Temporal State Maintenance in Long Videos
arXiv cs.CV / 3/16/2026
Key Points
- VCBench introduces a streaming counting benchmark to diagnose how video models maintain spatial-temporal world state during long videos.
- It decomposes state maintenance into object counting (visible objects vs. cumulative identities) and event counting (instant actions vs. complete activity cycles) across eight fine-grained subcategories.
- The dataset contains 406 videos with frame-by-frame annotations of 10,071 event moments and object state changes, plus 1,000 streaming QA pairs and 4,576 query points.
- Evaluation shows that mainstream video-language models exhibit significant deficiencies in state maintenance, especially in counting periodic events, underscoring the benchmark's diagnostic value.
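The streaming setup described above can be illustrated with a minimal evaluation sketch. This is a hypothetical illustration, not VCBench's actual protocol: the `QueryPoint` structure, category names, and metrics below are assumptions. The idea is that at each of the benchmark's query points, a model is asked for the current count (e.g. visible objects vs. cumulative identities) and is scored against frame-level ground truth.

```python
from dataclasses import dataclass

@dataclass
class QueryPoint:
    timestamp: float   # query time within the video stream
    category: str      # hypothetical label, e.g. "visible_objects" or "cumulative_identities"
    true_count: int    # annotated ground-truth count at this timestamp

def evaluate_stream(queries, predict):
    """Score a model on streaming count queries.

    `predict(timestamp, category)` is a stand-in for the model: it returns
    the model's count estimate at that point in the stream. Per category,
    we report exact-match accuracy and mean absolute error.
    """
    stats = {}  # category -> (num_exact, abs_error_sum, num_queries)
    for q in sorted(queries, key=lambda q: q.timestamp):
        pred = predict(q.timestamp, q.category)
        acc, err, n = stats.get(q.category, (0, 0, 0))
        stats[q.category] = (acc + (pred == q.true_count),
                             err + abs(pred - q.true_count),
                             n + 1)
    return {c: {"accuracy": a / n, "mae": e / n}
            for c, (a, e, n) in stats.items()}
```

For example, an oracle that returns the annotated counts would score 1.0 accuracy and 0.0 MAE; a model that loses track of identities over long horizons would show rising MAE on cumulative-count queries even while instantaneous counts stay accurate.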