PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

arXiv cs.CL / 3/30/2026


Key Points

  • PerceptionComp is a manually annotated benchmark for evaluating long-horizon, perception-centric video reasoning: no single moment suffices to answer a question, which instead requires visual evidence from multiple time points combined with logical constraints.
  • The benchmark comprises 1,114 questions over 279 videos from diverse domains, covering perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning.
  • Human studies show it demands substantially more reasoning (test-time thinking) and more perception steps than existing benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed.
  • Even state-of-the-art MLLMs perform poorly on PerceptionComp: Gemini-3-Flash reaches only 45.96% in the five-choice setting and open-source models remain below 40%, suggesting that perception-centric long-horizon video reasoning is still a major bottleneck.

Abstract

We introduce PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. The benchmark contains 1,114 highly complex questions on 279 videos from diverse domains including city walk tours, indoor villa tours, video games, and extreme outdoor sports, with 100% manual annotation. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed. State-of-the-art MLLMs also perform substantially worse on PerceptionComp than on existing benchmarks: the best model in our evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. These results suggest that perception-centric long-horizon video reasoning remains a major bottleneck, and we hope PerceptionComp will help drive progress in perceptual reasoning.
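For readers who want to relate the headline numbers to the five-choice protocol, below is a minimal sketch of how accuracy is typically scored on a benchmark of this kind. It is not the authors' evaluation harness: the annotation fields, file name, and the `ask_model` callable are hypothetical placeholders. The random baseline also illustrates why 18.97% is described as near chance, since guessing among five options yields 20% in expectation.

```python
# Minimal sketch (assumed format, not the authors' harness): scoring
# five-choice accuracy on PerceptionComp-style video QA annotations.
import json
import random
from typing import Callable, Dict, List


def five_choice_accuracy(
    questions: List[Dict],
    ask_model: Callable[[str, str, List[str]], str],
) -> float:
    """Fraction of questions whose predicted option letter matches the gold answer."""
    correct = 0
    for q in questions:
        # Each item is assumed to carry a video path, a question string,
        # exactly five answer options, and a gold option letter "A".."E".
        pred = ask_model(q["video_path"], q["question"], q["options"])
        if pred.strip().upper() == q["answer"].strip().upper():
            correct += 1
    return correct / len(questions)


def random_baseline(video_path: str, question: str, options: List[str]) -> str:
    # Chance baseline: with five options, expected accuracy is 1/5 = 20%,
    # which is why 18.97% without rewatching is characterized as near chance.
    return random.choice("ABCDE"[: len(options)])


if __name__ == "__main__":
    # Hypothetical annotation file: a JSON list of question objects.
    with open("perceptioncomp_questions.json") as f:
        questions = json.load(f)
    print(f"Random-baseline accuracy: {five_choice_accuracy(questions, random_baseline):.2%}")
```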