SF20K Competition 2025: Summary and findings

arXiv cs.CV / 5/5/2026


Key Points

  • The SF20K Competition 2025, run alongside the SLoMO Workshop at ICCV 2025, focused on story-level video understanding via an open-ended video question-answering task using amateur short films.
  • Models were evaluated on the SF20K-Test benchmark (95 movies, 979 QA pairs) with an automated judging approach (LLM-QA-Eval) powered by GPT-4.1-nano; see the sketch after this list for what that kind of LLM-as-judge scoring looks like.
  • The competition drew 22 teams and 286 submissions across a Main Track (unrestricted model size) and a Special Track (models under 8B parameters); the winning team reached 65.7% accuracy on the Main Track and 48.7% on the Special Track.
  • Key findings show that narrative-aware, shot-level processing beats uniform frame sampling, that multi-stage pipelines built from smaller models can rival far larger end-to-end models, and that subtitle quality is a major performance driver.
  • The results suggest the main bottleneck in long-form video QA is information selection and reasoning structure rather than raw model capacity, and there remains a large gap to human-level narrative comprehension.
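
For readers unfamiliar with LLM-based judging, the sketch below shows how an automated judge in the spirit of LLM-QA-Eval might score open-ended answers. It assumes an OpenAI-compatible chat API and a simple correct/incorrect grading prompt; the report's actual prompt and rubric are not reproduced here, so treat this as an illustration rather than the competition's scoring code.

```python
"""Minimal LLM-as-judge scoring loop (illustrative sketch, not LLM-QA-Eval itself)."""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical grading prompt; the competition's rubric may differ.
JUDGE_PROMPT = (
    "You are grading an open-ended video question-answering system.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Reply with a single word: CORRECT or INCORRECT."
)


def judge_answer(question: str, reference: str, prediction: str) -> bool:
    """Ask the judge model whether the predicted answer matches the reference."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",  # judge model named in the report
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0.0,  # deterministic grading
    )
    verdict = response.choices[0].message.content.strip().upper()
    return "INCORRECT" not in verdict and "CORRECT" in verdict


def accuracy(qa_pairs: list[dict]) -> float:
    """qa_pairs: [{'question': ..., 'answer': ..., 'prediction': ...}, ...]"""
    correct = sum(
        judge_answer(p["question"], p["answer"], p["prediction"])
        for p in qa_pairs
    )
    return correct / len(qa_pairs)
```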

Abstract

This report presents the results and findings of the first edition of the Short-Films 20K (SF20K) Competition, held in conjunction with the SLoMO Workshop at ICCV 2025. The competition is designed to advance story-level video understanding beyond short-clip action recognition, introducing an open-ended video question-answering task built on a corpus of amateur short films. This setup ensures that models must rely on multimodal understanding rather than memorization of popular movies. Evaluation is conducted using the SF20K-Test benchmark (95 movies, 979 question-answer pairs) and scored via LLM-QA-Eval, an automated judge based on GPT-4.1-nano. The competition attracted 22 teams and 286 submissions across two tracks: a Main Track with unrestricted model size and a Special Track limited to models under 8 billion parameters. The winning team achieved 65.7% accuracy on the Main Track and 48.7% on the Special Track, against a human performance ceiling of 91.7%. Our analysis reveals several key findings: narrative-aware, shot-level processing consistently outperforms uniform frame sampling; well-designed multi-stage pipelines using smaller models can match or exceed end-to-end inference with models over 30x larger; and subtitle quality is a dominant factor in performance. These results highlight that the primary bottleneck in long-form video QA lies in information selection and reasoning structure rather than raw model capacity, and that a substantial gap remains between current methods and human-level narrative comprehension.
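
To make the first finding concrete, the sketch below contrasts uniform frame sampling with shot-level sampling, the kind of narrative-aware selection the report credits with better results. It assumes OpenCV for frame decoding and PySceneDetect for shot-boundary detection; the competition entries' actual pipelines are not described at this level of detail, so this is only a minimal illustration of the two strategies.

```python
"""Uniform vs. shot-aware frame sampling (illustrative sketch)."""
import cv2
from scenedetect import detect, ContentDetector


def uniform_sample(video_path: str, num_frames: int) -> list:
    """Baseline: pick frames at evenly spaced indices, ignoring narrative structure."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def shot_level_sample(video_path: str, frames_per_shot: int = 1) -> list:
    """Shot-aware: detect shot boundaries, then sample inside each shot so that
    every narrative unit is represented at least once."""
    shots = detect(video_path, ContentDetector())  # list of (start, end) timecodes
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, end in shots:
        first, last = start.get_frames(), end.get_frames()
        step = max(1, (last - first) // (frames_per_shot + 1))
        for k in range(1, frames_per_shot + 1):
            idx = min(first + k * step, last - 1)  # stay inside the current shot
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                frames.append(frame)
    cap.release()
    return frames
```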