SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation

arXiv cs.CV / 4/1/2026


Key Points

  • The paper introduces SLVMEval, a synthetic meta-evaluation benchmark designed to test how well text-to-video (T2V) evaluation systems assess the quality of long videos, up to roughly 3 hours (10,486 seconds).
  • It uses a pairwise comparison framework with controlled degradations across 10 aspects, generating “high-quality vs. low-quality” video pairs from dense video-captioning datasets (a minimal scoring sketch follows this list).
  • Crowdsourcing is used to keep only degradation cases that are clearly perceptible to humans, ensuring the benchmark reflects what humans can reliably judge.
  • In experiments, humans choose the better long video with 84.7%–96.8% accuracy, while existing evaluation systems underperform human judgment in 9 of the 10 aspects, indicating reliability gaps.
  • The results highlight that current T2V evaluation pipelines may not yet reliably rank long-form video quality, especially across multiple quality dimensions.
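The meta-evaluation protocol reduces to a simple ranking check: an evaluation system is credited whenever it scores the high-quality member of a pair above its degraded counterpart. Below is a minimal sketch of that per-aspect accuracy computation; the `VideoPair` fields and the `score_fn` interface are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class VideoPair:
    """One benchmark item: a source video and its synthetically degraded copy."""
    aspect: str          # which of the 10 quality aspects was degraded (hypothetical field)
    prompt: str          # the text/caption the videos are evaluated against
    clean_video: str     # path to the high-quality (undegraded) video
    degraded_video: str  # path to the low-quality (degraded) video


def pairwise_accuracy(
    pairs: List[VideoPair],
    score_fn: Callable[[str, str], float],
) -> Dict[str, float]:
    """Per-aspect fraction of pairs in which the evaluation system scores
    the clean video strictly higher than its degraded counterpart."""
    correct: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for pair in pairs:
        total[pair.aspect] = total.get(pair.aspect, 0) + 1
        clean_score = score_fn(pair.prompt, pair.clean_video)
        degraded_score = score_fn(pair.prompt, pair.degraded_video)
        if clean_score > degraded_score:
            correct[pair.aspect] = correct.get(pair.aspect, 0) + 1
    return {aspect: correct.get(aspect, 0) / n for aspect, n in total.items()}
```

Reporting accuracy per aspect, rather than pooled across all pairs, is what exposes the nine-of-ten gap between existing systems and human judgment described above.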

Abstract

This paper proposes the Synthetic Long-Video Meta-Evaluation (SLVMEval) benchmark for meta-evaluating text-to-video (T2V) evaluation systems. SLVMEval assesses these systems on videos of up to 10,486 s (approximately 3 h). The benchmark targets a fundamental requirement: whether the systems can accurately assess video quality in settings that are easy for humans to judge. We adopt a pairwise comparison-based meta-evaluation framework. Building on dense video-captioning datasets, we synthetically degrade source videos to create controlled "high-quality versus low-quality" pairs across 10 distinct aspects. We then employ crowdsourcing to filter and retain only those pairs in which the degradation is clearly perceptible, thereby establishing an effective final testbed. Using this testbed, we assess the reliability of existing evaluation systems in ranking these pairs. Experimental results demonstrate that human evaluators can identify the better long video with 84.7%-96.8% accuracy, whereas the accuracy of existing systems falls short of human assessment in 9 of the 10 aspects, revealing weaknesses in text-to-long-video evaluation.
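The summary does not enumerate the 10 degradation aspects, so the following is only a hypothetical illustration of how one controlled degradation might be built from a dense-captioning annotation: segments are re-ordered so the degraded cut no longer matches the temporal order described by the captions. The `Segment` format and the `shuffle_segments` helper are assumptions for illustration, not the paper's pipeline.

```python
import random
from typing import List, Tuple

# A dense-captioning annotation: (start_sec, end_sec, caption) per segment.
Segment = Tuple[float, float, str]


def shuffle_segments(segments: List[Segment], seed: int = 0) -> List[Segment]:
    """Hypothetical degradation: permute the source segments so the degraded
    cut contradicts the temporal order described by the dense captions.
    Retimed segments keep their original durations but appear out of order."""
    rng = random.Random(seed)
    order = list(range(len(segments)))
    while True:  # retry until the permutation actually changes the order
        rng.shuffle(order)
        if order != list(range(len(segments))) or len(segments) < 2:
            break
    degraded: List[Segment] = []
    cursor = 0.0
    for idx in order:
        start, end, caption = segments[idx]
        duration = end - start
        degraded.append((cursor, cursor + duration, caption))
        cursor += duration
    return degraded


if __name__ == "__main__":
    source = [(0.0, 12.5, "a chef chops vegetables"),
              (12.5, 30.0, "the vegetables are stir-fried"),
              (30.0, 41.0, "the dish is plated and served")]
    for seg in shuffle_segments(source, seed=7):
        print(seg)
```

In the benchmark's pairing scheme, the original annotation would correspond to the "high-quality" member of a pair and a degraded edit like the one above to the "low-quality" member; crowdsourced filtering would then keep the pair only if the difference is clearly perceptible to humans.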