SLVMEval: Synthetic Meta Evaluation Benchmark for Text-to-Long Video Generation
arXiv cs.CV / 4/1/2026
Key Points
- The paper introduces SLVMEval, a synthetic meta-evaluation benchmark designed to test how well evaluation systems for text-to-long-video (T2V) generation measure quality on videos up to roughly 3 hours (10,486 seconds) long.
- It uses a pairwise comparison framework with controlled degradations across 10 aspects, generating “high-quality vs low-quality” video pairs from dense video-captioning datasets.
- Crowdsourcing is used to keep only degradation cases that are clearly perceptible to humans, ensuring the benchmark reflects what humans can reliably judge.
- In experiments, humans choose the better long video with 84.7%–96.8% accuracy, while existing evaluation systems underperform human judgment in 9 of the 10 aspects, indicating reliability gaps.
- The results highlight that current T2V evaluation pipelines may not yet reliably rank long-form video quality, especially across multiple quality dimensions.
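The core metric in a pairwise meta-evaluation like this is simple: given pairs of (high-quality, degraded) videos, count how often an evaluator scores the high-quality one higher. The following minimal sketch illustrates that computation; the function name and toy scores are hypothetical and not taken from the paper.

```python
def pairwise_accuracy(scored_pairs):
    """Fraction of (high, degraded) pairs where the evaluator scored the
    high-quality video strictly above its degraded counterpart.

    scored_pairs: iterable of (score_for_high, score_for_degraded) tuples.
    """
    scored_pairs = list(scored_pairs)
    correct = sum(hi > lo for hi, lo in scored_pairs)
    return correct / len(scored_pairs)

# Toy evaluator outputs: the last pair is a miss, since the degraded
# video received the higher score.
scores = [(0.9, 0.4), (0.8, 0.7), (0.6, 0.65)]
print(pairwise_accuracy(scores))  # -> 0.6666666666666666
```

Human annotators in the paper reach 84.7%–96.8% by this kind of measure, while most automatic evaluators fall short of that range.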