WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

arXiv cs.CV / 5/6/2026

💬 Opinion · Models & Research

Key Points

  • WorldJen proposes an end-to-end, multi-dimensional benchmark to better evaluate generative video models beyond pixel-fidelity metrics and distribution-focused scores.
  • It replaces binary VQA with Likert-scale questionnaires graded by a vision-language model (VLM) that assesses videos at native resolution, aiming to capture semantic and temporal quality more reliably.
  • To reduce evaluation costs and avoid single-dimension prompts, WorldJen uses adversarially curated prompts designed to stress up to 16 quality dimensions in one go.
  • The benchmark is grounded in a blind human preference study (2,696 pairwise annotations across 50 prompts × 6 state-of-the-art models) producing a three-tier human ground truth from Bradley–Terry ratings (a minimal fitting sketch follows this list).
  • A VLM-as-a-judge engine reproduces the human three-tier structure with perfect rank correlation (Spearman ρ̂ = 1.000), and ablation studies confirm the robustness of the evaluation setup.
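
The Bradley–Terry rating referenced above can be fit from raw pairwise win counts with a short iterative routine. Below is a minimal sketch, not the paper's code, using the classic Zermelo / minorization-maximization update; the `wins` matrix and the random data are purely illustrative, with only the count of 6 models taken from the paper.

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=200, tol=1e-8):
    """Fit BT strengths w_i such that P(i beats j) = w_i / (w_i + w_j),
    via the classic Zermelo / minorization-maximization update."""
    n = wins.shape[0]
    w = np.ones(n)
    for _ in range(n_iters):
        w_new = np.empty(n)
        for i in range(n):
            num = wins[i].sum()                      # total wins of model i
            den = sum((wins[i, j] + wins[j, i]) / (w[i] + w[j])
                      for j in range(n) if j != i)   # comparison mass
            w_new[i] = num / den if den > 0 else w[i]
        w_new /= w_new.sum()                         # fix the arbitrary scale
        if np.max(np.abs(w_new - w)) < tol:
            return w_new
        w = w_new
    return w

# Hypothetical win counts for 6 models: wins[i][j] = how often i beat j.
rng = np.random.default_rng(0)
wins = rng.integers(0, 20, size=(6, 6))
np.fill_diagonal(wins, 0)

strengths = fit_bradley_terry(wins)
print(np.argsort(-strengths))  # models ranked best-first by BT strength
```

Gaps in the fitted strengths are what would let ratings like these cluster into the three tiers the study reports.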

Abstract

Evaluating generative video models remains an open problem. Reference-based metrics such as the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) benchmarks like VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are reduced by adversarially curated prompts designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 curated prompts × 6 state-of-the-art video models. The study achieves a mean inter-annotator agreement of 66.9% and establishes a human ground-truth Bradley–Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine uses prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) to judge the videos and independently reproduces the human-established three-tier BT rating structure. The VLM achieves a Spearman ρ̂ = 1.000 (p = 0.0014), interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework.
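
As a toy illustration of the two aggregation steps the abstract describes, the sketch below averages hypothetical per-dimension Likert responses into one score per model and then checks rank agreement against hypothetical human BT ratings with SciPy's Spearman correlation. All data here are invented; only the shapes (10 questions per dimension, up to 16 dimensions, 6 models) mirror numbers quoted in the abstract, and a 1-5 Likert scale is assumed.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical Likert responses on an assumed 1-5 scale:
# models x dimensions x questions.
likert = rng.integers(1, 6, size=(6, 16, 10)).astype(float)

# Collapse questions, then dimensions, into one VLM score per model.
vlm_scores = likert.mean(axis=(1, 2))

# Hypothetical human Bradley-Terry ratings for the same 6 models,
# arranged in the three-tier pattern the paper describes.
human_bt = np.array([1.8, 1.7, 1.1, 1.0, 0.4, 0.3])

# Rank agreement between the two score vectors; for its real data the
# paper reports rho = 1.000 with p = 0.0014.
rho, p = spearmanr(vlm_scores, human_bt)
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")
```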