VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects
arXiv cs.CL / April 20, 2026
📰 News · Models & Research
Key Points
- The paper introduces VEFX-Dataset, a large-scale, human-annotated dataset for instruction-guided video editing with 5,049 examples spanning multiple edit categories, each labeled for quality along three decoupled dimensions.
- It proposes VEFX-Reward, a dedicated reward model that evaluates video editing quality by jointly analyzing the source video, the editing instruction, and the edited result.
- The work releases VEFX-Bench, a curated benchmark of 300 video-prompt pairs to enable standardized comparisons between different video editing systems.
- Experiments indicate that VEFX-Reward matches human judgments more closely than generic vision-language-model judges and prior reward models, and it is used to benchmark both commercial and open-source editors.
- Benchmark results reveal a persistent gap in current systems among visual plausibility, instruction following, and edit locality (keeping unrelated regions untouched).
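The reward model described above scores an edit by jointly conditioning on the source video, the instruction, and the edited result, producing scores along three decoupled dimensions. A minimal sketch of such a scoring interface is shown below; all names, the choice of dimensions' weighting, and the dummy scoring logic are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class EditScore:
    """Hypothetical per-dimension scores; field names are assumptions."""
    visual_quality: float         # plausibility of the edited video
    instruction_following: float  # does the edit match the instruction?
    edit_locality: float          # are unrelated regions left untouched?

    def aggregate(self) -> float:
        # Unweighted mean for illustration; the paper's aggregation may differ.
        return (self.visual_quality
                + self.instruction_following
                + self.edit_locality) / 3.0

def score_edit(source_frames, instruction, edited_frames) -> EditScore:
    """Placeholder for a learned reward model that jointly encodes the
    source video, the editing instruction, and the edited result."""
    # A real model would run all three inputs through a video-language
    # backbone; here we just return fixed dummy scores.
    return EditScore(visual_quality=0.8,
                     instruction_following=0.7,
                     edit_locality=0.9)

score = score_edit(["frame0.png"], "make the sky purple", ["frame0_edit.png"])
print(round(score.aggregate(), 3))
```

Decoupling the three dimensions lets a benchmark report, for example, that an editor is visually plausible but leaks changes outside the targeted region, rather than collapsing everything into one opaque number.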