UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs

arXiv cs.CV / 4/20/2026


Key Points

  • The paper introduces UniEditBench, a unified benchmark designed to fairly evaluate both image and video editing models under a shared protocol across different paradigms.
  • It defines a detailed taxonomy with broad operation coverage—nine image operations and eight video operations—including challenging compositional tasks like counting and spatial reordering.
  • Because existing automatic metrics often diverge from human preferences and directly deploying large multimodal models (MLLMs) as evaluators is too costly, the authors distill a high-capacity MLLM judge into lightweight 4B/8B evaluators.
  • The distilled evaluators deliver multi-dimensional scoring (e.g., structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency for videos) and show strong agreement with human judgments while greatly reducing evaluation cost.
  • UniEditBench and the associated reward models are released publicly for reproducible benchmarking of modern visual editing methods.
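The multi-dimensional scoring described above can be sketched as a small data structure. The dimension names come from the paper; the 0–10 scale, the uniform averaging, and the optional video-only axis are illustrative assumptions, not the paper's actual aggregation protocol:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditScore:
    """Per-dimension editing scores on an assumed 0-10 scale.

    Dimension names follow UniEditBench; the aggregation below is a
    hypothetical sketch (uniform average), not the paper's method.
    """
    structural_fidelity: float
    text_alignment: float
    background_consistency: float
    naturalness: float
    # Videos get an extra axis; images leave it as None.
    temporal_spatial_consistency: Optional[float] = None

    def overall(self) -> float:
        # Average only the dimensions that apply to this modality.
        dims = [
            self.structural_fidelity,
            self.text_alignment,
            self.background_consistency,
            self.naturalness,
        ]
        if self.temporal_spatial_consistency is not None:
            dims.append(self.temporal_spatial_consistency)
        return sum(dims) / len(dims)
```

A real protocol might weight dimensions differently per operation type (e.g. background consistency matters more for Remove than for Adjust), but a flat average keeps the sketch simple.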

Abstract

The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.
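The distillation recipe implied by the abstract (a large teacher judge labels editing outputs, and a 4B/8B student is fine-tuned to reproduce those judgments) can be sketched as a data-preparation step. Everything below is a hypothetical illustration: the field names, the JSON target format, and the `teacher_scores` dict are assumptions, not the paper's actual training format:

```python
import json


def build_distillation_record(instruction: str,
                              source_path: str,
                              edited_path: str,
                              teacher_scores: dict) -> dict:
    """Package one (edit, teacher-judgment) pair as a supervised
    fine-tuning example for a lightweight student evaluator.

    In the paper's setting, `teacher_scores` would come from the
    Qwen3-VL-235B-A22B Instruct judge; here it is any dict mapping
    dimension name -> score. The schema is an illustrative assumption.
    """
    return {
        # What the student model sees at evaluation time.
        "prompt": {
            "instruction": instruction,
            "source": source_path,
            "edited": edited_path,
        },
        # What the student is trained to emit: the teacher's
        # per-dimension scores, serialized deterministically.
        "target": json.dumps(teacher_scores, sort_keys=True),
    }
```

Training the student on such records amounts to standard response-matching distillation: the student never sees the teacher's weights, only its scored outputs, which is what makes 4B/8B-scale evaluators cheap to deploy.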
