Omni IIE Bench: Benchmarking the Practical Capabilities of Image Editing Models

arXiv cs.CV / 3/19/2026


Key Points

  • Omni IIE Bench is introduced to diagnose the editing consistency of image editing models across tasks of varying semantic scales in practical applications.
  • The benchmark uses a dual-track diagnostic design: Single-turn Consistency with shared-context task pairs and Multi-turn Coordination involving continuous dialogue tasks across semantic scales.
  • It is built through a rigorous multi-stage human filtering process, with quality validation by computer vision graduate students and industry relevance review by professional designers.
  • The authors evaluate 8 mainstream IIE models and find that nearly all exhibit significant performance degradation when moving from low-semantic-scale to high-semantic-scale tasks.
  • Omni IIE Bench provides diagnostic tools and insights intended to drive the development of next-generation, more reliable and stable IIE models.
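The degradation finding above can be made concrete as a simple per-model gap metric. The sketch below is illustrative only: the model names, scores, and the particular gap definition are assumptions for exposition, not taken from the paper.

```python
# Hedged sketch (not the paper's code): one plausible way to quantify the
# low-to-high semantic-scale degradation that Omni IIE Bench reports.
# All model names and scores below are invented for illustration.

def consistency_gap(low_scale_score: float, high_scale_score: float) -> float:
    """Score drop when moving from low- to high-semantic-scale tasks."""
    return low_scale_score - high_scale_score

# Hypothetical per-model scores on the two semantic scales (0-1 range).
scores = {
    "model_a": (0.82, 0.61),
    "model_b": (0.78, 0.70),
    "model_c": (0.85, 0.55),
}

# Rank models by how well they hold up across scales (smaller gap = more consistent).
ranked = sorted(scores.items(), key=lambda kv: consistency_gap(*kv[1]))
for name, (low, high) in ranked:
    print(f"{name}: low={low:.2f} high={high:.2f} gap={consistency_gap(low, high):.2f}")
```

A per-task-pair version of the same idea (comparing shared-context pairs rather than scale-level averages) would align more closely with the benchmark's Single-turn Consistency track.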

Abstract

While Instruction-based Image Editing (IIE) has achieved significant progress, existing benchmarks pursue task breadth via mixed evaluations. This paradigm obscures a failure mode that is critical in professional applications: the inconsistent performance of models across tasks of varying semantic scales. To address this gap, we introduce Omni IIE Bench, a high-quality, human-annotated benchmark specifically designed to diagnose the editing consistency of IIE models in practical application scenarios. Omni IIE Bench features an innovative dual-track diagnostic design: (1) Single-turn Consistency, comprising shared-context task pairs of attribute modification and entity replacement; and (2) Multi-turn Coordination, involving continuous dialogue tasks that traverse semantic scales. The benchmark is constructed via an exceptionally rigorous multi-stage human filtering process, incorporating a quality standard enforced by computer vision graduate students and an industry relevance review conducted by professional designers. We perform a comprehensive evaluation of 8 mainstream IIE models using Omni IIE Bench. Our analysis quantifies, for the first time, a prevalent performance gap: nearly all models exhibit significant performance degradation when transitioning from low-semantic-scale to high-semantic-scale tasks. Omni IIE Bench provides critical diagnostic tools and insights for the development of next-generation, more reliable, and stable IIE models.