Evaluating Image Editing with LLMs: A Comprehensive Benchmark and Intermediate-Layer Probing Approach

arXiv cs.CV / 3/23/2026


Key Points

  • TIEdit introduces a benchmark for evaluating text-guided image editing across perceptual quality, alignment with instructions, and content preservation, using 512 source images and 5,120 edited images from 10 TIE models.
  • The study collects 307,200 raw subjective ratings from 20 experts to derive 15,360 mean opinion scores across three evaluation dimensions (the counts are tallied in the sketch after this list), underscoring the need for reliable, human-aligned benchmarks.
  • EditProbe, an LLM-based evaluator, uses intermediate-layer representations from multimodal LLMs to better capture semantic and perceptual relationships between source images, editing instructions, and edited results.
  • Results show that widely used automatic metrics poorly correlate with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception.
  • Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
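
The headline numbers fit together cleanly. Here is a quick sanity check of the bookkeeping, using my own arithmetic from the figures above rather than code from the paper:

```python
# Benchmark bookkeeping implied by the reported numbers (illustrative only).
n_sources = 512   # source images, each paired with an editing prompt
n_models = 10     # state-of-the-art TIE models
n_dims = 3        # perceptual quality, editing alignment, content preservation
n_raters = 20     # recruited expert annotators

n_edited = n_sources * n_models   # 512 * 10  = 5,120 edited images
n_mos = n_edited * n_dims         # 5,120 * 3 = 15,360 mean opinion scores
n_raw = n_mos * n_raters          # 15,360 * 20 = 307,200 raw ratings

assert (n_edited, n_mos, n_raw) == (5_120, 15_360, 307_200)
```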

Abstract

Evaluating text-guided image editing (TIE) methods remains a challenging problem, as reliable assessment must simultaneously consider perceptual quality, alignment with textual instructions, and preservation of original image content. Despite rapid progress in TIE models, existing evaluation benchmarks remain limited in scale and often show weak correlation with human perceptual judgments. In this work, we introduce TIEdit, a benchmark for systematic evaluation of text-guided image editing methods. TIEdit consists of 512 source images paired with editing prompts across eight representative editing tasks, producing 5,120 edited images generated by ten state-of-the-art TIE models. To obtain reliable subjective ratings, 20 experts are recruited to produce 307,200 raw subjective ratings, which are aggregated into 15,360 mean opinion scores (MOSs) across three evaluation dimensions: perceptual quality, editing alignment, and content preservation. Beyond the benchmark itself, we further propose EditProbe, an LLM-based evaluator that estimates editing quality via intermediate-layer probing of hidden representations. Instead of relying solely on final model outputs, EditProbe extracts informative representations from intermediate layers of multimodal large language models to better capture the semantic and perceptual relationships between source images, editing instructions, and edited results. Experimental results demonstrate that widely used automatic evaluation metrics correlate only weakly with human judgments on editing tasks, while EditProbe achieves substantially stronger alignment with human perception. Together, TIEdit and EditProbe provide a foundation for more reliable and perceptually aligned evaluation of text-guided image editing methods.
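
The abstract only sketches how EditProbe works; the core idea is to read out intermediate hidden states of a multimodal LLM and regress quality scores from them instead of relying on the final output. Below is a minimal sketch of that pattern, assuming a HuggingFace-style multimodal model that exposes per-layer hidden states via output_hidden_states=True. The probe head, the chosen layer indices, and the prompt packing are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of intermediate-layer probing for editing-quality regression.
# `mllm`, `processor`, the layer choices, and `EditProbeHead` are hypothetical;
# the paper's actual architecture and input format may differ.
import torch
import torch.nn as nn


class EditProbeHead(nn.Module):
    """Lightweight regressor over pooled intermediate-layer features."""

    def __init__(self, hidden_size: int, layer_ids=(8, 16, 24)):
        super().__init__()
        self.layer_ids = layer_ids
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size * len(layer_ids), hidden_size),
            nn.GELU(),
            # One score per dimension: quality, alignment, preservation.
            nn.Linear(hidden_size, 3),
        )

    def forward(self, hidden_states):
        # hidden_states: tuple of [batch, seq_len, hidden_size] tensors,
        # one per layer, as returned with output_hidden_states=True.
        pooled = [hidden_states[i].mean(dim=1) for i in self.layer_ids]
        return self.mlp(torch.cat(pooled, dim=-1))


@torch.no_grad()
def score_edit(mllm, processor, head, src_img, instruction, edited_img):
    # Pack the (source, instruction, edited) triplet into one multimodal
    # prompt; the exact packing convention is an assumption.
    inputs = processor(images=[src_img, edited_img], text=instruction,
                       return_tensors="pt")
    out = mllm(**inputs, output_hidden_states=True)
    return head(out.hidden_states)  # predicted MOS per dimension
```

In training, such a head would presumably be fit to the TIEdit MOS labels (e.g., with an MSE or rank-correlation objective) while the multimodal backbone stays frozen, which is what makes intermediate-layer probing cheap relative to fine-tuning the full model.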