EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization

arXiv cs.CV / 4/10/2026

Key Points

The paper identifies three systematic failure modes in VLM-generated image-editing instructions (orientation inconsistency, viewpoint ambiguity, and missing fine-grained attribute details) and reports that over 47% of instructions from strong baseline VLMs contain critical errors that make them unusable for downstream training.
It proposes EditCaption, a scalable two-stage post-training pipeline. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy.
In the second stage, the method collects 10K human preference pairs specifically targeting the three failure modes and applies Direct Preference Optimization (DPO) to improve alignment beyond SFT.
On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL variants outperform open-source baselines; the 235B model scores 4.712 on Eval-400 and 4.588 on ByteMorph-Bench. In human evaluation, critical errors fall from 47.75% to 23% and correctness rises from 41.75% to 66%.
Overall, EditCaption presents a practical route to producing high-quality, human-aligned instruction synthesis data for scaling instruction-guided image editing models.
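
The preference-optimization stage described above can be illustrated with the standard DPO objective for a single preference pair. This is a minimal sketch, not the paper's training code: the function name `dpo_loss`, the argument names, and the default `beta=0.1` are assumptions for illustration, and the summary does not state which hyperparameters the authors used. Each argument is the summed log-probability that a model assigns to an instruction, where "chosen" is the instruction the annotator preferred and "rejected" is the one with, for example, a left/right error.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    beta is a hypothetical default; the source does not report the value used.
    """
    # How much more the policy favors each instruction than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives log(2); the loss drops as the
# policy shifts probability toward the human-preferred instruction.
baseline = dpo_loss(-12.0, -12.0, -12.0, -12.0)
improved = dpo_loss(-10.0, -14.0, -12.0, -12.0)
```

The loss depends only on the difference between the two log-ratios, so it rewards the policy for separating the preferred and rejected instructions relative to the SFT reference rather than for raising likelihood overall.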

Abstract

High-quality training triplets (source-target image pairs with precise editing instructions) are a critical bottleneck for scaling instruction-guided image editing models. Vision-language models (VLMs) are widely used for automated instruction synthesis, but we identify three systematic failure modes in image-pair settings: orientation inconsistency (e.g., left/right confusion), viewpoint ambiguity, and insufficient fine-grained attribute description. Human evaluation shows that over 47% of instructions from strong baseline VLMs contain critical errors that make them unusable for downstream training. We propose EditCaption, a scalable two-stage post-training pipeline for VLM-based instruction synthesis. Stage 1 builds a 100K supervised fine-tuning (SFT) dataset by combining GLM automatic annotation, EditScore-based filtering, and human refinement for spatial, directional, and attribute-level accuracy. Stage 2 collects 10K human preference pairs targeting the three failure modes and applies direct preference optimization (DPO) for alignment beyond SFT alone. On Eval-400, ByteMorph-Bench, and HQ-Edit, fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%. The work offers a practical path to scalable, human-aligned instruction synthesis for image editing data.
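
To put the human-evaluation figures in perspective, the short calculation below converts the reported percentages into percentage-point and relative changes. It uses only the numbers quoted in the abstract; the helper name `relative_change` is an illustrative assumption, and the derived percentages are not figures reported by the authors.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means a decrease)."""
    return (after - before) / before


# Human-evaluation rates quoted in the abstract, in percent.
critical_before, critical_after = 47.75, 23.0
correct_before, correct_after = 41.75, 66.0

critical_drop_points = critical_before - critical_after   # percentage points
correct_gain_points = correct_after - correct_before      # percentage points

critical_rel = relative_change(critical_before, critical_after) * 100
correct_rel = relative_change(correct_before, correct_after) * 100

print(f"Critical errors: -{critical_drop_points:.2f} points ({critical_rel:.1f}% relative)")
print(f"Correctness:     +{correct_gain_points:.2f} points ({correct_rel:+.1f}% relative)")
```

Read this way, the critical-error rate is roughly halved, while nearly a quarter of the evaluated instructions still contain a critical error after both training stages.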