EditCaption: Human-Aligned Instruction Synthesis for Image Editing via Supervised Fine-Tuning and Direct Preference Optimization
arXiv cs.CV / 4/10/2026
Key Points
- The paper identifies three common, systematic failure modes in VLM-generated image-editing instructions—orientation inconsistencies, viewpoint ambiguity, and missing fine-grained attribute details—and reports that over 47% of baseline VLM instructions contain critical errors for downstream training.
- It proposes EditCaption, a scalable two-stage post-training pipeline that first constructs a 100K-example supervised fine-tuning (SFT) dataset via automatic annotation, EditScore-based filtering, and human refinement targeting spatial, directional, and attribute accuracy.
- In the second stage, the method collects 10K human preference pairs specifically targeting the three failure modes and applies Direct Preference Optimization (DPO) to improve alignment beyond SFT.
- Experiments on Eval-400, ByteMorph-Bench, and HQ-Edit show fine-tuned Qwen3-VL variants outperform open-source baselines, with the 235B model achieving strong benchmark results and substantially reducing critical errors (47.75% → 23%) while increasing correctness (41.75% → 66%).
- Overall, EditCaption presents a practical route to producing high-quality, human-aligned instruction synthesis data for scaling instruction-guided image editing models.
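The second stage described above applies Direct Preference Optimization to the human preference pairs. As a rough illustration of the underlying objective (this is a generic DPO loss sketch, not code from the paper; the `beta` value and the toy log-probabilities are hypothetical):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    The policy is pushed to widen the log-probability margin of the
    chosen (human-preferred) instruction over the rejected one,
    relative to a frozen reference model (e.g. the SFT checkpoint).
    beta is a hypothetical temperature hyperparameter.
    """
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)): small when the policy already prefers "chosen"
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy sequence log-probs (sums over tokens); values are illustrative only.
loss_good = dpo_loss(-10.0, -14.0, -12.0, -12.0)  # policy prefers chosen
loss_bad = dpo_loss(-14.0, -10.0, -12.0, -12.0)   # policy prefers rejected
print(f"{loss_good:.4f} < {loss_bad:.4f}")        # lower loss when aligned
```

In the paper's setting, the preference pairs are chosen to contrast instructions that exhibit one of the three failure modes (orientation, viewpoint, attribute detail) against corrected versions, so this margin directly penalizes those errors.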